Many years ago I used the random forest methodology in RStudio to calculate feature importances on the Coffee Quality dataset. Random forests have been in use since Breiman introduced them in 2001. In this post I rank the feature importances not in RStudio, but in Python.
import pandas as pd
# load the coffee ratings CSV file into a DataFrame
df = pd.read_csv('coffee_cleaned_RF.csv')
df
# Basic statistical measures for each numeric column
df.describe()
df.dtypes
df.shape
duplicate_rows_df = df[df.duplicated()]
print("Number of duplicate rows:", duplicate_rows_df.shape[0])
df.count()
print(df.isnull().sum())
coffee_df = df[['total_cup_points',
'species',
'country_of_origin',
'variety',
'aroma',
'aftertaste',
'acidity',
'body',
'balance',
'sweetness',
'altitude_mean_meters',
'moisture']]
coffee_df = coffee_df.dropna()
coffee_df
# Encode the categorical columns as integers so the model can use them
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
coffee_df["species"] = ord_enc.fit_transform(coffee_df[["species"]])
coffee_df["country_of_origin"] = ord_enc.fit_transform(coffee_df[["country_of_origin"]])
coffee_df["variety"] = ord_enc.fit_transform(coffee_df[["variety"]])
%pip install seaborn
# We can examine, for example, the total_cup_points metric and find which variables have the most impact
import seaborn
seaborn.pairplot(coffee_df.drop('total_cup_points', axis = 1))
# Drop rows with zero aroma or acidity scores (likely invalid entries)
coffee_df = coffee_df[coffee_df['aroma'] > 0]
coffee_df = coffee_df[coffee_df['acidity'] > 0]
seaborn.pairplot(coffee_df)
seaborn.heatmap(coffee_df.corr(), xticklabels=coffee_df.columns, yticklabels=coffee_df.columns)
%pip install numpy
# Create an n_rating segmentation with values 1 (low), 2 (average) and 3 (good)
import numpy as np
rating_pctile = np.percentile(coffee_df['total_cup_points'], [75, 90])
rating_pctile
coffee_df['n_rating'] = 0
coffee_df['n_rating'] = np.where(coffee_df['total_cup_points'] < rating_pctile[0], 1, coffee_df['n_rating'])
coffee_df['n_rating'] = np.where((coffee_df['total_cup_points'] >= rating_pctile[0]) & (coffee_df['total_cup_points'] <= rating_pctile[1]), 2, coffee_df['n_rating'])
coffee_df['n_rating'] = np.where(coffee_df['total_cup_points'] > rating_pctile[1], 3, coffee_df['n_rating'])
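As an aside, the three np.where calls above can be collapsed into a single np.select; a minimal equivalent sketch:

# Same three-way split in one call: below the 75th percentile -> 1,
# above the 90th -> 3, everything in between -> 2
conditions = [
    coffee_df['total_cup_points'] < rating_pctile[0],
    coffee_df['total_cup_points'] > rating_pctile[1],
]
coffee_df['n_rating'] = np.select(conditions, [1, 3], default=2)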
X = coffee_df.drop(['total_cup_points', 'n_rating', 'sweetness', 'species', 'altitude_mean_meters'], axis = 1)
y = coffee_df['n_rating']
# n rating column has been created
coffee_df
# Split the data into training and testing sets (75/25)
import sklearn.model_selection as model_selection
training, testing, training_labels, testing_labels = model_selection.train_test_split(X, y, test_size = .25, random_state = 42)
from sklearn.preprocessing import StandardScaler
# Standardize the data: fit the scaler on the training set only,
# then apply the same transformation to the test set
sc = StandardScaler()
normed_train_data = pd.DataFrame(sc.fit_transform(training), columns = X.columns)
normed_test_data = pd.DataFrame(sc.transform(testing), columns = X.columns)
# Note: random forests are insensitive to feature scale, so the unscaled
# features are used for fitting below
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(training, training_labels)
preds = clf.predict(testing)
# The model scores 100% accuracy on the training data and a lower 86.75% on the testing data.
# The gap is large, so it is safe to say the model is overfitting: it models the
# training data too well and does not generalize what it has learned.
print(clf.score(training, training_labels))
print(clf.score(testing, testing_labels))
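A single train/test split can be a noisy estimate, so k-fold cross-validation is a useful sanity check here (see the scikit-learn cross-validation docs in the references). A minimal sketch, with 5 folds chosen arbitrarily:

from sklearn.model_selection import cross_val_score
# Refit and score the forest on 5 different train/validation splits;
# the mean is a less optimistic estimate than the training accuracy
cv_scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print(cv_scores.mean(), cv_scores.std())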
from sklearn.metrics import confusion_matrix
# Rows are the true labels and columns the predicted labels, both ordered
# 'low' (1), 'average' (2), 'good' (3)
# Numbers on the diagonal (185, 19 and 14) are the coffees the model classified correctly
# 11 coffees that are actually 'average' were misclassified as 'low'
confusion_matrix(testing_labels, preds, labels=[1, 2, 3])
array([[185,   4,   1],
       [ 11,  19,   2],
       [  0,  13,  14]])
# Aftertaste, balance and acidity contribute the most to total cup points
pd.DataFrame(clf.feature_importances_, index=training.columns).sort_values(by=0, ascending=False)
                          0
aftertaste         0.221273
balance            0.208144
acidity            0.188507
aroma              0.139205
body               0.109962
country_of_origin  0.049313
variety            0.043609
moisture           0.039987
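Impurity-based importances such as feature_importances_ can be biased toward features with many distinct values (country_of_origin, for example), so permutation importance on the held-out set is a common cross-check. A minimal sketch with scikit-learn:

from sklearn.inspection import permutation_importance
# Shuffle each feature in the test set and record the drop in accuracy;
# bigger drops mean the model relies more on that feature
perm = permutation_importance(clf, testing, testing_labels, n_repeats=10, random_state=42)
pd.DataFrame({'importance': perm.importances_mean},
             index=testing.columns).sort_values('importance', ascending=False)

If the permutation ranking broadly agrees with the table above, it strengthens the aftertaste/balance/acidity conclusion.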
References:
https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
https://machinelearningmastery.com/calculate-feature-importance-with-python/
https://towardsdatascience.com/3-essential-ways-to-calculate-feature-importance-in-python-2f9149592155
https://medium.com/swlh/feature-importance-hows-and-why-s-3678ede1e58f
https://towardsdatascience.com/machine-learning-with-python-regression-complete-tutorial-47268e546cea
https://www.analyticsvidhya.com/blog/2021/04/k-means-clustering-simplified-in-python/
https://realpython.com/k-means-clustering-python/
https://www.datacamp.com/community/tutorials/random-forests-classifier-python
https://scikit-learn.org/stable/modules/cross_validation.html
https://campus.datacamp.com/courses/customer-analytics-and-ab-testing-in-python/key-performance-indicators-measuring-business-success?ex=10
https://towardsdatascience.com/a-practical-guide-to-implementing-a-random-forest-classifier-in-python-979988d8a263
https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-07/readme.md
https://github.com/jldbc/coffee-quality-database
https://medium.com/@julie.yin/understanding-the-data-splitting-functions-in-scikit-learn-9ae4046fbd26
https://seaborn.pydata.org/tutorial/color_palettes.html
https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
https://medium.com/wids-mysore/handling-missing-values-82ce096c0cef
https://en.wikipedia.org/wiki/Anomaly_detection
https://scikit-learn.org/stable/modules/outlier_detection.html