Many years ago I used the random forest methodology in RStudio to calculate feature importances for a coffee data set. Random forests have been around for a long time; Leo Breiman introduced them in 2001. In this post I compute and sort the feature importances in Python instead of RStudio.
import pandas as pd
# load the provided CSV file
df = pd.read_csv('coffee_cleaned_RF.csv')
df
# Basic statistical measures for each metric
df.describe()
df.dtypes
df.shape
duplicate_rows_df = df[df.duplicated()]
print("Number of duplicate rows:", duplicate_rows_df.shape[0])
df.count()
print(df.isnull().sum())
coffee_df = df[['total_cup_points',
'species',
'country_of_origin',
'variety',
'aroma',
'aftertaste',
'acidity',
'body',
'balance',
'sweetness',
'altitude_mean_meters',
'moisture']]
coffee_df = coffee_df.dropna()
coffee_df
from sklearn.preprocessing import OrdinalEncoder
# encode the categorical columns as integers so the classifier can use them
ord_enc = OrdinalEncoder()
coffee_df["species"] = ord_enc.fit_transform(coffee_df[["species"]])
coffee_df["country_of_origin"] = ord_enc.fit_transform(coffee_df[["country_of_origin"]])
coffee_df["variety"] = ord_enc.fit_transform(coffee_df[["variety"]])
%pip install seaborn
# examine, for example, the total_cup_points metric to see which variables have the most impact
import seaborn
seaborn.pairplot(coffee_df.drop('total_cup_points', axis = 1))
# remove rows with zero aroma or acidity scores before re-plotting
coffee_df = coffee_df[coffee_df['aroma'] > 0]
coffee_df = coffee_df[coffee_df['acidity'] > 0]
seaborn.pairplot(coffee_df)
seaborn.heatmap(coffee_df.corr(), xticklabels=coffee_df.columns, yticklabels=coffee_df.columns)
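The default heatmap is hard to read without the actual correlation values; a sketch of an annotated variant, assuming matplotlib is available (it is a seaborn dependency):

import matplotlib.pyplot as plt
# annotated correlation heatmap: larger figure, values printed in each cell
plt.figure(figsize=(10, 8))
seaborn.heatmap(coffee_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()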
%pip install numpy
# Create an n_rating segmentation with values 1 (low), 2 (average) and 3 (good)
import numpy as np
# segment boundaries: the 75th and 90th percentiles of total_cup_points
rating_pctile = np.percentile(coffee_df['total_cup_points'], [75, 90])
rating_pctile
coffee_df['n_rating'] = 0
coffee_df['n_rating'] = np.where(coffee_df['total_cup_points'] < rating_pctile[0], 1, coffee_df['n_rating'])
coffee_df['n_rating'] = np.where((coffee_df['total_cup_points'] >= rating_pctile[0]) & (coffee_df['total_cup_points'] <= rating_pctile[1]), 2, coffee_df['n_rating'])
coffee_df['n_rating'] = np.where(coffee_df['total_cup_points'] > rating_pctile[1], 3, coffee_df['n_rating'])
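The three np.where calls can also be written as a single np.select, which checks the conditions in order and is exactly equivalent to the logic above; a sketch (n_rating_alt is a hypothetical column name, not part of the original analysis):

# sketch: same segmentation in one step; conditions are evaluated in order,
# so the second one effectively means >= p75 and <= p90
conditions = [
    coffee_df['total_cup_points'] < rating_pctile[0],
    coffee_df['total_cup_points'] <= rating_pctile[1],
]
coffee_df['n_rating_alt'] = np.select(conditions, [1, 2], default=3)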
X = coffee_df.drop(['total_cup_points', 'n_rating', 'sweetness', 'species', 'altitude_mean_meters'], axis = 1)
y = coffee_df['n_rating']
# the n_rating column has been created
coffee_df
# split the data into training (75%) and testing (25%) sets
import sklearn.model_selection as model_selection
training, testing, training_labels, testing_labels = model_selection.train_test_split(X, y, test_size = .25, random_state = 42)
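With three unbalanced classes (the 'low' class dominates, as the confusion matrix below shows), adding stratify=y would keep the class proportions identical in both splits. A sketch, kept under separate names so the numbers reported below still refer to the unstratified split:

# stratified variant of the same split (hypothetical, not used in the run below)
tr_s, te_s, trl_s, tel_s = model_selection.train_test_split(
    X, y, test_size = .25, random_state = 42, stratify = y)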
from sklearn.preprocessing import StandardScaler
# Standardize the features: fit the scaler on the training set only,
# then apply that same fitted transformation to the test set
sc = StandardScaler()
normed_train_data = pd.DataFrame(sc.fit_transform(training), columns = X.columns)
normed_test_data = pd.DataFrame(sc.transform(testing), columns = X.columns)
# (tree-based models are insensitive to feature scaling, so the forest below
# is trained on the unscaled data)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(training, training_labels)
preds = clf.predict(testing)
# The model scores 100% accuracy on the training data but only 86.75% on the testing data.
# That gap is a sign of overfitting: the model fits the training data too closely
# and does not generalize what it has learned.
print (clf.score(training, training_labels))
print(clf.score(testing, testing_labels))
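A single 75/25 split can be a noisy estimate; k-fold cross-validation gives a steadier picture of the generalization gap, and capping tree depth is one common way to curb the overfitting. A sketch, not part of the original run (the max_depth and min_samples_leaf values are illustrative guesses):

from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of a default forest
cv_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
# a shallower, regularized forest trades training accuracy for generalization
clf_reg = RandomForestClassifier(max_depth=8, min_samples_leaf=5, random_state=42)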
from sklearn.metrics import confusion_matrix
# rows are true labels, columns are predicted labels
# 1st row/column - coffees with a 'low' rating
# 2nd row/column - coffees with an 'average' rating
# 3rd row/column - coffees with a 'good' rating
# The diagonal entries 185, 19 and 14 are the coffees the model classified correctly
# 11 coffees with a true 'average' rating were misclassified as 'low'
confusion_matrix(testing_labels, preds, labels = [1, 2, 3])
array([[185,   4,   1],
       [ 11,  19,   2],
       [  0,  13,  14]])
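Per-class precision and recall make the same pattern explicit; a short sketch using classification_report:

from sklearn.metrics import classification_report
# precision/recall/F1 per rating class
print(classification_report(testing_labels, preds, labels=[1, 2, 3],
                            target_names=['low', 'average', 'good']))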
# aftertaste, balance and acidity contribute the most to total_cup_points
pd.DataFrame(clf.feature_importances_, index=training.columns).sort_values(by=0, ascending=False)
aftertaste 0.221273
balance 0.208144
acidity 0.188507
aroma 0.139205
body 0.109962
country_of_origin 0.049313
variety 0.043609
moisture 0.039987
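Impurity-based importances like the ones above can be biased toward high-cardinality features such as country_of_origin; permutation importance computed on the held-out set is a common cross-check. A sketch:

from sklearn.inspection import permutation_importance
# shuffle each feature on the test set and measure the resulting drop in accuracy
result = permutation_importance(clf, testing, testing_labels, n_repeats=10, random_state=42)
pd.Series(result.importances_mean, index=testing.columns).sort_values(ascending=False)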