
Python - Classification - Random Forest

Many years ago I used the random forest method in RStudio to calculate feature importance for the coffee data set. Random forests have been in use for a long time; Leo Breiman introduced the method in 2001. In this post I rank the feature importances again, this time in Python rather than RStudio.


import pandas as pd


# load the provided csv file

df = pd.read_csv('coffee_cleaned_RF.csv')


df


# Basic statistical measures for each metric

df.describe()


df.dtypes


df.shape


duplicate_rows_df = df[df.duplicated()]

print("number of duplicate rows: ", duplicate_rows_df.shape[0])


df.count()


print(df.isnull().sum())


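# Keep the target, the categorical columns and the numeric columns used below
coffee_df = df[['total_cup_points', 'species', 'country_of_origin', 'variety',
                'aroma', 'aftertaste', 'acidity', 'body', 'balance',
                'sweetness', 'altitude_mean_meters', 'moisture']]

# Drop rows with missing values
coffee_df = coffee_df.dropna()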


coffee_df


from sklearn.preprocessing import OrdinalEncoder

# Encode the categorical columns as numeric codes, one code per category
cat_cols = ["species", "country_of_origin", "variety"]
ord_enc = OrdinalEncoder()
coffee_df[cat_cols] = ord_enc.fit_transform(coffee_df[cat_cols])
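A single OrdinalEncoder can handle several columns in one call, and its categories_ attribute keeps the mapping from codes back to labels. The following is a minimal sketch on a made-up frame; the rows and the names toy and mapping are illustrative assumptions, not taken from the coffee file.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; the values are illustrative, not taken from coffee_cleaned_RF.csv
toy = pd.DataFrame({
    "species": ["Arabica", "Robusta", "Arabica"],
    "variety": ["Bourbon", "Typica", "Caturra"],
})

cat_cols = ["species", "variety"]
enc = OrdinalEncoder()
# One fit for all categorical columns; categories are sorted per column
toy[cat_cols] = enc.fit_transform(toy[cat_cols])

# categories_[i] lists the labels of cat_cols[i]; a code is the position in that list
mapping = {col: list(cats) for col, cats in zip(cat_cols, enc.categories_)}
print(toy)
print(mapping)
```

Keeping the mapping around makes it possible to translate a code such as country_of_origin = 3.0 back into a readable label when interpreting plots or importances.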


pip install seaborn



# As an example, we can take the metric total_cup_points as the target and look for the variables that have the most impact on it


import seaborn

seaborn.pairplot(coffee_df.drop('total_cup_points', axis = 1))


# Keep only rows with positive aroma and acidity scores

coffee_df = coffee_df[coffee_df['aroma']>0]

coffee_df = coffee_df[coffee_df['acidity']>0]

seaborn.pairplot(coffee_df)


seaborn.heatmap(coffee_df.corr(), xticklabels=coffee_df.columns, yticklabels=coffee_df.columns)


pip install numpy


# Create an n_rating segmentation with the values 1 (low), 2 (average) and 3 (good)


import numpy as np


rating_pctile = np.percentile(coffee_df['total_cup_points'], [75, 90])


rating_pctile


coffee_df['n_rating'] = 0

coffee_df['n_rating'] = np.where(coffee_df['total_cup_points'] < rating_pctile[0], 1, coffee_df['n_rating'])

coffee_df['n_rating'] = np.where((coffee_df['total_cup_points'] >= rating_pctile[0]) & (coffee_df['total_cup_points'] <= rating_pctile[1]), 2, coffee_df['n_rating'])

coffee_df['n_rating'] = np.where(coffee_df['total_cup_points'] > rating_pctile[1], 3, coffee_df['n_rating'])
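The three np.where calls can also be written as one np.select call, which checks its conditions in order and takes the first match. This is only a sketch of that alternative on made-up scores; the array scores stands in for coffee_df['total_cup_points'] and is an assumption for illustration.

```python
import numpy as np

# Made-up scores standing in for coffee_df['total_cup_points']
scores = np.array([78.0, 80.5, 81.0, 82.0, 82.5, 83.0, 83.5, 84.0, 85.0, 88.0])

# Same cut points as in the post: the 75th and 90th percentiles
p75, p90 = np.percentile(scores, [75, 90])

# Conditions are checked in order; the first one that matches wins
conditions = [scores < p75, scores <= p90]
n_rating = np.select(conditions, [1, 2], default=3)

print(p75, p90)
print(n_rating)
```

Because the cut points are the 75th and 90th percentiles, roughly 75% of the rows land in class 1, 15% in class 2 and 10% in class 3, so the classes are imbalanced by construction.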


X = coffee_df.drop(['total_cup_points', 'n_rating', 'sweetness', 'species', 'altitude_mean_meters'], axis = 1)

y = coffee_df['n_rating']


# the n_rating column has been created

coffee_df


# Split the data into training (75%) and testing (25%) sets

import sklearn.model_selection as model_selection


training, testing, training_labels, testing_labels = model_selection.train_test_split(X, y, test_size = .25, random_state = 42)
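Since the rating classes are imbalanced (the percentile cut points put most coffees in the low class), it may be worth passing stratify to train_test_split so that each split keeps the same class mix. A minimal sketch with synthetic labels, which are an assumption and not the coffee data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the 75/15/10 class mix that the cut points produce
y = np.array([1] * 75 + [2] * 15 + [3] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Class shares in the test split mirror those of the full label vector
print(np.bincount(y_test)[1:] / len(y_test))
```

Without stratify, a small class can by chance be under-represented in the test set, which makes the per-class numbers in a confusion matrix less stable.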


from sklearn.preprocessing import StandardScaler

# Standardize the data: fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
normed_train_data = pd.DataFrame(sc.fit_transform(training), columns = X.columns)
normed_test_data = pd.DataFrame(sc.transform(testing), columns = X.columns)

# Note: the random forest below is trained on the unscaled data; tree-based models do not need feature scaling
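The scaler should learn its mean and standard deviation from the training split only and then reuse them on the test split, otherwise information from the test data leaks into preprocessing. A small sketch on random numbers (the arrays train and test are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=10.0, scale=2.0, size=(30, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training split
test_scaled = scaler.transform(test)        # reuse them; do not refit on the test split

# Training columns are centred by construction; test columns are only close to zero
print(train_scaled.mean(axis=0).round(6))
print(test_scaled.mean(axis=0).round(2))
```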


from sklearn.ensemble import RandomForestClassifier


clf=RandomForestClassifier()

clf.fit(training, training_labels)


preds = clf.predict(testing)


# The model scores 100% accuracy on the training data and a lower 86.75% on the testing data.

# A gap this large suggests the model is overfitting: it fits the training data too closely and does not generalize everything it has learned.


print(clf.score(training, training_labels))

print(clf.score(testing, testing_labels))
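One way to look at the gap between training and testing accuracy is to compare a default forest with a more constrained one, using cross-validation on the training split as a more honest estimate. This is a sketch on synthetic data; the data set and the values max_depth=4 and min_samples_leaf=5 are assumptions for illustration, not tuned settings for the coffee data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic three-class data standing in for the coffee features
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

results = []
for params in ({}, {"max_depth": 4, "min_samples_leaf": 5}):
    clf = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    # 5-fold cross-validated accuracy, computed on the training split only
    cv_acc = cross_val_score(clf, X_train, y_train, cv=5).mean()
    clf.fit(X_train, y_train)
    results.append({
        "params": params,
        "train": clf.score(X_train, y_train),
        "cv": cv_acc,
        "test": clf.score(X_test, y_test),
    })

for row in results:
    print(row)
```

A fully grown forest usually scores close to 100% on its own training data, so the training score says little by itself; the cross-validated and test scores are the ones to compare between settings. Setting random_state also makes the numbers reproducible from run to run.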


from sklearn.metrics import confusion_matrix

# Rows are the actual ratings, columns are the predicted ratings (order: low, average, good)
# 1st column - coffees predicted as ‘low’
# 2nd column - coffees predicted as ‘average’
# 3rd column - coffees predicted as ‘good’
# Numbers on the diagonal (185, 19 and 14) - the number of coffees the model has classified correctly
# 11 coffees - actually ‘average’, but predicted as ‘low’

confusion_matrix(testing_labels, preds, labels = [1, 2, 3])


array([[185,   4,   1],
       [ 11,  19,   2],
       [  0,  13,  14]])
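Accuracy, recall and precision per class can be read off this matrix directly. In scikit-learn's convention the rows are the actual classes and the columns the predicted ones. The sketch below reuses the nine numbers printed above. The diagonal sums to 218 of 249 test coffees, about 87.6%, which is slightly above the 86.75% quoted earlier; the forest is not seeded, so the two figures may simply come from different runs (an assumption, not something the post states).

```python
import numpy as np

# Confusion matrix from the post: rows = actual class, columns = predicted class
cm = np.array([
    [185,  4,  1],   # actual low
    [ 11, 19,  2],   # actual average
    [  0, 13, 14],   # actual good
])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)      # share of each actual class that was found
precision = np.diag(cm) / cm.sum(axis=0)   # share of each predicted class that is correct

print(round(accuracy, 4))
print(recall.round(3))
print(precision.round(3))
```

The recall values show that the model finds almost all low-rated coffees but only a little more than half of the average and good ones, which the overall accuracy alone hides.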



# aftertaste, balance and acidity contribute the most to the rating class derived from total cup points

pd.DataFrame(clf.feature_importances_, index=training.columns).sort_values(by=0, ascending=False)


aftertaste           0.221273
balance              0.208144
acidity              0.188507
aroma                0.139205
body                 0.109962
country_of_origin    0.049313
variety              0.043609
moisture             0.039987
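The values in feature_importances_ are impurity-based and computed from the training data, so it can be useful to cross-check them with permutation importance on held-out data. The following is a sketch on synthetic data; the feature names f0 to f5 are hypothetical and the data set is not the coffee file.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical feature names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

comparison = pd.DataFrame({
    "impurity_importance": clf.feature_importances_,
    "permutation_importance": result.importances_mean,
}, index=names).sort_values("permutation_importance", ascending=False)
print(comparison)
```

If both columns rank the same features at the top, the ranking is more trustworthy than either measure alone.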











