
Python - Score comparison and feature importance

The purpose of this contribution is to compare modeling and scoring methodologies on a given open-source e-commerce sample file, select the best-performing methodology, and identify the top 10 features that impact customer churn.


Compared models: random forest, XGBoost, KNN, CatBoost, ...

Compared scores: recall, precision, F1, ROC AUC, balanced accuracy, true negatives, true positives, ...


In this case the tree-based models performed best; a tuned CatBoost model, with an F1 train score of 0.98, outperformed the other models. We therefore used CatBoost to calculate the top features impacting customer churn, which turned out to be: Tenure, Complain and Number of Addresses.





1. Theory


adaboost - adaptive boosting, used in machine learning as an ensemble method; a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995. The most common boosting algorithms are AdaBoost, gradient boosting and XGBoost (extreme gradient boosting). Boosting combines multiple weak models into a stronger one, and correct and wrong predictions receive different weights from round to round.
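
A minimal AdaBoostClassifier sketch in scikit-learn, on synthetic data rather than the churn file; algorithm="SAMME" is the multi-class variant defined later in this glossary.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each boosting round re-weights the training examples so that
# misclassified ones get more attention in the next round.
clf = AdaBoostClassifier(n_estimators=100, algorithm="SAMME", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))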


arbitrary encoding method - an encoding method in ordinal encoding: categories are numbered arbitrarily, so the resulting numbers have only relative meaning, not absolute meaning.
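
A minimal sketch of arbitrary ordinal encoding with the Feature-engine library referenced below; the toy DataFrame and its column name are made up.

import pandas as pd
from feature_engine.encoding import OrdinalEncoder

df = pd.DataFrame({"city": ["Paris", "London", "Paris", "Rome"]})

# encoding_method="arbitrary" numbers categories in order of appearance,
# so the resulting integers carry no absolute meaning.
enc = OrdinalEncoder(encoding_method="arbitrary", variables=["city"])
print(enc.fit_transform(df))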


attrition rate - Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.


balanced accuracy score - the balanced accuracy is used in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of the recall obtained on each class: balanced accuracy score = (Sensitivity + Specificity) / 2 = (TPR + TNR) / 2. Use the balanced accuracy score when the classes are imbalanced, where plain accuracy would reward a model that simply predicts the majority class.
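
Computing it with scikit-learn; the labels below are hypothetical.

from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# TPR = 1/2, TNR = 3/4, so balanced accuracy = (0.5 + 0.75) / 2 = 0.625
print(balanced_accuracy_score(y_true, y_pred))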


ball tree - In computer science, a ball tree, balltree or metric tree, is a space partitioning data structure for organising points in a multi-dimensional space.


bootstrap - a random forest hyperparameter telling the model to sample observations with or without replacement; with bootstrap=False it still samples, just without replacement (see the random forest sketch below).


brute - brute force means trying each option and choosing the best fit; machine learning as a whole is not, by any means, brute force. 'brute' is one of the values of the algorithm hyperparameter in e.g. KNeighborsClassifier, where distances are computed directly against every training point (see the KNN sketch below).


catboost classifier - CatBoost is an algorithm for gradient boosting on decision trees. It is developed by Yandex researchers and engineers, and is used for search, recommendation systems, personal assistant, self-driving cars, weather prediction and many other tasks at Yandex and in other companies, including CERN, Cloudflare, Careem taxi.
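
A minimal CatBoostClassifier sketch on synthetic data; the parameter values are illustrative, not the tuned settings behind the 0.98 F1 score reported above.

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

model = CatBoostClassifier(iterations=200, depth=6, learning_rate=0.1, verbose=0)
model.fit(X, y)
# Built-in feature importances (the post's top features were derived via SHAP).
print(model.get_feature_importance()[:5])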


categorical encoding - one-hot encoder, count frequency encoder, ordinal encoder, mean encoder, WoEEncoder, PRatioEncoder, DecisionTreeEncoder, RareLabelEncoder, ...


clf - We call our estimator instance clf, as it is a classifier.


customer churn - The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity.


CV - CV just means cross-validation. It's a way of using all of your available training data to inform your model, while also using that data to make predictions.
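
For example, 5-fold cross-validation with scikit-learn; the estimator and scoring choice are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each fold is held out once for scoring while the rest trains the model.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=5, scoring="f1")
print(scores.mean())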


f1 score - a combination of the recall and precision scores; the F1 score is better at addressing imbalanced datasets. It combines precision and recall into one metric by calculating the harmonic mean of the two, and is actually a special case of the more general F-beta function: f1 = 2 * (precision * recall) / (precision + recall), which lies in the interval between 0 and 1. The F1 score stays low when either of the two inputs (precision / recall) is low.
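
A quick check of the formula on hypothetical labels:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1]

p = precision_score(y_true, y_pred)  # 0.75
r = recall_score(y_true, y_pred)     # 0.75
# The harmonic mean matches sklearn's f1_score.
print(2 * p * r / (p + r), f1_score(y_true, y_pred))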


Feature-engine - Python library with multiple transformers to engineer features for use in machine learning models. Feature-engine preserves Scikit-learn functionality with methods fit() and transform() to learn parameters from and then transform the data.


gini - the Gini coefficient is a measure of statistical dispersion intended to represent the income inequality or the wealth inequality within a nation or a social group; it was developed by the statistician and sociologist Corrado Gini. In decision trees, 'gini' refers to the related Gini impurity, the default criterion measuring how mixed the classes in a node are.
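
A minimal sketch of the Gini impurity a tree evaluates at each split; the label arrays are hypothetical.

import numpy as np

def gini_impurity(labels):
    # 1 - sum(p_k^2) over the class proportions p_k; 0 means a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))  # 0.5, maximally mixed two-class node
print(gini_impurity([1, 1, 1, 1]))  # 0.0, pure node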


gridsearchcv - GridSearchCV is a technique to search for the best parameter values within a given grid of parameters. It is basically a cross-validation method: the model and the parameter grid are fed in, the best parameter values are extracted, and predictions are then made with them.
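
A minimal GridSearchCV sketch; the grid values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

grid = {"n_estimators": [100, 300], "max_depth": [3, 6, None]}
# Every combination in the grid is cross-validated; the best one is kept.
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)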


kd tree - In computer science, a k-d tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches) and creating point clouds. k-d trees are a special case of binary space partitioning trees.


KNN - In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set.
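
A KNeighborsClassifier sketch tying together several entries above: the algorithm parameter accepts the 'ball_tree', 'kd_tree' and 'brute' strategies, and weights='uniform' weights all k neighbors equally.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree",
                           weights="uniform")
knn.fit(X, y)
print(knn.score(X, y))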


logistic regression - In statistics, the (binary) logistic model (or logit model) is a statistical model that models the probability of one event (out of two alternatives) taking place by having the log-odds (the logarithm of the odds) for the event be a linear combination of one or more independent variables ("predictors"). Logistic Regression (aka logit, MaxEnt) classifier. In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.) This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
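
A minimal LogisticRegression sketch; as noted above, regularization is applied by default.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

logit = LogisticRegression(solver="lbfgs", max_iter=1000)
logit.fit(X, y)
# Probabilities derived from the fitted linear combination of predictors.
print(logit.predict_proba(X[:3]))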


owen value - Owen value is an extension of Shapley value for cooperative games when a particular coalition structure or partition of the set of players is considered in addition.


permutation explainer - The Permutation explainer is model-agnostic, so it can compute Shapley values and Owen values for any model.
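
A sketch following the SHAP documentation referenced below; the underlying random forest here is an arbitrary stand-in for any fitted model.

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic: works with any model exposing a prediction function.
explainer = shap.explainers.Permutation(model.predict_proba, X)
shap_values = explainer(X[:20])
print(shap_values.values.shape)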


precision - in classification, precision is the fraction of positive predictions that are actually positive, TP / (TP + FP); it should not be confused with sensitivity, which is another name for recall. In measurement more generally, accuracy is the degree of closeness between a measurement and its true value, while precision is the degree to which repeated measurements under the same conditions show the same results; there are 3 types of precision estimate, namely repeatability, intermediate precision (or intermediate repeatability) and reproducibility.


random forest classifier - A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
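
A minimal RandomForestClassifier sketch; bootstrap=True (the default) samples observations with replacement for each tree, as described in the bootstrap entry above.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

rf = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_[:5])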


random state - the point of the random state is that train_test_split will return the same split each time, giving consistency to your model. If you don't set a random state (or set it to None), the split will be different every time you run your model: the train and test sets won't always be the same - they will have different values in them.
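
For example, two splits with the same random_state are identical:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print((X_te1 == X_te2).all())  # True: the split is reproducible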


RandomizedSearchCV - randomly samples sets of hyperparameters, calculates the score for each, and outputs the set of hyperparameters that gives the best score. RandomizedSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
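
A minimal sketch: n_iter random combinations are scored instead of the full grid; the distributions below are illustrative.

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

dist = {"n_estimators": randint(100, 500), "max_depth": randint(2, 10)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), dist,
                            n_iter=10, cv=5, scoring="f1", random_state=0)
search.fit(X, y)
print(search.best_params_)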


recall - the model's ability to correctly predict the positives out of all actual positives. This is unlike precision, which measures how many of the model's positive predictions are actually positive.


recall score - In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved) whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search.


roc auc curve - the ROC AUC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1; by analogy, the better it is at distinguishing between patients with the disease and patients without it.


roc auc score - area under the receiver operating characteristic curve (ROC AUC) from prediction scores. In general, an AUC of 0.5 suggests no discrimination (i.e., no ability to diagnose patients with and without the disease or condition based on the test), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding. An AUC of 0.75 means that if we take two data points belonging to separate classes, there is a 75% chance the model will segregate or rank-order them correctly, i.e. the positive point gets a higher predicted probability than the negative one.
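
Checking the rank-ordering interpretation on hypothetical scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (negative, positive) pairs are ranked correctly -> 0.75
print(roc_auc_score(y_true, y_prob))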


SAMME - Stagewise Additive Modeling using a Multi-class Exponential loss function; the multi-class boosting algorithm used e.g. in scikit-learn's AdaBoostClassifier.



scoring - quantifying the quality of predictions.

perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).

perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved).

AUC above 0.85 means high classification accuracy, one between 0.75 and 0.85 moderate accuracy, and one less than 0.75 low accuracy (D' Agostino, Rodgers, & Mauck, 2018).

perfect model has f1 score equal to 1.0; if recall or precision is 0, then f1 is the worst case of 0.

balanced accuracy score, (TPR plus TNR)/2, over 0.9 is very good, between 0.7 and 0.9 is good.


Sensitivity - TPR, the “true positive rate” – the percentage of positive cases the model is able to detect, TP / (TP + FN).


shap - SHAP (Shapley Additive Explanations) values break down a prediction to show the impact of each feature.


shap values - if you use the Python shap library directly, the color pink indicates a high feature value in the data set and the color blue represents a low feature value.
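
A minimal summary-plot sketch; TreeExplainer is SHAP's fast path for tree ensembles, and the gradient boosting model here is an arbitrary stand-in.

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

sv = shap.TreeExplainer(model).shap_values(X)
# Each point is one observation; colour encodes the feature value
# (high vs low) and position encodes its impact on the prediction.
shap.summary_plot(sv, X)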


shapley values - The Shapley value is a solution concept in cooperative game theory. It was named in honor of Lloyd Shapley, who introduced it in 1951 and won the Nobel Memorial Prize in Economic Sciences for it in 2012. To each cooperative game it assigns a unique distribution (among the players) of a total surplus generated by the coalition of all players. The Shapley value is characterized by a collection of desirable properties.


skimage - Skimage provides easy-to-use functions for reading, displaying, and saving images.


Specificity - TNR, the “true negative rate” – the percentage of negative cases the model is able to detect, TN / (TN + FP).


uniform - a uniform distribution is a distribution in which there are equal probabilities across all the values in the set (also known as the continuous uniform). In KNeighborsClassifier, 'uniform' is also the default value of the weights hyperparameter, meaning all neighbors are weighted equally (see the KNN sketch above).


XGBoost - XGBoost, which stands for Extreme Gradient Boosting, is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is the leading machine learning library for regression, classification, and ranking problems.


XGB classifier - XGBClassifier is the scikit-learn-compatible XGBoost machine learning model used to fit the data; in other words, a boosted-tree model for classification.
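
A minimal XGBClassifier sketch; the parameters are illustrative.

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)

xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
xgb.fit(X, y)
print(xgb.predict_proba(X[:3]))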


2. Python code
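
The original notebook output is not reproduced here; below is a minimal sketch of the pipeline the post describes, assuming a CSV with a binary "Churn" target column and some object-typed categorical columns (the file name and column names are hypothetical).

import pandas as pd
import shap
from catboost import CatBoostClassifier
from feature_engine.encoding import OrdinalEncoder
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("ecommerce_churn.csv")  # hypothetical sample file
X, y = df.drop(columns="Churn"), df["Churn"]

# Arbitrary ordinal encoding of the categorical columns (see section 1).
cat_cols = X.select_dtypes(include="object").columns.tolist()
X[cat_cols] = OrdinalEncoder(encoding_method="arbitrary",
                             variables=cat_cols).fit_transform(X[cat_cols])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = CatBoostClassifier(iterations=300, depth=6, verbose=0)
model.fit(X_train, y_train)

# The scores compared across models in this post.
y_pred = model.predict(X_test)
print("f1:", f1_score(y_test, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("roc auc:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Top 10 features by mean absolute SHAP value.
sv = shap.TreeExplainer(model).shap_values(X_test)
importance = pd.Series(abs(sv).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False).head(10))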









3. References



https://medium.com/towards-data-science/keyword-extraction-with-bert-724efca412ea

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

https://www.freecodecamp.org/news/scraping-ecommerce-website-with-python/

https://www.analyticsvidhya.com/blog/2018/09/deep-learning-video-classification-python/

https://towardsdatascience.com/generate-any-sport-highlights-using-python-3695c98baead

https://medium.datadriveninvestor.com/analyzing-video-using-python-opencv-and-numpy-5471cab200c4

https://pypi.org/project/videoanalytics/

https://en.wikipedia.org/wiki/Customer_attrition

https://towardsdatascience.com/data-science-for-e-commerce-with-python-a0a97dd7721d

https://www.kaggle.com/code/izzuddin8803/notebook702220566a

https://www.kaggle.com/datasets?search=ecommerce

https://medium.com/@shrestha_angel/explain-machine-learning-models-through-shap-98aedd19164c

https://towardsai.net/p/l/ecommerce-data-analysis-for-sales-strategy-using-python

https://drive.google.com/drive/folders/1J2xPJn14Koe-yXASUTpk4cjYNb4NuZ5Q

https://en.wikipedia.org/wiki/AdaBoost

https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc

https://www.statology.org/balanced-accuracy-python-sklearn/

https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/explainers/Permutation.html

https://en.m.wikipedia.org/wiki/Shapley_value

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

https://en.wikipedia.org/wiki/Gini_coefficient

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://link.springer.com/article/10.1007/s10100-009-0100-8

https://datacarpentry.org/image-processing/03-skimage-images/

https://en.m.wikipedia.org/wiki/Precision_and_recall

https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/amp/

https://scikit-learn.org/stable/modules/model_evaluation.html

https://www.kaggle.com/general/275064

https://www.tc.columbia.edu/elda/blog/content/receiver-operating-characteristic-roc-area-under-the-curve-auc/

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

https://en.wikipedia.org/wiki/K-d_tree

https://towardsdatascience.com/why-do-we-set-a-random-state-in-machine-learning-models-bb2dc68d8431

https://consultglp.com/wp-content/uploads/2019/02/Types-of-precision-estimates.pdf

https://catboost.ai/

https://pbpython.com/categorical-encoding.html

https://www.analyticsvidhya.com/blog/2021/09/adaboost-algorithm-a-complete-guide-for-beginners/

https://feature-engine.readthedocs.io/en/1.1.x/encoding/OrdinalEncoder.html

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5


