
Python - Binary Bootstrap Sample

The code in this post builds a Yes/No prediction, fitting a model on bootstrap samples to flag mistakes in data entries. Leo Breiman (1928 – 2005) was a statistician at the University of California, Berkeley, who built a bridge between statistics and computer science in machine learning, e.g. classification and regression trees, ensembles of trees, and random forests grown on bootstrap samples. Breiman called bootstrap aggregation "bagging".
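As a quick sketch of the bootstrap idea behind bagging, the toy example below (the 0/1 values are made up) draws several samples with replacement from a small series and averages the per-sample estimates:

import pandas as pd

# Toy data: 1 = mistaken entry, 0 = correct entry (hypothetical values)
data = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# Draw several bootstrap samples (same size, sampled with replacement)
# and estimate the share of mistakes in each one
estimates = []
for seed in range(5):
    boot = data.sample(n=len(data), replace=True, random_state=seed)
    estimates.append(boot.mean())

print(estimates)                        # one estimate per bootstrap sample
print(sum(estimates) / len(estimates))  # bagged (averaged) estimate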



category_encoders - a set of scikit-learn-style transformers for encoding categorical variables into numeric form with different techniques.

ce.CountEncoder()


https://contrib.scikit-learn.org/category_encoders/index.html
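A minimal sketch of what CountEncoder does, on a made-up column: every category is replaced by the number of times it occurs in the data.

import pandas as pd
import category_encoders as ce

# Hypothetical categorical column, just for illustration
toy = pd.DataFrame({'employment_type': ['Full-time', 'Part-time', 'Full-time',
                                        'Contract', 'Full-time']})

enc = ce.CountEncoder()
print(enc.fit_transform(toy))
# Full-time -> 3, Part-time -> 1, Contract -> 1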


sklearn.model_selection - The model_selection module provides methods for setting a blueprint to analyze data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making a prediction. To do that, you need to train your model on one part of a dataset and test it on data that was held out from training.
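For example, train_test_split holds out part of the data for testing while the rest is used for training; the tiny arrays below are only for illustration.

from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 30% of the rows are held out for testing, as in the model further below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
print(len(X_train), len(X_test))  # 7 3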



sklearn.ensemble - The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method.


https://scikit-learn.org/stable/modules/ensemble.html
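A small sketch comparing the two averaging algorithms on a synthetic dataset (make_classification data, so the scores are only illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for model in (RandomForestClassifier(random_state=0), ExtraTreesClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))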



roc_auc_score - Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.


https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
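A minimal example with made-up labels and prediction scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probabilities of class 1

print(roc_auc_score(y_true, y_scores))      # ~0.89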



rfc = RandomForestClassifier() - A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.


https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
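A short sketch of the meta-estimator idea, again on synthetic data: each tree is grown on a bootstrap sample of the rows, and oob_score=True reuses the rows left out of each sample as a built-in check.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 trees (the default), each fitted on a bootstrap sample of the rows
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rfc.fit(X, y)
print(rfc.oob_score_)  # accuracy estimated from the out-of-bag rows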


Receiver Operating Characteristic Curve - A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
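A sketch of how such a curve can be drawn with roc_curve and plotly (the labels and scores below are invented):

from sklearn.metrics import roc_curve, auc
import plotly.express as px

y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

fig = px.line(x=fpr, y=tpr,
              labels={'x': 'False positive rate', 'y': 'True positive rate'},
              title='ROC curve (AUC = {:.2f})'.format(auc(fpr, tpr)))
fig.show()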






# Count mistaken entries - import a Y/N data file with mistakes in online job advertisements; whether or not an advertisement contains a mistake is recorded by the publishers in the column called fraudulent



from google.colab import drive


drive.mount('/content/gdrive')


import pandas as pd


df=pd.read_csv('gdrive/MyDrive/Dataset_for_Colab/fake_job_postings.csv')


df



# Count mistaken entries


count_mistakes = df.groupby('fraudulent').count()


count_mistakes.reset_index(inplace=True)


count_mistakes


# About 4.8% of the data has 1 in the column fraudulent
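The same share can be checked directly from the label column:

# Fraction of rows with fraudulent == 1
print(df['fraudulent'].mean())  # ~0.048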


# Plot counts


import plotly.express as px


fig = px.bar(count_mistakes, x='fraudulent', y='job_id',labels={'job_id': 'count'})

fig.show()



# Plot counts by top 10 job titles


count_titles = df.groupby('title', as_index=False)['fraudulent'].count()

count_titles = count_titles.sort_values('fraudulent', ascending=False)


fig = px.bar(count_titles.iloc[:10],

x='title', y='fraudulent',

labels={'fraudulent': 'count'})

fig.show()



# Extract into boolean columns job titles that contain 'home' or 'remote'


df['home'] = df['title'].str.contains('home')


df['remote'] = df['title'].str.contains('remote')


# Check how many titles have 'home' or 'remote'


print(df['home'].value_counts())


print(df['remote'].value_counts())


print(df['telecommuting'].value_counts())


False 17875

True 5

Name: home, dtype: int64

False 17851

True 29

Name: remote, dtype: int64

0 17113

1 767

Name: telecommuting, dtype: int64



!pip install category_encoders


import category_encoders as ce


count_enc = ce.CountEncoder()


from sklearn.model_selection import train_test_split


from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import roc_auc_score



df


df2 = df.drop(columns='fraudulent')


df2


# Split into train and test datasets


X_train, X_test, y_train, y_test = train_test_split(df2,df['fraudulent'],test_size=0.3,random_state=12)


rfc = RandomForestClassifier()


# Encode the categorical/text columns into counts so the forest gets numeric features,
# filling any missing values left after encoding

X_train_enc = count_enc.fit_transform(X_train).fillna(0)


X_test_enc = count_enc.transform(X_test).fillna(0)


rfc.fit(X_train_enc, y_train)


predictions = rfc.predict(X_test_enc)


score = roc_auc_score(y_test, predictions)


print('Score: {}'.format(score))


From this model, we get an area under the ROC curve of about 0.94, which is a very good prediction.
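As a refinement, roc_auc_score is usually given the predicted probability of the positive class rather than the hard 0/1 predictions; with the forest and encoded test set defined above, that looks like this:

# Probability of class 1 (fraudulent) for each test row
proba = rfc.predict_proba(X_test_enc)[:, 1]

print('Score: {}'.format(roc_auc_score(y_test, proba)))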
