
Python - Binary Bootstrap Sample

The code in this post builds a Yes/No prediction, fitting a model on bootstrap samples to flag mistakes in data entries. Leo Breiman (1928 – 2005) was a statistician at the University of California, Berkeley, who built a bridge between statistics and computer science in machine learning, e.g. classification and regression trees, ensembles of trees, and random forests grown on bootstrap samples. Breiman called bootstrap aggregation "bagging".
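As a quick sketch of the bootstrap idea behind bagging, the toy example below (the 0/1 values are made up) draws several samples with replacement from a small series and averages the per-sample estimates:

import pandas as pd

# Toy data: 1 = mistaken entry, 0 = correct entry (hypothetical values)
data = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# Draw several bootstrap samples (same size, sampled with replacement)
# and estimate the share of mistakes in each one
estimates = []
for seed in range(5):
    boot = data.sample(n=len(data), replace=True, random_state=seed)
    estimates.append(boot.mean())

print(estimates)                        # one estimate per bootstrap sample
print(sum(estimates) / len(estimates))  # bagged (averaged) estimate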



category_encoders - a set of scikit-learn-style transformers for encoding categorical variables into numeric form with different techniques.

ce.CountEncoder()


https://contrib.scikit-learn.org/category_encoders/index.html
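A minimal sketch of what CountEncoder does, on a made-up column: every category is replaced by the number of times it occurs in the data.

import pandas as pd
import category_encoders as ce

# Hypothetical categorical column, just for illustration
toy = pd.DataFrame({'employment_type': ['Full-time', 'Part-time', 'Full-time',
                                        'Contract', 'Full-time']})

enc = ce.CountEncoder()
print(enc.fit_transform(toy))
# Full-time -> 3, Part-time -> 1, Contract -> 1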


sklearn.model_selection - The model_selection module provides methods for setting a blueprint to analyze data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making a prediction. To do that, you need to train your model on one part of a dataset and test it on data that was held out from training.
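For example, train_test_split holds out part of the data for testing while the rest is used for training; the tiny arrays below are only for illustration.

from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 30% of the rows are held out for testing, as in the model further below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
print(len(X_train), len(X_test))  # 7 3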



sklearn.ensemble - The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method.


https://scikit-learn.org/stable/modules/ensemble.html
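A small sketch comparing the two averaging algorithms on a synthetic dataset (make_classification data, so the scores are only illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for model in (RandomForestClassifier(random_state=0), ExtraTreesClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))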



roc_auc_score - Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.


https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
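A minimal example with made-up labels and prediction scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probabilities of class 1

print(roc_auc_score(y_true, y_scores))      # ~0.89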



rfc = RandomForestClassifier() - A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.


https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
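A short sketch of the meta-estimator idea, again on synthetic data: each tree is grown on a bootstrap sample of the rows, and oob_score=True reuses the rows left out of each sample as a built-in check.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 trees (the default), each fitted on a bootstrap sample of the rows
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rfc.fit(X, y)
print(rfc.oob_score_)  # accuracy estimated from the out-of-bag rows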


Receiver Operating Characteristic Curve - A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
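A sketch of how such a curve can be drawn with roc_curve and plotly (the labels and scores below are invented):

from sklearn.metrics import roc_curve, auc
import plotly.express as px

y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

fig = px.line(x=fpr, y=tpr,
              labels={'x': 'False positive rate', 'y': 'True positive rate'},
              title='ROC curve (AUC = {:.2f})'.format(auc(fpr, tpr)))
fig.show()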






# Count mistaken entries - import a Y/N data file with mistakes in online job advertisements; whether or not an advertisement contains a mistake is recorded by the publishers in the column called fraudulent



from google.colab import drive


drive.mount('/content/gdrive')


import pandas as pd


df=pd.read_csv('gdrive/MyDrive/Dataset_for_Colab/fake_job_postings.csv')


df



# Count mistaken entries


count_mistakes = df.groupby('fraudulent').count()


count_mistakes.reset_index(inplace=True)


count_mistakes


# About 4.8% of the data has 1 in the column fraudulent
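The same share can be checked directly from the label column:

# Fraction of rows with fraudulent == 1
print(df['fraudulent'].mean())  # ~0.048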


# Plot counts


import plotly.express as px


fig = px.bar(count_mistakes, x='fraudulent', y='job_id',labels={'job_id': 'count'})

fig.show()



# Plot counts by top 10 job titles


count_titles = df.groupby('title', as_index=False)['fraudulent'].count()

count_titles = count_titles.sort_values('fraudulent', ascending=False)


fig = px.bar(count_titles.iloc[:10],

x='title', y='fraudulent',

labels={'fraudulent': 'count'})

fig.show()



# Extract into boolean columns job titles that contain 'home' or 'remote'


df['home'] = df['title'].str.contains('home')


df['remote'] = df['title'].str.contains('remote')


# Check how many titles have 'home' or 'remote'


print(df['home'].value_counts())


print(df['remote'].value_counts())


print(df['telecommuting'].value_counts())


False 17875

True 5

Name: home, dtype: int64

False 17851

True 29

Name: remote, dtype: int64

0 17113

1 767

Name: telecommuting, dtype: int64



!pip install category_encoders


import category_encoders as ce


count_enc = ce.CountEncoder()


from sklearn.model_selection import train_test_split


from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import roc_auc_score



df


df2 = df.drop(columns='fraudulent')


df2


# Split into train and test datasets


X_train, X_test, y_train, y_test = train_test_split(df2,df['fraudulent'],test_size=0.3,random_state=12)


rfc = RandomForestClassifier()


# Encode the categorical/text columns into counts so the forest gets numeric features,
# filling any missing values left after encoding

X_train_enc = count_enc.fit_transform(X_train).fillna(0)


X_test_enc = count_enc.transform(X_test).fillna(0)


rfc.fit(X_train_enc, y_train)


predictions = rfc.predict(X_test_enc)


score = roc_auc_score(y_test, predictions)


print('Score: {}'.format(score))


From this model, we get an area under the ROC curve of about 0.94, which is a very good prediction.
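As a refinement, roc_auc_score is usually given the predicted probability of the positive class rather than the hard 0/1 predictions; with the forest and encoded test set defined above, that looks like this:

# Probability of class 1 (fraudulent) for each test row
proba = rfc.predict_proba(X_test_enc)[:, 1]

print('Score: {}'.format(roc_auc_score(y_test, proba)))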
