Missing values can be replaced in several ways: by the overall mean, the mean within a particular category, the median (for discrete variables), one of many regression models, neural networks, Bayesian methods, decision trees, or clustering algorithms. In R, good choices for imputing missing (NA) values in numeric variables, so that further descriptive statistics can be computed, are for example the Hmisc and mice packages. The collected data are missing completely at random (MCAR). A share of missing data of up to 5% is considered acceptable. For the variables Employee count, Sales, Facebook likes, Value per visitor, and Capital, 40% of the data are missing, so the results of random forest imputation for these variables are not precise.
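As a minimal sketch of the simplest options named above, the following base R code computes the share of missing values per variable, flags variables above the 5% guideline, and fills the gaps with the overall mean, the median, or the mean of the observation's category. The data frame `toy` and its values are invented for illustration and are not the coffee data.

```r
# Toy data, invented for illustration (not the coffee data set)
toy <- data.frame(
  Category = c("A", "A", "A", "B", "B", "B"),
  Sales    = c(10, NA, 30, 100, 200, NA),
  Visits   = c(5, 6, 7, 8, 9, 10)
)

# Share of missing values per variable, in percent
missing_pct <- colMeans(is.na(toy)) * 100

# Variables above the 5% guideline mentioned in the text
above_guideline <- names(missing_pct)[missing_pct > 5]

# Imputation by the overall mean of the observed values
sales_mean <- ifelse(is.na(toy$Sales), mean(toy$Sales, na.rm = TRUE), toy$Sales)

# Imputation by the overall median of the observed values
sales_median <- ifelse(is.na(toy$Sales), median(toy$Sales, na.rm = TRUE), toy$Sales)

# Imputation by the mean of the observation's category
group_mean <- ave(toy$Sales, toy$Category, FUN = function(x) mean(x, na.rm = TRUE))
sales_by_category <- ifelse(is.na(toy$Sales), group_mean, toy$Sales)
```

These single-value fills are easy to compute but shrink the variance of the variable, which is one reason the model-based packages discussed next are usually preferred.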
Hmisc
The Hmisc package is interesting because it recognizes variable types automatically and combines bootstrapping, additive regression, and predictive mean matching. Instead of the mean or median, it is possible to use the function aregImpute(), which automatically identifies the type of each variable and treats it accordingly.
# Install the packages and load the libraries:
> install.packages("Hmisc")
> library(Hmisc)
> install.packages("missForest")
> library(missForest)
# Create a copy of the data with 10% of the values removed at random (prodNA() is from missForest):
> coffee.mis <- prodNA(coffee, noNA = 0.1)
> summary(coffee.mis)
# Impute with aregImpute(), creating 4 imputations per missing value:
> impute_arg <- aregImpute(~ Emp_count + Sales + KW_search + Position_CH, data = coffee.mis, n.impute = 4)
# See background process:
> impute_arg
# See imputed values:
> impute_arg$imputed$Emp_count
# Compare histograms before and after imputation:
> hist(impute_arg$imputed$Emp_count)
> hist(coffee.mis$Emp_count)
> hist(coffee$Emp_count)
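Because `coffee.mis` is created by deleting known values from `coffee`, the imputed values can also be checked numerically against the values that were removed, not only visually. The sketch below illustrates the idea in base R with invented data. `mask_mcar()` is a hypothetical helper standing in for `prodNA()`, and a plain median fill stands in for `aregImpute()`; neither is the method used in this research.

```r
set.seed(1)

# Invented complete variable (stands in for a column such as Emp_count)
original <- round(rexp(200, rate = 1 / 50))

# Hypothetical helper: delete a fixed share of values completely at random,
# similar in spirit to missForest::prodNA()
mask_mcar <- function(x, share = 0.1) {
  idx <- sample(length(x), size = round(share * length(x)))
  x[idx] <- NA
  x
}

with_na <- mask_mcar(original, share = 0.1)
was_removed <- is.na(with_na)

# Stand-in imputation: fill the gaps with the median of the observed values
imputed <- with_na
imputed[was_removed] <- median(with_na, na.rm = TRUE)

# Root mean squared error on the cells that were artificially removed
rmse <- sqrt(mean((imputed[was_removed] - original[was_removed])^2))

# Summary statistics before and after imputation, side by side
rbind(original = summary(original), imputed = summary(imputed))
```

The same error measure could be computed for any imputation method by replacing the median fill with that method's output, which gives a number to compare alongside the histograms.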
Mice
In the end, this research uses the mice package to impute the variables with its random forest method.
# Use mice to impute values:
> Sys.setlocale("LC_ALL", "C")
> coffee <- read.csv("~/Desktop/coffee.csv", header=TRUE)
> install.packages("mice")
> library(mice)
# Show the pattern of missing data:
> md.pattern(coffee)
> install.packages("VIM")
> library(VIM)
> aggr_plot <- aggr(coffee, col=c('grey','black'), sortVars=TRUE, labels=names(coffee), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
# Black cells in the pattern plot mark missing values.
# Factorization of variables:
> factor_vars <- c('UID','Category','KW','Web','Note', 'FB', 'Headquarter', 'City')
> coffee[factor_vars] <- lapply(coffee[factor_vars], function(x) as.factor(x))
# Set seed for random imputation:
> set.seed(1983)
# The main imputation, run on all columns except the factor variables:
> mice_mod <- mice(coffee[, !names(coffee) %in% c('UID','Category','KW','Web','Note', 'FB', 'Headquarter', 'City')], method='rf')
 iter imp variable
 1 Emp_count Sales KW_search FB_likes Position_CH Impressions Visits Value_visitor Links Site_count Capital Registered
 2 Emp_count Sales KW_search FB_likes Position_CH Impressions Visits Value_visitor Links Site_count Capital Registered
 3 Emp_count Sales KW_search FB_likes Position_CH Impressions Visits Value_visitor Links Site_count Capital Registered
# Describe the imputation process. Random forest is the imputation method used:
> mice_mod
# Copy the completed results of the imputation into a separate output:
> mice_output <- complete(mice_mod)
> par(mfrow=c(1,2))
# Compare the histograms visually:
> hist(coffee$Emp_count, freq=F, main='Emp_count: Original Data', col='dodgerblue')
> hist(mice_output$Emp_count, freq=F, main='Emp_count: MICE Output', col='dodgerblue')
> hist(coffee$Sales, freq=F, main='Sales: Original Data', col='dodgerblue1')
> hist(mice_output$Sales, freq=F, main='Sales: MICE Output', col='dodgerblue1')
> hist(coffee$KW_search, freq=F, main='KW_search: Original Data', col='dodgerblue2')
> hist(mice_output$KW_search, freq=F, main='KW_search: MICE Output', col='dodgerblue2')
> hist(coffee$FB_likes, freq=F, main='FB_likes: Original Data', col='dodgerblue2')
> hist(mice_output$FB_likes, freq=F, main='FB_likes: MICE Output', col='dodgerblue2')
> hist(coffee$Position_CH, freq=F, main='Position_CH: Original Data', col='dodgerblue2')
> hist(mice_output$Position_CH, freq=F, main='Position_CH: MICE Output', col='dodgerblue2')
> hist(coffee$Impressions, freq=F, main='Impressions: Original Data', col='dodgerblue2')
> hist(mice_output$Impressions, freq=F, main='Impressions: MICE Output', col='dodgerblue2')
> hist(coffee$Visits, freq=F, main='Visits: Original Data', col='dodgerblue2')
> hist(mice_output$Visits, freq=F, main='Visits: MICE Output', col='dodgerblue2')
> hist(coffee$Value_visitor, freq=F, main='Value_visitor: Original Data', col='dodgerblue2')
> hist(mice_output$Value_visitor, freq=F, main='Value_visitor: MICE Output', col='dodgerblue2')
> hist(coffee$Value, freq=F, main='Value: Original Data', col='dodgerblue2')
> hist(mice_output$Value, freq=F, main='Value: MICE Output', col='dodgerblue2')
> hist(coffee$Links, freq=F, main='Links: Original Data', col='dodgerblue2')
> hist(mice_output$Links, freq=F, main='Links: MICE Output', col='dodgerblue2')
> hist(coffee$Site_count, freq=F, main='Site_count: Original Data', col='dodgerblue2')
> hist(mice_output$Site_count, freq=F, main='Site_count: MICE Output', col='dodgerblue2')
> hist(coffee$Capital, freq=F, main='Capital: Original Data', col='dodgerblue2')
> hist(mice_output$Capital, freq=F, main='Capital: MICE Output', col='dodgerblue2')
> hist(coffee$Registered, freq=F, main='Registered: Original Data', col='dodgerblue2')
> hist(mice_output$Registered, freq=F, main='Registered: MICE Output', col='dodgerblue2')
# After checking the histograms before and after imputation, replace the original values in the coffee sample:
> coffee$Emp_count <- mice_output$Emp_count
> coffee$Sales <- mice_output$Sales
> coffee$KW_search <- mice_output$KW_search
> coffee$Position_CH <- mice_output$Position_CH
> coffee$Impressions <- mice_output$Impressions
> coffee$Visits <- mice_output$Visits
> coffee$FB_likes <- mice_output$FB_likes
> coffee$Links <- mice_output$Links
> coffee$Site_count <- mice_output$Site_count
> coffee$Value_visitor <- mice_output$Value_visitor
> coffee$Value <- mice_output$Value
> coffee$Capital <- mice_output$Capital
> coffee$Registered <- mice_output$Registered
# Verify that no missing values remain:
> sum(is.na(coffee$Emp_count))
[1] 0
# The Value field stayed unfilled in the mice output (24 values are still missing) and was imputed again with rf later:
> sum(is.na(coffee$Value))
[1] 24
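The column-by-column replacement and the final check can also be written generically. The sketch below uses two invented data frames, `original_df` and `completed_df`, as stand-ins for `coffee` and `mice_output`. It copies every shared column back in one step and counts the remaining missing values per column, which makes a column that was never imputed, as happened with `Value`, easy to spot.

```r
# Invented stand-in for the coffee data, with gaps in three numeric columns
original_df <- data.frame(
  UID       = c("a", "b", "c", "d"),
  Emp_count = c(10, NA, 30, NA),
  Sales     = c(NA, 2, 3, 4),
  Value     = c(1, NA, NA, 4)
)

# Invented stand-in for complete(mice_mod):
# Emp_count and Sales are filled, Value was left untouched
completed_df <- data.frame(
  Emp_count = c(10, 20, 30, 25),
  Sales     = c(2.5, 2, 3, 4),
  Value     = c(1, NA, NA, 4)
)

# Copy every column that exists in both data frames back in one step
shared <- intersect(names(original_df), names(completed_df))
original_df[shared] <- completed_df[shared]

# Count the remaining missing values per column and list unfinished columns
remaining_na <- colSums(is.na(original_df))
not_imputed <- names(remaining_na)[remaining_na > 0]
```

Checking all columns at once avoids relying on a spot check of a single variable.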