
R Studio - Text Analytics - Corpus Inspection


Text analytics in R Studio covers several areas. One of the main tasks is exploring the text corpus. I use a file containing the political speeches of presidents in order to examine political communication. The basic code below finds the most frequent words in the text; the most frequent one is 'will'.



Corpus inspection steps:

  1. Installation of packages and Data Load

  2. Data Preparation and selection

  3. Corpus visualisation







1. Installation of packages and Data Load

## install packages


install.packages("tm")

install.packages("corpus")

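## textplot_wordcloud() is a function in quanteda.textplots, not a package; the packages loaded below are installed instead

install.packages("readr")

install.packages("SnowballC")

install.packages("wordcloud")

install.packages("RColorBrewer")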

install.packages("text2vec")

install.packages( "readtext" )

install.packages("quanteda")

install.packages("quanteda.textplots")

install.packages("quanteda.textstats")



library(readr)


## import data from the csv file saved on the Mac desktop


d = read_csv("~/Desktop/Political_speaches.csv")


## view first 6 rows


head(d)



2. Data Preparation and selection

## Load Libraries

library("tm")

library("SnowballC")

library("wordcloud")

library("RColorBrewer")


## Set the working directory (setwd needs a folder, not a file)

setwd('~/Desktop')


text <- readLines("~/Desktop/Political_speaches.csv")


## Load the data as a corpus


docs <- Corpus(VectorSource(text))
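
Before cleaning, it can help to look at what the corpus actually contains. The following is only a minimal sketch on a hypothetical toy vector (`toy_text` and `toy_docs` are illustrative names, not part of the speech file). It assumes that `tm` is installed and shows that each element of the input vector becomes one document. With `readLines()` this means every line of the csv file, including the header line, is treated as a separate document.

```r
library(tm)

# Hypothetical toy input standing in for the lines of the speech file
toy_text <- c("We will govern.",
              "The state of the nation.",
              "This year we will act.")

toy_docs <- Corpus(VectorSource(toy_text))

# Number of documents: one per element of the input vector
n_docs <- length(toy_docs)
print(n_docs)

# Raw text of the first document
first_doc <- as.character(content(toy_docs[[1]]))
print(first_doc)

# Printed overview of the whole toy corpus
inspect(toy_docs)
```

The same calls can be run on `docs` to check the number of documents and to read a few of them before any transformation is applied.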


## Convert all text to lower case


docs <- tm_map(docs, content_transformer(tolower))


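## Replace the pipe character with a space; this runs before removePunctuation, which would otherwise delete the pipe and join the words around it

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, toSpace, "\\|")

## Remove punctuation

docs <- tm_map(docs, removePunctuation)

## Remove numbers

docs <- tm_map(docs, removeNumbers)

## Remove common English stop words (stopwords(kind = "en") prints the full list)

docs <- tm_map(docs, removeWords, stopwords("english"))

## Remove additional custom words

docs <- tm_map(docs, removeWords, c("big", "small"))

## Reduce words to their stems (uses SnowballC)

docs <- tm_map(docs, stemDocument)

## Collapse repeated whitespace

docs <- tm_map(docs, stripWhitespace)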
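
To see what the cleaning steps do to a single piece of text, here is a rough base-R approximation on a hypothetical toy sentence. It is only a sketch: the variable names and the short stop-word vector `toy_stop` are assumptions for illustration, the English stop-word list in `tm` is much longer, and stemming is left out because it needs `SnowballC`.

```r
# Hypothetical toy sentence (not taken from the speech file)
toy <- "In 2020 the State will govern|lead: 3 BIG reforms!"

step1 <- tolower(toy)                       # convert to lower case
step2 <- gsub("\\|", " ", step1)            # replace the pipe with a space
step3 <- gsub("[[:punct:]]", "", step2)     # remove punctuation
step4 <- gsub("[[:digit:]]", "", step3)     # remove numbers

# Remove a few illustrative stop words and custom words
toy_stop <- c("in", "the", "big", "small")
pattern <- paste0("\\b(", paste(toy_stop, collapse = "|"), ")\\b")
step5 <- gsub(pattern, "", step4, perl = TRUE)

# Collapse repeated whitespace; trimws() additionally drops the leading
# and trailing spaces, which stripWhitespace in tm does not do
step6 <- trimws(gsub("\\s+", " ", step5))

print(step6)
# "state will govern lead reforms"
```

Because the pipe is replaced before punctuation is removed, "govern" and "lead" stay separate words instead of being fused into one token.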


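## Build the term-document matrix (terms in rows, documents in columns)

doc_mat <- TermDocumentMatrix(docs)


m <- as.matrix(doc_mat)


## Sum the counts of each term over all documents and sort in descending order

v <- sort(rowSums(m), decreasing = TRUE)


## Store the result in a data frame and show the five most frequent terms

d_Rcran <- data.frame(word = names(v), freq = v)


head(d_Rcran, 5)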


         word  freq
will     will 11137
state   state  9202
govern govern  8515
year     year  7240
nation nation  6660
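
The row sums of a term-document matrix are simply the total number of times each term occurs across all documents. The sketch below reproduces that idea in base R on hypothetical, already cleaned toy documents (`toy_docs` and `toy_freq` are illustrative names). It does not use `tm`, so details such as the default minimum term length of three characters in `TermDocumentMatrix` are not applied here.

```r
# Hypothetical toy documents, assumed to be cleaned already
toy_docs <- c("state will govern", "nation will act", "will state")

# Split every document on whitespace and pool the tokens
tokens <- unlist(strsplit(toy_docs, "\\s+"))

# Count each token and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)

toy_freq <- data.frame(word = names(counts),
                       freq = as.vector(counts),
                       stringsAsFactors = FALSE)

# The two most frequent tokens: "will" (3) and "state" (2)
print(head(toy_freq, 2))
```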


3. Corpus visualisation

## Word distribution


wordcloud(words = d_Rcran$word,
          freq = d_Rcran$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.0,
          colors = brewer.pal(4, "Set1"))
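
A word cloud gives a quick impression, but exact counts are easier to compare in a bar chart. As a possible complement, the sketch below plots the five most frequent stems with base graphics. The counts are copied from the output shown earlier in this post; the name `top5` and the styling choices are assumptions for illustration.

```r
# Top five stems and their counts, copied from the output in this post
top5 <- data.frame(
  word = c("will", "state", "govern", "year", "nation"),
  freq = c(11137, 9202, 8515, 7240, 6660),
  stringsAsFactors = FALSE
)

# Bar chart of the counts; las = 2 rotates the axis labels
bar_pos <- barplot(top5$freq,
                   names.arg = top5$word,
                   las = 2,
                   col = "steelblue",
                   main = "Most frequent stems",
                   ylab = "Frequency")
```

With the full data frame, `head(d_Rcran, 10)` could be used in place of `top5` to plot more terms.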



