RStudio text analytics offers several areas worth examining; one of the core tasks is exploring a text corpus. Here I use a file containing the texts of presidential political speeches to study political communication. The basic code below finds the most frequent word in the corpus, which turns out to be 'will'.
Corpus inspection steps:
Install packages and load the data
Prepare and clean the data
Visualise the corpus
1.
## install packages
install.packages("tm")                  # text mining framework
install.packages("SnowballC")           # word stemming
install.packages("wordcloud")           # word cloud plotting
install.packages("RColorBrewer")        # colour palettes
install.packages("readr")               # CSV import
install.packages("corpus")
install.packages("text2vec")
install.packages("readtext")
install.packages("quanteda")
install.packages("quanteda.textplots")  # provides textplot_wordcloud()
install.packages("quanteda.textstats")
library(readr)
## import the data from a CSV file saved on the desktop
d <- read_csv("~/Desktop/Political_speaches.csv")
## view first 6 rows
head(d)
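Before building the corpus, it is worth confirming what read_csv actually loaded. A minimal optional check, assuming only the data frame d created above (the name of the column holding the speech text will depend on the file):
## quick structural checks on the imported data frame
dim(d)    # number of speeches (rows) and variables (columns)
names(d)  # column names, to locate the one holding the speech text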
2.
## Load Libraries
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
## set the working directory to the folder containing the file, not the file itself
setwd("~/Desktop")
## read the raw text lines
text <- readLines("~/Desktop/Political_speaches.csv")
## Load the data as a corpus
docs <- Corpus(VectorSource(text))
## Define a transformer that replaces a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
## Convert all text to lower case
docs <- tm_map(docs, content_transformer(tolower))
## Remove punctuation
docs <- tm_map(docs, removePunctuation)
## Remove numbers
docs <- tm_map(docs, removeNumbers)
## Remove English stopwords (the built-in list can be viewed with stopwords("en"))
docs <- tm_map(docs, removeWords, stopwords("english"))
## Remove additional custom words
docs <- tm_map(docs, removeWords, c("big", "small"))
## Reduce words to their stems
docs <- tm_map(docs, stemDocument)
## Replace the "|" separator with a space
docs <- tm_map(docs, toSpace, "\\|")
## Collapse repeated whitespace
docs <- tm_map(docs, stripWhitespace)
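At this point it is easy to sanity-check the cleaning pipeline before moving on. A small optional check using tm's own inspect():
## peek at the first two cleaned documents
inspect(docs[1:2])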
## Build a term-document matrix and rank terms by overall frequency
doc_mat <- TermDocumentMatrix(docs)
m <- as.matrix(doc_mat)
v <- sort(rowSums(m), decreasing = TRUE)
d_Rcran <- data.frame(word = names(v), freq = v)
## show the five most frequent terms
head(d_Rcran, 5)
          word  freq
will      will 11137
state    state  9202
govern  govern  8515
year      year  7240
nation  nation  6660
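The word cloud in step 3 is not the only way to present these frequencies. As an optional companion to the table above, the same d_Rcran data frame can feed a simple bar chart using base R graphics (a sketch; no extra packages required):
## bar chart of the ten most frequent stemmed terms
barplot(d_Rcran$freq[1:10],
        names.arg = d_Rcran$word[1:10],
        las = 2,            # rotate the term labels
        col = "steelblue",
        main = "Top 10 terms in the political speeches",
        ylab = "Frequency")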
3.
## Word distribution
wordcloud(words = d_Rcran$word, freq = d_Rcran$freq, min.freq = 1,
          max.words = 100, random.order = FALSE, rot.per = 0.0,
          colors = brewer.pal(4, "Set1"))
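The quanteda.textplots package installed in step 1 offers an alternative route to a similar plot. A minimal sketch, assuming the raw text vector from step 2 is reused; the tokenisation options shown are illustrative, not part of the original workflow:
library(quanteda)
library(quanteda.textplots)
## build a quanteda corpus from the raw lines and tokenise it
corp <- corpus(text)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))
## document-feature matrix, then the word cloud
dfmat <- dfm(toks)
textplot_wordcloud(dfmat, max_words = 100)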