
Python - Next word prediction - nltk


We can steer a reader's attention through the choice of words in a text document. If an article uses more words that are focused on a group, the reader will focus on group behaviour (executive, leader, state, ...). If an article uses more words or phrases related to investment, readers will focus more on future investments (fund, banker, corporations, ...). The example below is built with the nltk Python library: it counts word pairs (bigrams) in an article and uses the counts to suggest the most likely next word.



# import os

import os


# Read the Wikipedia description of UBS from 6.9.2021 (any other article would work)

base_file = open("ubs.txt", 'rt')

raw_text = base_file.read()

raw_text


'UBS Group AG[nb 1] is a Swiss multinational investment bank and financial services company founded and based in Switzerland. Co-headquartered in the cities of Z\\\'fcrich and Basel, it maintains a presence in all major financial centres as the largest Swiss banking institution and the largest private bank in the world. UBS client services are known for their strict bank\\\'96client confidentiality and culture of banking secrecy.Because of the bank\'s large positions in the Americas, EMEA, and Asia Pacific markets, the Financial Stability Board considers it a global systemically important bank.\\\n\n\n\nUBS was founded in 1862 as the Bank in Winterthur alongside the advent of the Swiss banking industry. During the 1890s, the Swiss Bank Corporation (SBC) was founded, forming a private banking syndicate that expanded, aided by Switzerland\'s international neutrality. In 1912, t ......................


base_file.close()

print("Text read from file : ",raw_text[:200])


Text read from file : UBS Group AG[nb 1] is a Swiss multinational investment bank and financial services company founded and based in Switzerland. Co-headquartered in the cities of Z\'fcrich and Basel, it maintains a prese
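The output above contains sequences such as \'fc in place of "ü", which look like RTF-style hex escapes left over from how ubs.txt was saved. A minimal sketch of one way to decode them before tokenizing follows. The helper name decode_rtf_hex_escapes and the cp1252 code page are assumptions, not part of the original example, and other RTF residue (such as trailing backslashes at line ends) is not handled.

```python
import re

# Hypothetical helper, not part of the original post.
# Assumes the escapes are single bytes in the Windows-1252 code page.
def decode_rtf_hex_escapes(text, encoding="cp1252"):
    r"""Replace RTF-style \'xx hex escapes with the characters they encode."""
    def _decode(match):
        # Two hex digits describe one byte; decode it with the assumed code page
        return bytes([int(match.group(1), 16)]).decode(encoding, errors="replace")
    return re.sub(r"\\'([0-9a-fA-F]{2})", _decode, text)

sample = "Co-headquartered in the cities of Z\\'fcrich and Basel"
cleaned = decode_rtf_hex_escapes(sample)
print(cleaned)
```

Applying such a step to raw_text before tokenization would keep fragments like the stray backslash out of the token list.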


pip install nltk

import nltk

# nltk.tokenize is a module inside nltk, so no separate pip install is needed


# For tweets, the tokenizer would be TweetTokenizer

from nltk.tokenize import TweetTokenizer

# For a plain text file, the tokenizer is word_tokenize

from nltk.tokenize import word_tokenize

nltk.download('punkt')

token_list = nltk.word_tokenize(raw_text)


#Replace special characters

token_list2 = [word.replace("'", "") for word in token_list ]

#Remove punctuations

token_list3 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list2))

#Convert to lower case

token_list4=[word.lower() for word in token_list3 ]


print("\nSample token list : ", token_list4[:10])

Sample token list : ['ubs', 'group', 'ag', 'nb', '1', 'is', 'a', 'swiss', 'multinational', 'investment']

print("\nTotal Tokens : ",len(token_list4))


Total Tokens : 10953
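To make the three cleaning steps easier to follow, here is a small sketch that applies them to a hand-written token list. The sample tokens are an assumption standing in for the output of word_tokenize, and the punctuation check uses string.punctuation as a simplified substitute for the PunktToken test used above, so edge cases may differ.

```python
import string

# Hand-written tokens standing in for word_tokenize output (assumption)
sample_tokens = ["UBS", "Group", "AG", "[", "nb", "1", "]",
                 "is", "a", "Swiss", "bank", "'s", "."]

# Step 1: remove apostrophes, so "'s" becomes "s"
no_quotes = [tok.replace("'", "") for tok in sample_tokens]

# Step 2: drop empty tokens and tokens made up only of punctuation
# (simplified stand-in for PunktToken(...).is_non_punct)
no_punct = [tok for tok in no_quotes
            if tok and not all(ch in string.punctuation for ch in tok)]

# Step 3: convert to lower case
cleaned_tokens = [tok.lower() for tok in no_punct]
print(cleaned_tokens)
```

Removing the apostrophe turns the possessive token into a bare "s", which is why "s" can show up later among the predicted next words.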


# Create ngrams for text prediction

from nltk.util import ngrams


#Use a sqlite database to store ngrams information

import sqlite3

conn = sqlite3.connect(":memory:")


#table to store first word, second word and count of occurrences

conn.execute('''DROP TABLE IF EXISTS NGRAMS''')

conn.execute('''CREATE TABLE NGRAMS

(FIRST TEXT NOT NULL,

SECOND TEXT NOT NULL,

COUNTS INT NOT NULL,

CONSTRAINT PK_GRAMS PRIMARY KEY (FIRST,SECOND));''')


#Generate bigrams

bigrams = ngrams(token_list4,2)
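For readers who want to see what a bigram is without running nltk, the following sketch builds the same kind of pairs with plain Python. The short token list is made up for illustration.

```python
# Made-up tokens for illustration
tokens = ["ubs", "group", "ag", "is", "a", "swiss", "bank"]

# Pair every token with its successor: the bigrams of the sequence
bigram_pairs = list(zip(tokens, tokens[1:]))
print(bigram_pairs)
```

A sequence of n tokens yields n - 1 bigrams. Note that ngrams returns a lazy iterator, so it can be looped over only once; wrap it in list() if the pairs are needed again.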


#Store bigrams in DB

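for first, second in bigrams:
    # Insert the pair, or increase its count if it is already stored.
    # The ? placeholders let sqlite3 handle quoting of the words safely.
    conn.execute(
        "INSERT INTO NGRAMS (FIRST, SECOND, COUNTS) VALUES (?, ?, 1) "
        "ON CONFLICT(FIRST, SECOND) DO UPDATE SET COUNTS = COUNTS + 1",
        (first, second),
    )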


#Look at sample data from the table

cursor = conn.execute("SELECT FIRST, SECOND, COUNTS from NGRAMS LIMIT 5")

for gram_row in cursor:
    print("FIRST=", gram_row[0], "SECOND=", gram_row[1], "COUNT=", gram_row[2])
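As a cross-check of the counts stored in the table, the same bigram counting can be sketched in memory with collections.Counter. This is an alternative illustration on made-up tokens, not the approach used in this post.

```python
from collections import Counter

# Made-up tokens for illustration
tokens = ["investment", "bank", "and", "investment", "bank",
          "and", "investment", "fund"]

# Count how often each (first, second) pair occurs
bigram_counts = Counter(zip(tokens, tokens[1:]))

for (first, second), count in bigram_counts.most_common(3):
    print("FIRST=", first, "SECOND=", second, "COUNT=", count)
```

For a single article a Counter is enough; a database table becomes more useful when the counts have to be kept between runs or queried with SQL.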


# Create prediction


#Function to query DB and find next word


def recommend(word):
    nextwords = []
    # Find next words, sorted from most to least frequent
    # (the ? placeholder avoids building SQL by string concatenation)
    cur_filter = conn.execute(
        "SELECT SECOND FROM NGRAMS WHERE FIRST = ? ORDER BY COUNTS DESC",
        (word,),
    )
    # Build a list ordered from most frequent to least frequent next word
    for filt_row in cur_filter:
        nextwords.append(filt_row[0])
    return nextwords


#Recommend for words group and investment

print("Next words for group are: ", recommend("group"))

print("\nNext words for investment are: ", recommend("investment"))




Next words for group are: ['ag', 'leader', 'received', 'also', 'and', 'ceo', 'companies', 'dillon', 'executive', 'ing', 'kengeter', 'of', 's', 'state', 'were']


Next words for investment are: ['bank', 'banking', 'banks', 'bankers', 'bank\\', 'banker', 'fund', 'grade', '265', 'advisory', 'capabilities', 'corporation', 'management', 'managers', 'officer', 'products', 'teams', 'trust', 'trusts']
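The lists above are ranked by raw counts. One possible extension is to turn the counts into relative frequencies, that is, an estimate of how likely each next word is given the first word. The sketch below uses a separate toy table; the function name next_word_probabilities, the connection name conn_demo and the counts are all made up for illustration and are not taken from the UBS text.

```python
import sqlite3

# Separate in-memory database with made-up counts (assumption)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE NGRAMS (FIRST TEXT NOT NULL, SECOND TEXT NOT NULL, "
    "COUNTS INT NOT NULL, PRIMARY KEY (FIRST, SECOND))"
)
conn_demo.executemany(
    "INSERT INTO NGRAMS VALUES (?, ?, ?)",
    [("investment", "bank", 3), ("investment", "fund", 1), ("group", "ag", 2)],
)

# Hypothetical helper: share of each next word among all bigrams starting with `word`
def next_word_probabilities(connection, word):
    rows = connection.execute(
        "SELECT SECOND, COUNTS FROM NGRAMS WHERE FIRST = ? "
        "ORDER BY COUNTS DESC, SECOND",
        (word,),
    ).fetchall()
    total = sum(count for _, count in rows)
    # An unknown word yields no rows, so the result is an empty list
    return [(second, count / total) for second, count in rows]

probs = next_word_probabilities(conn_demo, "investment")
print(probs)
```

With these toy counts, "bank" follows "investment" in three of four cases, so it gets a share of 0.75 and "fund" gets 0.25.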











