The purpose of this contribution is to try out the LDA methodology in Python. LDA is one of many methodologies for inferring topics from unstructured text data. Topic modeling is unsupervised learning: we generate many models from the text and afterwards select the one with the highest coherence score. Examples of coherence measures are the Gensim coherence model and TC-W2V. LDA provides high coherence scores compared to NMF or LSA on general text, but for large scientific text corpora SeNMFk works better. The interpretability of LDA output usually does not correlate well with how a human reader would identify topics.
1. Theory
Bigram — A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.
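In gensim, frequent bigrams can be detected and merged into single tokens with the Phrases model; a minimal sketch, where the tokenised sentences and the thresholds are illustrative choices:

```python
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "with", "python"],
]

# Learn which adjacent word pairs occur often enough to be merged
bigram = Phrases(sentences, min_count=1, threshold=1)
phraser = Phraser(bigram)
print(phraser[sentences[0]])  # e.g. ['machine_learning', 'is', 'fun']
```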
BOW — Bag of words
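In gensim, the bag-of-words representation of a tokenised document is produced by a Dictionary; a minimal sketch with illustrative sample tokens:

```python
from gensim import corpora

texts = [["yoga", "health", "yoga"], ["vegan", "health"]]
dictionary = corpora.Dictionary(texts)

# Each document becomes a list of (token_id, count) pairs; word order is discarded
bow_corpus = [dictionary.doc2bow(text) for text in texts]
print(bow_corpus[0])  # e.g. [(0, 1), (1, 2)] -> 'health' once, 'yoga' twice
```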
Chunksize — controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents fits easily into memory.
Dirichlet distributions — a way to model random probability mass functions (PMFs) for finite sets.
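For intuition, one can draw random PMFs from a Dirichlet distribution with NumPy; the concentration parameters alpha below are an arbitrary choice for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each draw is a probability vector over 3 outcomes (non-negative, sums to 1)
alpha = [0.5, 0.5, 0.5]  # small alpha -> sparse, peaked PMFs
pmfs = rng.dirichlet(alpha, size=4)
print(pmfs)
print(pmfs.sum(axis=1))  # each row sums to 1
```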
Dirichlet kernel — a periodic version of the sinus cardinalis: on the one hand, $D_n$ is obviously periodic (with period $2\pi$); on the other hand, if $n$ is large and $x$ is small, the following holds:
$$D_n(x) = \frac{\sin\big((n+\tfrac{1}{2})x\big)}{\sin(x/2)} \approx \frac{\sin\big((n+\tfrac{1}{2})x\big)}{x/2} = (2n+1)\,\operatorname{sinc}\big((n+\tfrac{1}{2})x\big)$$
Dump() — the json.dumps() method is used when the object is required in string format, for parsing, printing, etc. json.dump() instead writes the output to a file and takes the open file object in which the output is to be stored as an argument.
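A minimal sketch of the difference, assuming the standard-library json module and a throwaway data dictionary:

```python
import json

data = {"topic": 3, "words": ["yoga", "vegan", "health"]}

# dumps() returns the JSON document as a string
json_string = json.dumps(data)
print(json_string)

# dump() writes the JSON document to an open file object
with open("topics.json", "w") as f:
    json.dump(data, f)
```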
Gibbs sampling — the Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations approximated from a specified multivariate probability distribution when direct sampling is difficult. The sampler can be blocked or collapsed; collapsed Gibbs sampling is a common inference method for LDA.
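As a toy illustration of the idea (not the collapsed sampler used for LDA), a minimal Gibbs sampler for a bivariate normal distribution, resampling each coordinate from its conditional given the other; the correlation rho is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8          # correlation of the target bivariate normal (illustrative)
n_samples = 5000
x, y = 0.0, 0.0    # arbitrary starting point
samples = np.empty((n_samples, 2))

for i in range(n_samples):
    # Conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = (x, y)

print(np.corrcoef(samples.T)[0, 1])  # should be close to rho
```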
Glob — The glob module is a useful part of the Python standard library. glob (short for global) is used to return all file paths that match a specific pattern.
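For example, collecting all text files of a corpus; the directory name and pattern here are placeholders:

```python
import glob

# Recursively match every .txt file under the (hypothetical) corpus/ directory
paths = glob.glob("corpus/**/*.txt", recursive=True)
for path in paths:
    print(path)
```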
Latent Dirichlet Allocation — LDA — a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar.
Lemmatization — the algorithmic process of determining the lemma of a word based on its intended meaning.
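A minimal sketch using NLTK's WordNetLemmatizer (one common choice; spaCy is another, and the example words are illustrative):

```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="v"))  # -> "study"
print(lemmatizer.lemmatize("better", pos="a"))   # -> "good"
```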
TBuckets — a method for measuring the quality of LDA topics; it uses singular value decomposition (SVD) to discover important themes in topic words and ILP-based optimization to find optimal word-bucket assignments.
TopClus — an unsupervised topic discovery method that jointly models words, documents and topics in a latent spherical space derived from pretrained language model representations.
Topic Coherence — coherence measures score a single topic by measuring the degree of semantic similarity between the topic's high-scoring words.
Topic discovery — some approaches model each short document as a Gaussian topic over word embeddings in the vector space.
Topic modeling methodologies — LDA, LSA (latent semantic analysis), NMF (non-negative matrix factorisation), SeNMFk (for large scientific texts). In Scikit-learn, we can generate a TF-IDF weighted document-term matrix by using TfidfVectorizer, as sketched below.
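A minimal sketch of building the TF-IDF matrix and factorising it with NMF in scikit-learn; the toy documents and the choice of two components are illustrative:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "yoga improves health and mindfulness",
    "vegan diet and health benefits",
    "election campaign on social media",
]

# TF-IDF weighted document-term matrix
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Factorise into 2 topics: W (documents x topics), H (topics x terms)
nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(tfidf)
terms = vectorizer.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```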
Triangle — a way to visualise LDA with three topics: each corner of the triangle is a topic, and every document is a point inside the triangle whose position reflects its topic mixture.
Trigram — a sequence of three adjacent elements from a string of tokens, which are typically letters, syllables, or words. A trigram is an n-gram for n=3.
2. Python code
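Since the workflow described above (build a BOW corpus, train several LDA models, pick the one with the highest coherence score) maps directly onto the gensim API, here is a minimal sketch of it; the toy documents, the range of topic counts, and parameters such as chunksize and passes are illustrative choices, not tuned values:

```python
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# Toy tokenised corpus; in practice this comes from cleaned, lemmatized text
texts = [
    ["yoga", "health", "mindfulness", "practice"],
    ["vegan", "diet", "health", "nutrition"],
    ["election", "campaign", "social", "media"],
    ["voter", "behavior", "social", "media", "election"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train one model per candidate topic count and keep the most coherent one
best_model, best_score = None, float("-inf")
for num_topics in range(2, 4):
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        chunksize=2000,   # documents per training chunk (see 'Chunksize' above)
        passes=10,
        random_state=42,
    )
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    score = cm.get_coherence()
    print(f"{num_topics} topics -> coherence {score:.3f}")
    if score > best_score:
        best_model, best_score = lda, score

print(best_model.print_topics())
```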
3. References
https://www.researchgate.net/publication/338491281_Yoga-Veganism_Correlation_Mining_of_Twitter_Health_Data/download
https://en.wikipedia.org/wiki/Coherence_(physics)
https://aclanthology.org/E17-2070.pdf
https://math.aalto.fi/opetus/MatOhjelmistot/2016syksySCI/Lectures/html/plot3d.html
https://www.chegg.com/homework-help/questions-and-answers/3-looping-3-d-plots-3-points-question-create-3d-plot-dlv-function-f-x-y-cos-x-sin-y-follow-q67161879
https://latex-cookbook.net/3d-plot/
https://fasttext.cc/docs/en/cheatsheet.html
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
https://en.wikipedia.org/wiki/Word2vec
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
https://en.wikipedia.org/wiki/Latent_semantic_analysis
https://www.scientific.net/KEM.840.430
https://en.wikipedia.org/wiki/Sine_and_cosine_transforms
https://en.wikipedia.org/wiki/Fourier_transform
https://pypi.org/project/gensim/
https://towardsdatascience.com/topic-model-visualization-using-pyldavis-fecd7c18fbf6
https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know
https://pyldavis.readthedocs.io/en/latest/readme.html
https://en.wikipedia.org/wiki/Peter_Gustav_Lejeune_Dirichlet
https://en.wikipedia.org/wiki/Lemmatisation
https://en.wikipedia.org/wiki/Bigram
https://www.researchgate.net/publication/362396976_PG-CODE_Latent_dirichlet_allocation_embedded_policy_knowledge_graph_for_government_department_coordination
https://solace.com/blog/topic-discovery-explorer/
https://math.stackexchange.com/questions/2306017/sinc-function-vs-dirichlet-kernel
https://lettier.com/projects/lda-topic-modeling/
https://www.tensorflow.org/probability/api_docs/python/tfp/substrates/numpy/distributions/Dirichlet
https://ieeexplore.ieee.org/document/7837989
https://www.researchgate.net/publication/357552935_Topic_Analysis_of_Superconductivity_Literature_by_Semantic_Non-negative_Matrix_Factorization
https://en.wikipedia.org/wiki/Dirichlet_distribution
https://en.wikipedia.org/wiki/Gibbs_sampling#Variations_and_extensions
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
https://www.youtube.com/watch?v=T05t-SqKArY
https://github.com/wjbmattingly/topic_modeling_textbook
https://towardsdatascience.com/the-python-glob-module-47d82f4cbd2d
https://www.youtube.com/watch?v=TKjjlp5_r7o
https://github.com/yumeng5/TopClus
https://journalofbigdata.springeropen.com/articles/10.1186/s40537-016-0039-2
http://electron6.phys.utk.edu/optics421/modules/m5/Coherence.htm
https://www.academia.edu/7766596/Coherence_and_Cohesion_for_the_Assessment_of_Text_Readability
https://www.researchgate.net/publication/361491360_Analyzing_voter_behavior_on_social_media_during_the_2020_US_presidential_election_campaign/link/62d852ea05df5805dab543a1/download