The purpose of this contribution is to try out the LDA methodology in Python. LDA is one of many methodologies for inferring topics from unstructured text data. Topic modeling is unsupervised learning: we generate many models from the text and afterwards select the one with the highest coherence score. Examples of coherence measures are the Gensim coherence model and TC-W2V. LDA provides high coherence scores compared to NMF or LSA on general text, but for large scientific text corpora SeNMFk works better. The interpretability of LDA output usually does not correlate well with how a human reader would identify topics.
1. Theory
Bigram — A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.
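In gensim, frequent bigrams can be detected and merged into single tokens with the Phrases model; a minimal sketch, where the tokenised sentences and the thresholds are illustrative choices:

```python
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "with", "python"],
]

# Learn which adjacent word pairs occur often enough to be merged
bigram = Phrases(sentences, min_count=1, threshold=1)
phraser = Phraser(bigram)
print(phraser[sentences[0]])  # e.g. ['machine_learning', 'is', 'fun']
```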
BOW — Bag of words
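In gensim, the bag-of-words representation of a tokenised document is produced by a Dictionary; a minimal sketch with illustrative sample tokens:

```python
from gensim import corpora

texts = [["yoga", "health", "yoga"], ["vegan", "health"]]
dictionary = corpora.Dictionary(texts)

# Each document becomes a list of (token_id, count) pairs; word order is discarded
bow_corpus = [dictionary.doc2bow(text) for text in texts]
print(bow_corpus[0])  # e.g. [(0, 1), (1, 2)] -> 'health' once, 'yoga' twice
```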
Chunksize — controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents fits easily into memory.
Dirichlet distributions — a way to model random probability mass functions (PMFs) for finite sets.
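For intuition, one can draw random PMFs from a Dirichlet distribution with NumPy; the concentration parameters alpha below are an arbitrary choice for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each draw is a probability vector over 3 outcomes (non-negative, sums to 1)
alpha = [0.5, 0.5, 0.5]  # small alpha -> sparse, peaked PMFs
pmfs = rng.dirichlet(alpha, size=4)
print(pmfs)
print(pmfs.sum(axis=1))  # each row sums to 1
```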
Dirichlet kernel — a periodic version of the sinus cardinalis: on the one hand, $D_n$ is obviously periodic (with period $2\pi$); on the other hand, if $n$ is large and $x$ is small, the following holds:
$$D_n(x) = \frac{\sin\big((n+\tfrac{1}{2})x\big)}{\sin(x/2)} \approx \frac{\sin\big((n+\tfrac{1}{2})x\big)}{x/2} = (2n+1)\,\operatorname{sinc}\big((n+\tfrac{1}{2})x\big)$$
Dump() — the json.dumps() method is used when the object is required in string format, for parsing, printing, etc. json.dump() instead writes the output to a file and takes the open file object in which the output is to be stored as an argument.
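A minimal sketch of the difference, assuming the standard-library json module and a throwaway data dictionary:

```python
import json

data = {"topic": 3, "words": ["yoga", "vegan", "health"]}

# dumps() returns the JSON document as a string
json_string = json.dumps(data)
print(json_string)

# dump() writes the JSON document to an open file object
with open("topics.json", "w") as f:
    json.dump(data, f)
```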
Gibbs sampling — the Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations approximated from a specified multivariate probability distribution when direct sampling is difficult. The sampler can be blocked or collapsed; collapsed Gibbs sampling is a common inference method for LDA.
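As a toy illustration of the idea (not the collapsed sampler used for LDA), a minimal Gibbs sampler for a bivariate normal distribution, resampling each coordinate from its conditional given the other; the correlation rho is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8          # correlation of the target bivariate normal (illustrative)
n_samples = 5000
x, y = 0.0, 0.0    # arbitrary starting point
samples = np.empty((n_samples, 2))

for i in range(n_samples):
    # Conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = (x, y)

print(np.corrcoef(samples.T)[0, 1])  # should be close to rho
```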
Glob — The glob module is a useful part of the Python standard library. glob (short for global) is used to return all file paths that match a specific pattern.
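For example, collecting all text files of a corpus; the directory name and pattern here are placeholders:

```python
import glob

# Recursively match every .txt file under the (hypothetical) corpus/ directory
paths = glob.glob("corpus/**/*.txt", recursive=True)
for path in paths:
    print(path)
```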
Latent Dirichlet Allocation — LDA — a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar.
Lemmatization — the algorithmic process of determining the lemma of a word based on its intended meaning.
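A minimal sketch using NLTK's WordNetLemmatizer (one common choice; spaCy is another, and the example words are illustrative):

```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="v"))  # -> "study"
print(lemmatizer.lemmatize("better", pos="a"))   # -> "good"
```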
TBuckets — a method for measuring the quality of LDA topics; it uses singular value decomposition (SVD) to discover important themes in topic words and ILP-based optimization to find optimal word-bucket assignments.
TopClus — an unsupervised topic discovery method that jointly models words, documents and topics in a latent spherical space derived from pretrained language model representations.
Topic Coherence — coherence measures score a single topic by measuring the degree of semantic similarity between the topic's high-scoring words.
Topic discovery — some approaches model each short document as a Gaussian topic over word embeddings in the vector space.
Topic modeling methodologies — LDA, LSA (latent semantic analysis), NMF (non-negative matrix factorisation), SeNMFk (for large scientific texts). In Scikit-learn, we can generate a TF-IDF weighted document-term matrix by using TfidfVectorizer, as sketched below.
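A minimal sketch of building the TF-IDF matrix and factorising it with NMF in scikit-learn; the toy documents and the choice of two components are illustrative:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "yoga improves health and mindfulness",
    "vegan diet and health benefits",
    "election campaign on social media",
]

# TF-IDF weighted document-term matrix
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Factorise into 2 topics: W (documents x topics), H (topics x terms)
nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(tfidf)
terms = vectorizer.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```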
Triangle — a way to visualise LDA with three topics: each corner of the triangle is a topic, and every document is a point inside the triangle whose position reflects its topic mixture.
Trigram — a sequence of three adjacent elements from a string of tokens, which are typically letters, syllables, or words. A trigram is an n-gram for n=3.
2. Python code
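Since the workflow described above (build a BOW corpus, train several LDA models, pick the one with the highest coherence score) maps directly onto the gensim API, here is a minimal sketch of it; the toy documents, the range of topic counts, and parameters such as chunksize and passes are illustrative choices, not tuned values:

```python
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# Toy tokenised corpus; in practice this comes from cleaned, lemmatized text
texts = [
    ["yoga", "health", "mindfulness", "practice"],
    ["vegan", "diet", "health", "nutrition"],
    ["election", "campaign", "social", "media"],
    ["voter", "behavior", "social", "media", "election"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train one model per candidate topic count and keep the most coherent one
best_model, best_score = None, float("-inf")
for num_topics in range(2, 4):
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        chunksize=2000,   # documents per training chunk (see 'Chunksize' above)
        passes=10,
        random_state=42,
    )
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    score = cm.get_coherence()
    print(f"{num_topics} topics -> coherence {score:.3f}")
    if score > best_score:
        best_model, best_score = lda, score

print(best_model.print_topics())
```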
3. References
https://www.researchgate.net/publication/338491281_Yoga-Veganism_Correlation_Mining_of_Twitter_Health_Data/download
https://en.wikipedia.org/wiki/Coherence_(physics)
https://aclanthology.org/E17-2070.pdf
https://math.aalto.fi/opetus/MatOhjelmistot/2016syksySCI/Lectures/html/plot3d.html
https://www.chegg.com/homework-help/questions-and-answers/3-looping-3-d-plots-3-points-question-create-3d-plot-dlv-function-f-x-y-cos-x-sin-y-follow-q67161879
https://latex-cookbook.net/3d-plot/
https://fasttext.cc/docs/en/cheatsheet.html
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
https://en.wikipedia.org/wiki/Word2vec
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
https://en.wikipedia.org/wiki/Latent_semantic_analysis
https://www.scientific.net/KEM.840.430
https://en.wikipedia.org/wiki/Sine_and_cosine_transforms
https://en.wikipedia.org/wiki/Fourier_transform
https://pypi.org/project/gensim/
https://towardsdatascience.com/topic-model-visualization-using-pyldavis-fecd7c18fbf6
https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know
https://pyldavis.readthedocs.io/en/latest/readme.html
https://en.wikipedia.org/wiki/Peter_Gustav_Lejeune_Dirichlet
https://en.wikipedia.org/wiki/Lemmatisation
https://en.wikipedia.org/wiki/Bigram
https://www.researchgate.net/publication/362396976_PG-CODE_Latent_dirichlet_allocation_embedded_policy_knowledge_graph_for_government_department_coordination
https://solace.com/blog/topic-discovery-explorer/
https://math.stackexchange.com/questions/2306017/sinc-function-vs-dirichlet-kernel
https://lettier.com/projects/lda-topic-modeling/
https://www.tensorflow.org/probability/api_docs/python/tfp/substrates/numpy/distributions/Dirichlet
https://ieeexplore.ieee.org/document/7837989
https://www.researchgate.net/publication/357552935_Topic_Analysis_of_Superconductivity_Literature_by_Semantic_Non-negative_Matrix_Factorization
https://en.wikipedia.org/wiki/Dirichlet_distribution
https://en.wikipedia.org/wiki/Gibbs_sampling#Variations_and_extensions
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
https://www.youtube.com/watch?v=T05t-SqKArY
https://github.com/wjbmattingly/topic_modeling_textbook
https://towardsdatascience.com/the-python-glob-module-47d82f4cbd2d
https://www.youtube.com/watch?v=TKjjlp5_r7o
https://github.com/yumeng5/TopClus
https://journalofbigdata.springeropen.com/articles/10.1186/s40537-016-0039-2
http://electron6.phys.utk.edu/optics421/modules/m5/Coherence.htm
https://www.academia.edu/7766596/Coherence_and_Cohesion_for_the_Assessment_of_Text_Readability
https://www.researchgate.net/publication/361491360_Analyzing_voter_behavior_on_social_media_during_the_2020_US_presidential_election_campaign/link/62d852ea05df5805dab543a1/download