Python - Semantic libraries

The purpose of this post is to test and compare several semantic libraries and models in Python. It walks through basic insights from the nltk and spacy libraries and compares the fasttext, gensim, CBOW and PCA methodologies. All of these mechanisms take tokenised text as input and produce basic clusters and contextual outputs.


1. Theory


Adam - an adaptive optimizer that works well with sparse data: its adaptive learning rate suits these datasets. On top of the advantages of Adadelta and RMSprop, it stores an exponentially decaying average of past gradients, similar to momentum. It is a good first optimiser to try.
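
A minimal sketch of picking Adam when compiling a Keras model (the layer sizes and learning rate below are illustrative assumptions, not values from this post):

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Adam with (near-)default settings is usually a reasonable first choice
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])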


AffinityPropagation - In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points.
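
A small scikit-learn sketch on made-up 2D points; unlike k-means, the number of clusters is not fixed in advance but emerges from the message passing:

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 0.1]])
    ap = AffinityPropagation(random_state=0).fit(X)
    print(ap.labels_)            # cluster index assigned to each point
    print(ap.cluster_centers_)   # the exemplars chosen as cluster centres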


Binary classification task - a task with two target classes, typically using a sigmoid output activation and binary cross-entropy as the loss.


Categorical_crossentropy - the loss used when compiling the CBOW model: the cross-entropy between the one-hot encoded target word and the predicted probability distribution over the vocabulary.


CBOW - the CBOW model architecture tries to predict the current target word (the centre word) based on the source context words (the surrounding words).
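
A minimal Keras sketch of the CBOW idea (the vocabulary size and embedding size below are toy assumptions): context word indices are embedded, averaged with a Lambda layer, and a softmax Dense layer predicts the centre word, trained with categorical cross-entropy.

    import tensorflow as tf
    from tensorflow import keras

    vocab_size, embed_dim = 1000, 50   # assumed toy values

    cbow = keras.Sequential([
        # input: batches of context word indices, shape (batch, 2 * window)
        keras.layers.Embedding(vocab_size, embed_dim),
        # average the context word vectors into a single context vector
        keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
        # probability distribution over the vocabulary for the centre word
        keras.layers.Dense(vocab_size, activation="softmax"),
    ])
    cbow.compile(optimizer="adam", loss="categorical_crossentropy")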


Dense - Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True). These are all attributes of Dense.


Embedding - a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs such as sparse vectors representing words.


Embedding initialisers - e.g. glorot_uniform, glorot_normal.


Euclidean distance - the Euclidean distance between two points in Euclidean space is the length of the line segment between them, e.g. between a pair of word vectors.


Euclidean distance matrix - Euclidean distance matrices are closely related to Gram matrices (matrices of dot products, describing norms of vectors and angles between them). The latter are easily analyzed using methods of linear algebra. Distance matrices are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space.
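
A short numpy/scipy sketch on a few made-up 3-dimensional word vectors, computing a single distance and the full pairwise distance matrix:

    import numpy as np
    from scipy.spatial.distance import cdist

    vectors = np.array([[0.1, 0.3, 0.5],
                        [0.2, 0.1, 0.4],
                        [0.9, 0.8, 0.7]])
    # distance between one pair of vectors
    print(np.linalg.norm(vectors[0] - vectors[1]))
    # symmetric pairwise distance matrix with zeros on the diagonal
    print(cdist(vectors, vectors, metric="euclidean"))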


Fan-in: the maximum number of inputs that a system can accept.


Fan-out: the maximum number of inputs that the output of a system can feed to other systems (e.g. a fan-out of 100). Fan-out is also a messaging pattern used to model an information exchange that implies the delivery (or spreading) of a message to one or multiple destinations, possibly in parallel.


Fasttext - an extension of the word2vec model that treats each word as composed of character n-grams. FastText generates better embeddings for rare and out-of-vocabulary words because, even if a word is rare, its character n-grams are still shared with other words.
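
A minimal gensim FastText sketch on a toy tokenised corpus (the sentences and hyper-parameters are illustrative); because of the character n-grams it can still return a vector for a word it has never seen:

    from gensim.models import FastText

    sentences = [["the", "king", "rules", "the", "kingdom"],
                 ["the", "queen", "rules", "the", "kingdom"]]
    ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=20)
    # an out-of-vocabulary word still gets a vector from its character n-grams
    print(ft.wv["kingship"][:5])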


gensim - a library providing a robust, efficient and scalable implementation of the Word2Vec model (among other embedding and topic-modelling tools).
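
A minimal gensim Word2Vec sketch on the same kind of toy tokenised corpus (sizes are illustrative); sg=0 trains a CBOW model and sg=1 a skip-gram model:

    from gensim.models import Word2Vec

    sentences = [["the", "king", "rules", "the", "kingdom"],
                 ["the", "queen", "rules", "the", "kingdom"]]
    w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0, epochs=20)
    print(w2v.wv.most_similar("king", topn=3))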


Glorot normal initialisation - the technique is almost the same as Glorot uniform, except that the random values are drawn from a normal (Gaussian) distribution with zero mean and standard deviation sqrt(2 / (nin + nout)) instead of from a uniform distribution. Like Glorot uniform, it uses both the layer fan-in (nin) and fan-out (nout), balancing the forward pass and backpropagation; a standalone sketch is given below.
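
A standalone numpy sketch of Glorot normal initialisation (the original post's init_weights fragment belonged to a class that is not shown here, so this is a hedged reconstruction):

    import numpy as np

    def glorot_normal(nin, nout, seed=0):
        # Glorot normal: zero-mean Gaussian with std = sqrt(2 / (nin + nout))
        std = np.sqrt(2.0 / (nin + nout))
        rng = np.random.default_rng(seed)
        return rng.normal(loc=0.0, scale=std, size=(nin, nout))

    W = glorot_normal(nin=100, nout=50)
    print(W.std())   # close to sqrt(2 / 150) ~= 0.115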


Glove - GloVe stands for Global Vectors; it is an unsupervised learning model that can be used to obtain dense word vectors similar to Word2Vec.
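
Pre-trained GloVe vectors can be loaded through gensim's downloader API; a small sketch (the model name is one of the standard gensim-data sets and is downloaded on first use):

    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-50")   # downloads on first use
    print(glove.most_similar("king", topn=3))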


He initialisation - the usual recommendation is Xavier/Glorot initialization when the activation function is tanh, and He initialization when the activation function is ReLU.


KMeans - k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
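
A small scikit-learn sketch on made-up 2D points; unlike affinity propagation, the number of clusters k is chosen up front:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster index assigned to each point
    print(km.cluster_centers_)  # the cluster means used as prototypes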


Lambda - The Lambda layer exists so that arbitrary expressions can be used as a Layer when constructing Sequential and Functional API models.


Model to dot - converts a Keras model to dot format for graph visualisation.


One Hot Encoding - used when the categorical features in the data are not ordinal. When the number of categorical features in the dataset is small, one-hot encoding can be applied for the model: each category becomes its own binary column.
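
A minimal scikit-learn sketch on a made-up categorical column:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    colours = np.array([["red"], ["green"], ["blue"], ["green"]])
    encoder = OneHotEncoder(handle_unknown="ignore")
    print(encoder.fit_transform(colours).toarray())   # one binary column per category
    print(encoder.categories_)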


Optimizers - available optimizers in Keras include adam, adadelta, adagrad, adamax, nadam, ftrl, SGD (stochastic gradient descent) and RMSprop.


PCA - principal component analysis, a linear dimensionality-reduction technique often used here to project word vectors into two dimensions for visualisation.
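
In this context PCA is typically used to project high-dimensional word vectors down to 2D for plotting; a minimal sketch with random stand-in vectors:

    import numpy as np
    from sklearn.decomposition import PCA

    vectors = np.random.default_rng(0).normal(size=(100, 50))  # stand-in for word vectors
    coords = PCA(n_components=2).fit_transform(vectors)
    print(coords.shape)   # (100, 2) - one 2D point per word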


Perplexity - a measurement of how well a probability distribution or probability model predicts a sample. In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts.
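
Perplexity is the exponential of the average negative log-likelihood; a tiny numpy sketch with made-up probabilities that a model assigned to the observed tokens:

    import numpy as np

    token_probs = np.array([0.10, 0.25, 0.05, 0.20])
    perplexity = np.exp(-np.mean(np.log(token_probs)))
    print(perplexity)   # lower is better; a uniform model over V words scores V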


Sequential - The Sequential model is a linear stack of layers. The common architecture of ConvNets is a sequential architecture. However, some architectures are not linear stacks. For example, siamese networks are two parallel neural networks with some shared layers. A CNN can be instantiated as a Sequential model because each layer has exactly one input and output and is stacked together to form the entire network.


Sigmoid activation - a wide variety of sigmoid functions, including the logistic and hyperbolic tangent functions, have been used as the activation function of artificial neurons. Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic density, the normal density, and Student's t probability density functions. The logistic sigmoid function is invertible, and its inverse is the logit function. Sigmoid is not well suited for deep neural networks because its mean output is not 0 but around 0.5, whereas in deep learning the inputs to each layer should ideally be centred around 0. The sigmoid activation also tends to push the initial layers into saturation during training; the layers move out of saturation gradually over the course of training, which is why we sometimes observe plateaus in the training loss over epochs.


Skipgrams - skip-gram embedding is a word embedding technique that relies on unsupervised learning and is used to predict the related context words of a given target word. Skip-gram is commonly used to train Word2Vec models.


Softmax activation - often used as the last activation function of a neural network to normalize the output of a network to a probability distribution.


SVG format - the dot graph produced by model-to-dot can be rendered as SVG, e.g. for display in a notebook.


Tanh activation - with a good initialisation, gradients are reasonably similar along all the layers (and within a decent scale, about 20x smaller than the weights). Z-values stay within a decent range (-1, 1) and are reasonably similar along all the layers (though some shrinkage is noticeable), and activations have not collapsed into a binary mode and are reasonably similar along all the layers.


TF Encoding - encoding implemented directly with TensorFlow-specific ops or layers (e.g. a one-hot encoding layer built inside the model).


TF-IDF - The goal of using tf-idf is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus. TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
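
A minimal scikit-learn sketch on a toy corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are pets"]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)            # sparse matrix: documents x vocabulary
    print(vectorizer.get_feature_names_out())
    print(tfidf.toarray().round(2))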


TSNE - t-distributed stochastic neighbour embedding is a statistical method for visualising high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. Being non-linear and neighbourhood-preserving, it usually separates clusters more clearly than PCA for visualisation purposes.
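
A minimal sketch mirroring the PCA one above, again with random stand-in word vectors (perplexity here is the t-SNE hyper-parameter, not the language-model metric defined earlier):

    import numpy as np
    from sklearn.manifold import TSNE

    vectors = np.random.default_rng(0).normal(size=(100, 50))
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    coords = tsne.fit_transform(vectors)
    print(coords.shape)   # (100, 2)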


Word2vec - treats each word as an atomic entity and generates one vector per word. Word2vec cannot provide good results for rare and out-of-vocabulary words.


Xavier/Glorot Initialization is used to maintain the same smooth distribution of activations and gradients during both the forward pass and backpropagation.



2. Shakespeare

Random tokenised words visualisation in a book



3. References


https://www.kaggle.com/code/zzaibis/nlp-for-disaster-tweets/notebook

https://overfitter.github.io/2020-08-29-Word-Embedding-Pipeline/

https://www.nltk.org/book/ch02.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

https://www.tensorflow.org/api_docs/python/tf/keras/utils/model_to_dot

https://keras.io/api/layers/initializers/

https://en.wikipedia.org/wiki/Affinity_propagation

https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39

https://keras.io/api/layers/

https://en.wikipedia.org/wiki/Euclidean_distance_matrix

https://medium.com/towards-data-science/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c

https://medium.com/towards-data-science/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa

https://thinkinfi.com/fasttext-word-embeddings-python-implementation/

https://github.com/google-research/bert

https://www.pythonpodcast.com/polyglot-with-rami-al-rfou-episode-190/

https://en.wikipedia.org/wiki/Word2vec

https://towardsdatascience.com/python-packages-for-nlp-part-1-2d49126749ef

http://www.cis.hut.fi/projects/natlang/

https://stackabuse.com/gpt-style-text-generation-in-python-with-tensorflowkeras/

https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/

https://www.thepythoncode.com/article/text-generation-keras-python

https://www.datasciencelearner.com/advanced-text-processing-using-nltk/

https://arxiv.org/pdf/2008.09470.pdf

https://www.vennify.ai/semantic-similarity-sentence-transformers/

https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8

https://www.sbert.net/examples/applications/clustering/README.html

https://medium.com/nlplanet/two-minutes-nlp-sentence-transformers-cheat-sheet-2e9865083e7a

https://pypi.org/project/semantic/

https://github.com/RaRe-Technologies/gensim

https://spacy.io

https://polyglot.readthedocs.io/en/latest/index.html

https://scikit-learn.org/stable/#

https://www.nltk.org

https://en.wikipedia.org/wiki/Perplexity

https://textblob.readthedocs.io/en/dev/

https://towardsdatascience.com/7-tips-to-choose-the-best-optimizer-47bb9c1219e

https://monkeylearn.com/blog/what-is-tf-idf/

https://pypi.org/project/fuzzywuzzy/

https://stanfordnlp.github.io/CoreNLP/

https://opennlp.apache.org

https://en.wikipedia.org/wiki/Softmax_function

https://www.educative.io/answers/what-is-skipgram-embedding

https://www.ibm.com/docs/en/app-connect/cloud?topic=apps-watson-tone-analyzer

https://cloud.google.com/natural-language

https://aws.amazon.com/comprehend/

https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404


