
Python - Semantic similarity - Definition of Ontology

The purpose of this contribution is to try various pretrained semantic models and compare cosine similarity scores among sentences. As an example, we compare various definitions of the term "ontology".


1. Theory


cosine similarity score - The higher the similarity, the smaller the distance. When picking a similarity threshold for texts/documents, a value above 0.5 usually indicates strong similarity.


embedding - an instance of one mathematical structure contained within another instance; in NLP, a dense vector representation of a word, sentence, or document.


pytorch_cos_sim - Cosine similarity measures the similarity between two vectors of an inner product space. Smaller angles between vectors produce larger cosine values, indicating greater cosine similarity. For example: When two vectors have the same orientation, the angle between them is 0, and the cosine similarity is 1. Perpendicular vectors have a 90-degree angle between them and a cosine similarity of 0.
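As a minimal illustration of the formula (plain PyTorch, no pretrained model involved; the helper name cosine_similarity is ours):

import torch

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

a = torch.tensor([1.0, 0.0])
print(cosine_similarity(a, torch.tensor([1.0, 0.0])).item())  # same orientation -> 1.0
print(cosine_similarity(a, torch.tensor([0.0, 1.0])).item())  # perpendicular -> 0.0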


semantic models - The all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models. The following models have been specifically trained for semantic search: Multi-QA models, MSMARCO models, multilingual models, …


SentenceTransformer - SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.
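The framework is installed from PyPI as sentence-transformers; a minimal usage sketch (the model name and the printed shape depend on the chosen model):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained model name from the list below
embedding = model.encode("An explicit specification of a conceptualization.")
print(embedding.shape)  # (384,) for this particular model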


stsb-roberta-large - pretrained model for sentence transformations. Other examples are all-MiniLM-L6-v2, stsb-mpnet-base-v2, stsb-roberta-base-v2, nli-roberta-base-v2, stsb-distilbert-base, nli-bert-large, average_word_embeddings_glove.6B.300d, average_word_embeddings_levy_dependency, …
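Since the goal of this post is to compare models, the same sentence pair can be scored by several of them with a sketch like the following (an illustration only; each model is downloaded on first use, and scores are not directly comparable across models):

from sentence_transformers import SentenceTransformer, util

sentences = ["An explicit specification of a conceptualization.", "A systematic account of existence."]

for name in ["stsb-roberta-large", "all-MiniLM-L6-v2", "stsb-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    e1, e2 = model.encode(sentences, convert_to_tensor=True)
    print(name, util.pytorch_cos_sim(e1, e2).item())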


tensor - algebraic object that describes a multilinear relationship between sets of algebraic objects related to a vector space.


transformer - A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data.
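As a toy illustration of self-attention (omitting the learned query/key/value projections and the multiple heads of a real transformer), each position is re-expressed as a similarity-weighted combination of all positions:

import torch

def toy_self_attention(x):
    # x: (sequence_length, model_dim)
    scores = x @ x.T / x.shape[-1] ** 0.5    # scaled dot-product scores
    weights = torch.softmax(scores, dim=-1)  # how much each position attends to every other
    return weights @ x                       # weighted combination of the inputs

x = torch.randn(4, 8)  # a toy sequence of 4 tokens with 8-dimensional embeddings
print(toy_self_attention(x).shape)  # torch.Size([4, 8])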


2. Python code


# 2.1. Semantic similarity between two sentences


from sentence_transformers import SentenceTransformer, util

# load a pretrained model (stsb-roberta-large, introduced above)
model = SentenceTransformer("stsb-roberta-large")

sentence1 = "An explicit specification of a conceptualization."

sentence2 = "A systematic account of existence."


# encode sentences to get their embeddings

embedding1 = model.encode(sentence1, convert_to_tensor=True)

embedding2 = model.encode(sentence2, convert_to_tensor=True)


# cosine similarity score of the two embeddings

cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

print("Sentence 1:", sentence1)

print("Sentence 2:", sentence2)

print("Similarity score:", cosine_scores.item())



Sentence 1: An explicit specification of a conceptualization.

Sentence 2: A systematic account of existence.

Similarity score: 0.4527572989463806



# 2.2. Semantic similarity between two lists of sentences


sentences1 = ["An explicit specification of a conceptualization.", "Set of representational primitives with which to model a domain of knowledge or discourse."]

sentences2 = ["A formal naming and definition of the types, properties, and interrelationships of the entities.", "A systematic account of existence."]


# encode list of sentences to get their embeddings

embedding1 = model.encode(sentences1, convert_to_tensor=True)

embedding2 = model.encode(sentences2, convert_to_tensor=True)


# compute similarity scores of two embeddings

cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item())
        print()


Sentence 1: An explicit specification of a conceptualization.

Sentence 2: A formal naming and definition of the types, properties, and interrelationships of the entities.

Similarity Score: 0.5923953652381897


Sentence 1: An explicit specification of a conceptualization.

Sentence 2: A systematic account of existence.

Similarity Score: 0.4527572989463806


Sentence 1: Set of representational primitives with which to model a domain of knowledge or discourse.

Sentence 2: A formal naming and definition of the types, properties, and interrelationships of the entities.

Similarity Score: 0.4082657992839813


Sentence 1: Set of representational primitives with which to model a domain of knowledge or discourse.

Sentence 2: A systematic account of existence.

Similarity Score: 0.2979518175125122


# 2.3. Top K most similar sentences from a corpus given a sentence


corpus = ["A formal naming and definition of the types, properties, and interrelationships of the entities.",

"A systematic account of existence.",

"A specification of a conceptualization.",

"It studies concepts such as existence, being, becoming and reality.",

"It represents a domain of discourse as a common ground for encoding content meaning and user interests.",

"The branch of philosophy that deals with the nature of existence.",

"Set of concepts (terms) and the relationships among them as representing the consensual knowledge of a specific domain.",

"A “formal specification of a shared conceptualization”." ,

"Set of representational primitives with which to model a domain of knowledge or discourse."

]


# encode corpus to get corpus embeddings

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

sentence = "An account of existence."


# encode sentence to get sentence embeddings

sentence_embedding = model.encode(sentence, convert_to_tensor=True)


# top_k results to return

top_k=2


# compute similarity scores of the sentence with the corpus

cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0]


# sort the results in decreasing order and get the first top_k
import numpy as np

cos_scores = cos_scores.cpu()  # move the scores to the CPU so numpy can operate on them
top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

print("Sentence:", sentence, "\n")

print("Top", top_k, "most similar sentences in corpus:")

for idx in top_results:
    print(corpus[idx], "(Score: %.4f)" % (cos_scores[idx]))


Sentence: An account of existence.


Top 2 most similar sentences in corpus:

A systematic account of existence. (Score: 0.7924)

The branch of philosophy that deals with the nature of existence. (Score: 0.5599)
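Note that SentenceTransformers also ships a helper for exactly this pattern; a sketch reusing sentence_embedding, corpus_embeddings and top_k from above:

# util.semantic_search returns, for each query, a list of dicts with corpus_id and score
hits = util.semantic_search(sentence_embedding, corpus_embeddings, top_k=top_k)

for hit in hits[0]:
    print(corpus[hit["corpus_id"]], "(Score: %.4f)" % hit["score"])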




3. References


https://www.sbert.net/examples/applications/semantic-search/README.html

https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0

https://en.wikipedia.org/wiki/Semantic_similarity

https://medium.com/s/story/blockchain-is-a-semantic-wasteland-9450b6e5012

https://medium.com/towards-data-science/understanding-semantic-segmentation-with-unet-6be4f42d4b47

https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)

https://www.sbert.net

https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6

https://arxiv.org/abs/1907.11692

https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/


