The purpose of this contribution is to try various pretrained semantic models and compare cosine similarity scores among sentences. As an example, we compare various definitions of the term "ontology".
1. Theory
cosine similarity score - the higher the similarity, the lower the distance between the embeddings. When picking a similarity threshold for texts/documents, a value above 0.5 usually indicates strong similarity.
embedding - an instance of one mathematical structure contained within another instance; here, a sentence represented as a fixed-length numeric vector.
pytorch_cos_sim - cosine similarity measures the similarity between two vectors of an inner product space. Smaller angles between vectors produce larger cosine values, indicating greater similarity. For example, when two vectors have the same orientation, the angle between them is 0 and the cosine similarity is 1; perpendicular vectors have a 90-degree angle between them and a cosine similarity of 0 (see the sketch at the end of this section).
semantic models - the all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models. The following models have been specifically trained for semantic search: Multi-QA Models, MSMARCO, Multi-Lingual Models, ..
SentenceTransformer - SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.
stsb-roberta-large - a pretrained sentence-embedding model, used in the code below. Other examples are all-MiniLM-L6-v2, stsb-mpnet-base-v2, stsb-roberta-base-v2, nli-roberta-base-v2, stsb-distilbert-base, nli-bert-large, average_word_embeddings_glove.6B.300d, average_word_embeddings_levy_dependency, …
tensor - algebraic object that describes a multilinear relationship between sets of algebraic objects related to a vector space.
transformer - A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data.
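To make the cosine-similarity definition above concrete, here is a minimal sketch in plain PyTorch (no SentenceTransformers needed) that reproduces the two boundary cases: same orientation gives 1, perpendicular vectors give 0.
import torch

def cos_sim(a, b):
    # cos(angle) = dot(a, b) / (|a| * |b|)
    return float(torch.dot(a, b) / (torch.norm(a) * torch.norm(b)))

a = torch.tensor([1.0, 0.0])
b = torch.tensor([2.0, 0.0])  # same orientation as a
c = torch.tensor([0.0, 3.0])  # perpendicular to a
print(cos_sim(a, b))  # 1.0
print(cos_sim(a, c))  # 0.0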
2. Python code
# 2.1. Semantic similarity between two sentences
from sentence_transformers import SentenceTransformer, util
# load one of the pretrained models from the Theory section (stsb-roberta-large is assumed here)
model = SentenceTransformer("stsb-roberta-large")
sentence1 = "An explicit specification of a conceptualization."
sentence2 = "A systematic account of existence."
# encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)
# cosine similarity score of the two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_scores.item())
Sentence 1: An explicit specification of a conceptualization.
Sentence 2: A systematic account of existence.
Similarity score: 0.4527572989463806
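The score 0.4528 falls just below the 0.5 threshold mentioned in the Theory section, so these two definitions are related but not strongly similar. Since the aim of this contribution is to compare pretrained models, the same pair can also be scored under several of the models listed above. A minimal sketch (model names taken from the Theory section; the scores will vary by model):
from sentence_transformers import SentenceTransformer, util

sentence1 = "An explicit specification of a conceptualization."
sentence2 = "A systematic account of existence."
for name in ["stsb-roberta-large", "all-MiniLM-L6-v2", "stsb-distilbert-base"]:
    m = SentenceTransformer(name)
    e1 = m.encode(sentence1, convert_to_tensor=True)
    e2 = m.encode(sentence2, convert_to_tensor=True)
    print(name, "%.4f" % util.pytorch_cos_sim(e1, e2).item())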
# 2.2. Semantic similarity between two lists of sentences
sentences1 = ["An explicit specification of a conceptualization.", "Set of representational primitives with which to model a domain of knowledge or discourse."]
sentences2 = ["A formal naming and definition of the types, properties, and interrelationships of the entities.", "A systematic account of existence."]
# encode list of sentences to get their embeddings
embedding1 = model.encode(sentences1, convert_to_tensor=True)
embedding2 = model.encode(sentences2, convert_to_tensor=True)
# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
# print pairwise scores; cosine_scores is a 2x2 tensor covering all sentence pairs
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item())
        print()
Sentence 1: An explicit specification of a conceptualization.
Sentence 2: A formal naming and definition of the types, properties, and interrelationships of the entities.
Similarity Score: 0.5923953652381897
Sentence 1: An explicit specification of a conceptualization.
Sentence 2: A systematic account of existence.
Similarity Score: 0.4527572989463806
Sentence 1: Set of representational primitives with which to model a domain of knowledge or discourse.
Sentence 2: A formal naming and definition of the types, properties, and interrelationships of the entities.
Similarity Score: 0.4082657992839813
Sentence 1: Set of representational primitives with which to model a domain of knowledge or discourse.
Sentence 2: A systematic account of existence.
Similarity Score: 0.2979518175125122
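Note that only the first pair ("An explicit specification of a conceptualization." vs. "A formal naming and definition of the types, properties, and interrelationships of the entities.") exceeds the 0.5 threshold from the Theory section; the remaining pairs score below it.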
# 2.3. Top K most similar sentences from a corpus given a sentence
import numpy as np
corpus = ["A formal naming and definition of the types, properties, and interrelationships of the entities.",
"A systematic account of existence.",
"A specification of a conceptualization.",
"It studies concepts such as existence, being, becoming and reality.",
"It represents a domain of discourse as a common ground for encoding content meaning and user interests.",
"The branch of philosophy that deals with the nature of existence.",
"Set of concepts (terms) and the relationships among them as representing the consensual knowledge of a specific domain.",
"A “formal specification of a shared conceptualization”." ,
"Set of representational primitives with which to model a domain of knowledge or discourse."
]
# encode corpus to get corpus embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
sentence = "An account of existence."
# encode sentence to get sentence embeddings
sentence_embedding = model.encode(sentence, convert_to_tensor=True)
# number of top results to return
top_k = 2
# compute similarity scores of the sentence with the corpus
cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0].cpu()
# sort the scores in decreasing order and take the indices of the first top_k
top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]
print("Sentence:", sentence, "\n")
print("Top", top_k, "most similar sentences in corpus:")
for idx in top_results[0:top_k]:
    print(corpus[idx], "(Score: %.4f)" % (cos_scores[idx]))
Sentence: An account of existence.
Top 2 most similar sentences in corpus:
A systematic account of existence. (Score: 0.7924)
The branch of philosophy that deals with the nature of existence. (Score: 0.5599)
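SentenceTransformers also provides the util.semantic_search helper, which performs the same top-k retrieval without manual sorting. A minimal sketch reusing the embeddings computed above (the helper returns, per query, a list of dicts with "corpus_id" and "score" keys):
hits = util.semantic_search(sentence_embedding, corpus_embeddings, top_k=top_k)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], "(Score: %.4f)" % hit["score"])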
3. References
https://www.sbert.net/examples/applications/semantic-search/README.html
https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0
https://en.wikipedia.org/wiki/Semantic_similarity
https://medium.com/s/story/blockchain-is-a-semantic-wasteland-9450b6e5012
https://medium.com/towards-data-science/understanding-semantic-segmentation-with-unet-6be4f42d4b47
https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
https://www.sbert.net
https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6
https://arxiv.org/abs/1907.11692
https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/