EmbeddingSimilarityCalculator
EmbeddingSimilarityCalculator is a class in the
automata.embedding.base module. It takes an instance of
EmbeddingVectorProvider and calculates the similarity score between
a query text and symbol embeddings.
Overview
EmbeddingSimilarityCalculator leverages embeddings representation to
quantify the similarity between code symbols and a given query text. It
uses the dot product of the query embedding and the symbol embeddings.
If required, the resulting similarity scores can be sorted in descending
order by default.
Every instance of EmbeddingSimilarityCalculator is initialized with
an EmbeddingVectorProvider and a type of norm for vector
normalization (EmbeddingNormType). Initially, it sets these
parameters with the corresponding values.
The main method in this class is calculate_query_similarity_dict.
This method retrieves the embedding for a provided query text,
calculates the similarity scores with the existing symbol embeddings,
constructs a dictionary with these scores indexed by the symbols and
optionally sorts the dictionary.
Example:
In this example, EmbeddingSimilarityCalculator is utilized to find
the symbol most similar to a given query text:
from automata.embedding.base import EmbeddingSimilarityCalculator, EmbeddingVectorProvider
from automata.symbol.base import Symbol
from numpy import array
# Create an instance of the class
mock_provider = EmbeddingVectorProvider()
embedding_sim_calc = EmbeddingSimilarityCalculator(mock_provider)
# Define query_text, and embeddings
query_text = "symbol1"
ordered_embeddings = [Embedding(Symbol('symbol1'), 'symbol1', array([1,0,0,0])),
Embedding(Symbol('symbol2'), 'symbol2', array([0,1,0,0])),
Embedding(Symbol('symbol3'), 'symbol3', array([0,0,1,0]))]
# Use the calculate_query_similarity_dict method
result = embedding_sim_calc.calculate_query_similarity_dict(ordered_embeddings, query_text)
print(result)
Note: In real scenario EmbeddingVectorProvider would be an
instance of class that provides actual embeddings like
OpenAIEmbedding.
Limitations
The accuracy of EmbeddingSimilarityCalculator heavily depends on the
quality of the embeddings produced by EmbeddingVectorProvider. Poor
embeddings can result in inaccurate similarity scores. Additionally, it
does not inherently handle cases where symbols might have the same
embedding values.
Follow-up Questions:
If two symbols end up having the same embedding, how does the
EmbeddingSimilarityCalculatordifferentiate between them?How are the results affected if a different norm type (
EmbeddingNormType) is used?