EmbeddingNormType
EmbeddingNormType is an enumeration class that provides different
methods for normalizing embedding vectors in Automata’s core
functionality. It is used when comparing and ranking symbol
embeddings by their similarity to a query.
Overview
EmbeddingNormType supports the L1 and L2 methods. When used in
an embedding similarity calculator instance, this setting determines how
the distance between the query embedding vector and each symbol embedding
is calculated.
L1: Calculates the L1 norm (Manhattan distance) between two vectors.
L2: Calculates the L2 norm (Euclidean distance) between two vectors.
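To make the two options concrete, here is a minimal sketch of both distances in plain Python. The function names `l1_distance` and `l2_distance` are illustrative assumptions and are not part of Automata; the sketch only shows the arithmetic behind each norm as described above.

```python
import math


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical three-dimensional vectors chosen for easy arithmetic.
query = [1.0, 2.0, 3.0]
symbol = [4.0, 6.0, 3.0]

print(l1_distance(query, symbol))  # 7.0  (3 + 4 + 0)
print(l2_distance(query, symbol))  # 5.0  (sqrt(9 + 16 + 0))
```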
EmbeddingNormType is supplied as an argument to the
EmbeddingSimilarityCalculator class during initialization,
determining the norm method used for all similarity calculations within
that instance.
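The per-instance behaviour described above can be pictured with a small stand-in. `ToyNormType` and `ToySimilarityCalculator` are hypothetical names invented for this sketch; they mimic the pattern of fixing the norm at construction time and are not Automata's actual implementation.

```python
import math
from enum import Enum


class ToyNormType(Enum):
    """Hypothetical stand-in for EmbeddingNormType."""

    L1 = "l1"
    L2 = "l2"


class ToySimilarityCalculator:
    """Hypothetical stand-in: the norm is chosen once, at construction."""

    def __init__(self, norm_type=ToyNormType.L2):
        self.norm_type = norm_type

    def distance(self, a, b):
        # Every call on this instance uses the norm selected in __init__.
        diffs = [x - y for x, y in zip(a, b)]
        if self.norm_type is ToyNormType.L1:
            return sum(abs(d) for d in diffs)
        if self.norm_type is ToyNormType.L2:
            return math.sqrt(sum(d * d for d in diffs))
        raise ValueError(f"unsupported norm type: {self.norm_type}")


l1_calc = ToySimilarityCalculator(norm_type=ToyNormType.L1)
l2_calc = ToySimilarityCalculator(norm_type=ToyNormType.L2)

print(l1_calc.distance([0.0, 0.0], [6.0, 8.0]))  # 14.0
print(l2_calc.distance([0.0, 0.0], [6.0, 8.0]))  # 10.0
```

Two instances built with different norm types give different distances for the same pair of vectors, which is why the choice has to be made deliberately when the calculator is created.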
Example
from automata.embedding.base import EmbeddingNormType, EmbeddingSimilarityCalculator
from automata.symbol_embedding.base import SymbolCodeEmbedding
from automata.core.base.database.vector import JSONSymbolEmbeddingVectorDatabase
# Assuming you have a set of symbol embeddings stored in a JSON database.
database_path = "path_to_your_database.json"
embedding_db = JSONSymbolEmbeddingVectorDatabase(database_path)
# Assume we have a mock embedding provider. MockEmbeddingProvider is a
# placeholder that is not defined or imported here; substitute your own provider.
mock_provider = MockEmbeddingProvider()
# Instantiate an EmbeddingSimilarityCalculator with L1 norm.
similarity_calculator = EmbeddingSimilarityCalculator(
    embedding_provider=mock_provider,
    norm_type=EmbeddingNormType.L1,
)
# Get ordered embeddings from database and compute similarity of a query_text
ordered_embeddings = embedding_db.get_ordered_embeddings()
query_text = 'def initialize(x, y):'
similarity_dict = similarity_calculator.calculate_query_similarity_dict(ordered_embeddings, query_text)
# The keys of the returned dictionary are the symbols and the values are the similarity scores.
most_similar_symbol = max(similarity_dict, key=similarity_dict.get)
print(f"Symbol most similar to the query is {most_similar_symbol}")
Limitations
EmbeddingNormType supports only the L1 and L2 norm methods. While
these methods cover typical use cases in calculating document
similarity, other distance and similarity measures, such as cosine
similarity or Hamming distance, could be useful in different contexts.
The user must also match the norm type to the nature of the embeddings used, since a given norm type may be unsuitable, or may not produce the desired results, for the type or characteristics of the embedding vectors.
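A small illustration of why the choice matters: the same query and candidates can rank differently under the two norms. The vectors and the helper `closest` below are hypothetical and chosen only to make the disagreement visible; they are not taken from Automata.

```python
import math


def l1(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest(query, candidates, distance):
    """Return the name of the candidate with the smallest distance to query."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


query = [0.0, 0.0]
candidates = {
    "diagonal": [3.0, 3.0],  # L1 = 6.0, L2 is about 4.24
    "axis": [0.0, 5.0],      # L1 = 5.0, L2 = 5.0
}

print(closest(query, candidates, l1))  # axis
print(closest(query, candidates, l2))  # diagonal
```

L1 penalizes the offset spread across both coordinates more heavily, so it prefers the "axis" vector, while L2 prefers "diagonal".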
Follow-up Questions:
How could other norm types be added to EmbeddingNormType?
Could the norm type be dynamically set or changed for running instances of
EmbeddingSimilarityCalculator?