SymbolEmbeddingHandler

Overview

SymbolEmbeddingHandler is an abstract base class for handling symbol embeddings. It is designed to filter, fetch, and process the symbol embeddings needed to build and retrieve vector representations of symbols. A Symbol represents a distinct piece of logic in a software application, such as a Python class, method, or local variable, and is identified by a unique URI.

Through its derived classes, SymbolEmbeddingHandler handles the embedding of symbol documents and symbol code, using a vector database to store and retrieve these embeddings.

Example

Because SymbolEmbeddingHandler is an abstract base class, it cannot be instantiated directly. Instead, subclass it and implement its abstract methods, as the following example demonstrates:

from automata.symbol.base import Symbol
from automata.core.base.database.vector import VectorDatabaseProvider
from automata.embedding.base import EmbeddingBuilder
from automata.symbol_embedding.handler import SymbolEmbeddingHandler

class MySymbolEmbeddingHandler(SymbolEmbeddingHandler):
    def process_embedding(self, symbol: Symbol):
        # add your embedding process here
        pass

# create instances of necessary classes
symbol = Symbol.from_string("scip-python python automata 75482692a6fe30c72db516201a6f47d9fb4af065 `automata.tools.base`/ToolNotFoundError#__init__().")
vector_database = VectorDatabaseProvider()  # replace with a concrete implementation of VectorDatabaseProvider
embedding_builder = EmbeddingBuilder()  # replace with a concrete implementation of EmbeddingBuilder

handler = MySymbolEmbeddingHandler(vector_database, embedding_builder)
handler.process_embedding(symbol)  # this will call your own process_embedding implementation

Limitations

The primary limitation stems from the class being abstract: it provides no concrete implementation, so it must be subclassed and its abstract methods implemented before use. It also relies on ordered entries in the embedding database, so the database must support ordering.
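
The abstract-class constraint can be illustrated with a minimal, self-contained sketch. The names SketchHandler and ConcreteSketchHandler are hypothetical stand-ins and not part of automata; they only mirror the pattern of an abstract base with a single abstract method, and symbols are reduced to plain strings for brevity.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchHandler(ABC):
    """Hypothetical stand-in for an abstract embedding handler."""

    @abstractmethod
    def process_embedding(self, symbol: str) -> None:
        """Subclasses decide how a symbol is embedded."""


class ConcreteSketchHandler(SketchHandler):
    """Hypothetical subclass that records the symbols it has processed."""

    def __init__(self) -> None:
        self.processed: List[str] = []

    def process_embedding(self, symbol: str) -> None:
        self.processed.append(symbol)


def try_instantiate_base() -> bool:
    """Return True if the abstract base can be instantiated, False otherwise."""
    try:
        SketchHandler()  # Python raises TypeError for unimplemented abstract methods
    except TypeError:
        return False
    return True
```

Calling try_instantiate_base() shows that the base class is rejected at construction time, while ConcreteSketchHandler() can be created and used normally.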

The class further assumes that every symbol has a dotpath representation that can be used to fetch its embedding from the database. If a symbol has no dotpath representation, or the database has no corresponding entry, the correct embedding will not be returned.
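
The dotpath assumption can be sketched with an in-memory store. This is an illustration only: InMemoryEmbeddingStore is a hypothetical class, not the automata vector database API, and returning None for a missing entry is an assumed way of making the failure explicit rather than the library's documented behavior.

```python
from typing import Dict, List, Optional


class InMemoryEmbeddingStore:
    """Hypothetical stand-in for a vector database keyed by dotpath."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, dotpath: str, vector: List[float]) -> None:
        """Store a vector under the given dotpath."""
        self._vectors[dotpath] = vector

    def get(self, dotpath: Optional[str]) -> Optional[List[float]]:
        """Return the stored vector, or None when no lookup is possible."""
        # A symbol without a dotpath cannot be looked up at all.
        if dotpath is None:
            return None
        # A dotpath with no stored entry also yields None.
        return self._vectors.get(dotpath)


store = InMemoryEmbeddingStore()
store.add("automata.tools.base.ToolNotFoundError", [0.1, 0.2, 0.3])

found = store.get("automata.tools.base.ToolNotFoundError")
missing = store.get("automata.tools.base.DoesNotExist")  # no entry stored
no_dotpath = store.get(None)  # symbol lacks a dotpath
```

In this sketch a caller has to check for None before using the result, which covers both failure cases described above.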

Follow-up Questions:

  • What needs to be done to accommodate symbols that do not have a dotpath representation, or to populate the database with symbols that lack corresponding entries?

  • How robust is the symbol filtering mechanism? Could there be performance or accuracy issues when handling extensive or complex symbol batches?

  • Is it possible to handle symbols and their embeddings that don’t conform to the typical structure, such as those with multiple or nested dotpaths?

  • What structure and format does the base EmbeddingHandler expect of the vector representations used for embeddings?