ToolEval
ToolEval is an abstract class for evaluating the performance of tools
by generating expected and observed results and comparing them. It has
several methods you can override to customize the evaluation process.
To generate an eval result, it requires the expected output, the tool
executor, and the function call.
Overview
ToolEval is a core part of the evaluation system in Automata. It
provides the structure for evaluating how well a tool performs its
task. Subclasses must implement the extract_action and to_tool_result
methods. Through these two methods you define the specific evaluation
behaviour: how operations are translated into actions, and how
equivalence between expected and observed actions is determined.
Example
Please note that ToolEval is an abstract base class and cannot be
instantiated directly. The following is an example demonstrating how to
create an implementation of ToolEval.
from typing import List, Optional, Tuple

from automata.eval.eval_base import Action, EvalResult
from automata.eval.tool.tool_eval import ToolEval


class CustomToolEval(ToolEval):
    def extract_action(self, input_action_tuple: Tuple) -> Action:
        # Custom implementation of action extraction
        pass

    def to_tool_result(self, expected_action: Action, observed_action: Optional[Action]) -> EvalResult:
        # Custom method of evaluating tool results
        pass

    def _filter_actions(self, actions: List[Action]) -> List[Action]:
        # Custom implementation to filter actions if necessary
        pass
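The overview mentions translating operations into actions and deciding whether an expected and an observed action are equivalent. The following is a minimal, self-contained sketch of one way that logic could look. It does not use the Automata classes: SketchAction and SketchResult are hypothetical stand-ins for Action and EvalResult, and the (tool_name, kwargs) tuple layout is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class SketchAction:
    """Hypothetical stand-in for Action: a tool name plus its arguments."""

    tool_name: str
    arguments: Tuple[Tuple[str, Any], ...]


@dataclass
class SketchResult:
    """Hypothetical stand-in for EvalResult."""

    expected: SketchAction
    observed: Optional[SketchAction]
    is_match: bool


def extract_action(input_action_tuple: Tuple[str, Dict[str, Any]]) -> SketchAction:
    """Turn an assumed (tool_name, kwargs) tuple into a comparable action."""
    tool_name, kwargs = input_action_tuple
    # Sort the arguments by key so keyword order does not affect equality.
    return SketchAction(tool_name, tuple(sorted(kwargs.items())))


def to_tool_result(
    expected: SketchAction, observed: Optional[SketchAction]
) -> SketchResult:
    """Count the actions as equivalent only if name and arguments both match."""
    is_match = observed is not None and expected == observed
    return SketchResult(expected, observed, is_match)


expected = extract_action(("search", {"query": "ToolEval", "limit": 5}))
observed = extract_action(("search", {"limit": 5, "query": "ToolEval"}))
print(to_tool_result(expected, observed).is_match)  # True
print(to_tool_result(expected, None).is_match)  # False
```

Strict equality is the simplest equivalence rule. A real subclass might instead compare only the tool name, or normalise argument values before comparing, depending on what the evaluation is meant to measure.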
Limitations
Because ToolEval is an abstract base class, its limitations depend
largely on the implementing subclass. However, it’s worth noting that
it does not inherently include any failure recovery or retry
mechanisms. If these are necessary for your use case, you should include
them in your implementation.
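If retries are needed, one option is to wrap the call that may fail in a small helper. The sketch below is generic Python, not part of Automata: run_with_retries and flaky_tool_call are hypothetical names, and the sketch assumes the failing operation can be expressed as a zero-argument callable that raises on failure.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Invoke `call` until it succeeds or `max_attempts` is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on as error:
            last_error = error
            # Wait between attempts, but not after the final one.
            if attempt < max_attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    assert last_error is not None
    raise last_error


# Hypothetical tool call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_tool_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(run_with_retries(flaky_tool_call))  # ok
```

Once the attempts are exhausted the helper re-raises the last error, so a persistent failure still surfaces to the caller and can be recorded as a failed evaluation rather than being silently swallowed.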
Follow-up Questions:
What are some common strategies for implementing extract_action and to_tool_result?
How can we handle cases where the tool execution fails?
How can this be used in conjunction with other parts of the Automata project?
Is there a method to easily integrate this with existing task environments or tool executors?