ToolEvaluationMetrics
ToolEvaluationMetrics is a class developed to evaluate, quantify,
and provide detailed metrics accumulated from a sequence of Tool
Evaluation Results. It offers a comprehensive overview of evaluation
results from tool testing, gathered in one handy reference data object.
Overview
ToolEvaluationMetrics class is initialized using a list of tool
evaluation results. It then provides various metrics such as the total
count of evaluations, total full matches, total partial matches and
their respective rates. The interpretive metrics available aid in
understanding, to a fine degree, the performance and effectiveness of
the tools being evaluated.
Example
Following is a simple example demonstrating the usage of
ToolEvaluationMetrics.
from automata.eval.tool.tool_eval_metrics import ToolEvaluationMetrics
from yourmodule import YourToolEvalResults # Replace with actual module and class
evaluation_results = [YourToolEvalResults()] # List of evaluation results
metrics = ToolEvaluationMetrics(evaluation_results)
print(f'Total Evaluations Conducted: {metrics.total_evaluations}')
print(f'Total Full Matches Observed: {metrics.total_full_matches}')
print(f'Total Partial Matches Observed: {metrics.total_partial_matches}')
print(f'Full Match Rate: {metrics.full_match_rate}')
print(f'Partial Match Rate: {metrics.partial_match_rate}')
Note: Replace ``YourToolEvalResults`` with the actual class that contains results of tool evaluations.
Limitations
ToolEvaluationMetrics is dependent on the result objects that have
to contain is_full_match and is_partial_match properties to
compute the full and partial matches. If a tool’s evaluation results do
not include these properties, the ToolEvaluationMetrics cannot compute
these metrics.
The above example assumes that you’ve replaced YourToolEvalResults
with the proper class of your tool evaluation results.
Follow-up Questions:
Is there a way to handle the computation of metrics even when the
is_full_matchandis_partial_matchare not present within the tool’s evaluation results?How are
is_full_matchandis_partial_matchbeing decided within the tool’s evaluation results?