Introduction
Generative AI agents are changing the landscape of how businesses interact with their users and customers. From personalised travel search experiences to virtual assistants that simplify troubleshooting, these intelligent systems help companies deliver faster, smarter, and more engaging interactions. Whether it’s Alaska Airlines reimagining customer bookings or ScottsMiracle-Gro offering tailored gardening advice, AI agents have become essential.
However, deploying these agents in dynamic environments brings its own set of challenges. Frequent updates to models, prompts, and tools can unexpectedly disrupt how these agents operate. In this blog post, we’ll explore how businesses can navigate these challenges to ensure their AI agents remain reliable and effective.
What is this blog post about?
This post focuses on a practical framework for one of the most crucial tasks for getting GenAI agents into production: ensuring they can select tools effectively. Tool selection is at the heart of how generative AI agents perform tasks, whether retrieving weather data, translating text, or handling error cases gracefully.
We’ll introduce a testing framework designed specifically for evaluating GenAI agents’ tool selection capabilities. This framework includes datasets for various scenarios, robust evaluation methods, and compatibility with leading models like Gemini and OpenAI. By exploring this approach, we will gain actionable insights into how to test, refine, and confidently deploy GenAI agents in dynamic production environments.
All the code for this framework can be found on GitHub!
Why should we care?
In production, even the most advanced GenAI agents are only as good as their ability to pick and use the right tools for the task at hand. If an agent fails to call the correct API for weather information or mishandles an unsupported request, it can undermine user trust and disrupt business operations.
Tool selection is central to an agent’s functionality, but it’s also highly vulnerable to changes in model updates or prompts. Without rigorous testing, even minor tweaks can introduce regressions, causing agents to fail in unpredictable ways.
That is why a structured testing framework is critical. It allows businesses to detect issues early, validate changes systematically, and ensure that their agents remain reliable, adaptable, and robust – no matter how the underlying components evolve. For companies looking to deploy AI agents at scale, investing in such a framework is essential for long-term success.
What are GenAI agents?
GenAI agents are systems powered by large language models (LLMs) that can perform actions – not just generate text. They process natural language inputs to understand user intentions and interact with external tools, APIs, or databases to accomplish specific tasks. Unlike traditional AI systems with predefined rules, GenAI agents dynamically adapt to new contexts and user needs.
At their core, these agents combine natural language understanding with functional execution. This makes them highly versatile, whether they’re responding with a direct answer, requesting clarification, or calling an external service to complete a task.
Real-world use cases
GenAI agents are already transforming industries, proving their value across a wide range of applications. Here are some examples, taken directly from Google’s blog post:
- Customer Support: Alaska Airlines is developing natural language search, providing travelers with a conversational experience powered by AI that’s akin to interacting with a knowledgeable travel agent. This chatbot aims to streamline travel booking, enhance customer experience, and reinforce brand identity.
- Automotive Assistance: Volkswagen of America built a virtual assistant in the myVW app, where drivers can explore their owners’ manuals and ask questions, such as, "How do I change a flat tire?" or "What does this digital cockpit indicator light mean?" Users can also use Gemini’s multimodal capabilities to see helpful information and context on indicator lights simply by pointing their smartphone cameras at the dashboard.
- E-commerce: ScottsMiracle-Gro built an AI agent on Vertex AI to provide tailored gardening advice and product recommendations for consumers.
- Healthcare: HCA Healthcare is testing Cati, a virtual AI caregiver assistant that helps to ensure continuity of care when one caregiver shift ends and another begins. They are also using gen AI to improve workflows on time-consuming tasks, such as clinical documentation, so physicians and nurses can focus more on patient care.
- Banking: ING Bank aims to offer a superior customer experience and has developed a gen-AI chatbot for workers to enhance self-service capabilities and improve answer quality on customer queries.
These examples show how GenAI agents are becoming central to improving productivity, automating workflows, and delivering highly personalised user experiences across industries. They are no longer just supporting systems – they're active participants in business operations.
How GenAI agents work
GenAI agents operate by combining natural language understanding with task execution, enabling them to perform a variety of actions based on user queries. When a user inputs a request, the agent determines the intent behind it and decides on the appropriate course of action. This may involve directly responding using its internal knowledge, asking for clarification if key details are missing, or taking an action via an external tool.
The agent’s workflow is dynamic and highly context-aware. For instance, if the query requires accessing real-time data or performing a calculation, the agent will usually integrate with external tools or APIs. If a request is ambiguous, like "Book me a table," it may prompt the user to specify details like the restaurant or time before proceeding.
Once the agent has figured out how to act, it either generates a natural language response or prepares inputs for tool execution. After completing the task, the agent processes the results to deliver an output that’s clear, actionable, and aligned with the user’s intent.
This entire flow, from understanding user intent to executing tasks, makes GenAI agents capable of handling complex, multi-step interactions in a natural and user-friendly manner.
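To make this flow concrete, here is a minimal sketch of that decision loop. Everything in it – the Decision container and the llm.decide/llm.summarise calls – is hypothetical scaffolding for illustration, not an API from any particular library:

from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    # Hypothetical container for the LLM's routing decision
    clarifying_question: Optional[str] = None
    direct_answer: Optional[str] = None
    tool_name: Optional[str] = None
    tool_arguments: Optional[dict] = None


def handle_request(user_query: str, llm, tools: dict) -> str:
    decision: Decision = llm.decide(user_query)  # hypothetical LLM call
    if decision.clarifying_question:
        # Key details are missing, e.g. "Book me a table" without a time
        return decision.clarifying_question
    if decision.direct_answer:
        # The model can answer from its internal knowledge, no tool needed
        return decision.direct_answer
    # Otherwise execute the selected tool and turn its structured output
    # into a natural language response for the user
    result = tools[decision.tool_name](**decision.tool_arguments)
    return llm.summarise(user_query, result)  # hypothetical LLM call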
Tool selection for GenAI agents
Tool selection is one of the most critical capabilities for GenAI agents, enabling them to bridge user inputs with external functions to perform tasks effectively. The process involves identifying the most suitable tool based on the query’s intent and the agent’s repository of tools. For instance, a request like "Translate this text into French" prompts the agent to select a translation tool, while "Set a reminder for tomorrow at 3 PM" would call a calendar tool.
Once the tool is selected, the agent extracts the relevant parameters from the query and formats them according to the tool’s specifications. For example, in a weather-related query like "What’s the weather in Tokyo tomorrow?", the agent identifies "Tokyo" as the location and "tomorrow" as the date, then structures these inputs for the weather API. After invoking the tool, the agent processes the response to ensure it meets the user’s expectations. Structured data like JSON is transformed into natural language, and errors such as invalid inputs or unavailable data are communicated to the user, often with suggestions for refinement.
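For the Tokyo query, the structured input the agent prepares might look like the following – a hypothetical payload, since the exact shape depends on the model provider, and this get_weather variant takes a date as well as a location:

# Hypothetical structured call for "What's the weather in Tokyo tomorrow?";
# the exact payload shape depends on the model provider.
function_call = {
    "name": "get_weather",
    "arguments": {
        "location": "Tokyo",
        "date": "tomorrow",
    },
}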
By dynamically selecting the right tool and handling outputs precisely, the agent ensures it can execute tasks accurately and efficiently. This capability is foundational to its ability to deliver seamless user experiences.
The importance of tool selection
Tool selection is what differentiates GenAI agents from simple conversational chatbots and allows us to build powerful, action-oriented systems. Understanding user queries and generating responses is essential, but identifying and utilising the correct tools ensures agents can take action in real-world scenarios. Missteps, such as choosing the wrong tool or poorly formatting inputs, can frustrate users and make them lose trust in the agent’s capabilities. To be truly effective, tool selection must be robust, precise, and adaptable. It’s this mechanism that ensures GenAI agents are not only responsive but genuinely capable of accomplishing tasks in dynamic environments.
Continuous Testing to ensure agent reliability
Production environments are always changing, which makes reliability one of the biggest challenges for GenAI agents. Model updates, prompt adjustments, or changes to the tool catalogue can cause workflows to fail, producing incorrect outcomes and undermining user trust.
Continuous testing is what ensures agents remain reliable and functional, even as these changes happen. It systematically checks core functionalities, like tool selection, to catch issues early. For instance, a workflow involving a weather tool might stop working after a model update or after tweaking the system prompt. Automated testing can identify such failures before they affect users, giving teams the chance to address them quickly. This way, agents can continue delivering good results without interruption.
In addition, real-world scenarios change too, and continuous testing helps developers to adapt their agentic system to new tasks and use cases. By using and amending datasets that represent realistic situations, teams can ensure the agent performs well across a range of user needs. Automated pipelines make this process scalable and consistent, integrating directly into development workflows. This approach allows teams to keep improving and expanding their agents without sacrificing reliability.
Tool Selection Testing Framework for GenAI Agents
Enough theory, let's dive into the code! 🧑‍💻
To address the need for continuous testing of tool selection for GenAI agents, we will have a look at the Tool Selection Testing Framework. This framework provides a structured, repeatable, and scalable method to evaluate and enhance our GenAI agent's tool selection capabilities. By testing a variety of real-world scenarios and analysing the agent's responses, the framework helps us identify strengths and weaknesses in our agentic application.
The core idea of the framework is quite simple: We present an LLM with a series of test cases and observe which tools it selects. Each test case is designed to represent a specific scenario that the agent might encounter in real-world interactions. By evaluating the agent’s performance across these scenarios, we can gain valuable insights into the agent’s behavior, identify areas for improvement, and iteratively refine our setup (e.g. tweaking the system prompt).
Advantages of the Framework
- Reproducibility: Tests can be run repeatedly, making it easier to track progress over time.
- Granularity: Isolates specific tool selection decisions, simplifying debugging and targeted improvements.
- Scalability: Easily extended to accommodate a growing number of test cases and tools.
- Comprehensive Evaluation: Goes beyond exact matches and considers semantic equivalence of the model’s responses with the ground truth.
- Bring your own responses: You can use the dataset from this framework to create your own responses, either via a different method or with a different model, and then evaluate those responses within the framework.
A quick note about function calling: The default method for tool selection for the models supported in this framework (OpenAI and Gemini) is function calling. It is straightforward and a capability that is baked into these models. That being said, function calling is only one of many methods for tool selection. Others include controlled generation or simply asking the model to provide the tool selection (and the parameters) in a text response.
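To illustrate that last option, here is a rough sketch of text-based tool selection – the prompt wording and the expected output contract are my own assumptions, not part of the framework:

import json

# Sketch of text-based tool selection: rather than native function calling,
# we ask the model to emit its choice as JSON in an ordinary text response.
# The prompt wording and the output contract are illustrative assumptions.
TOOL_SELECTION_INSTRUCTIONS = (
    "You have access to these tools:\n"
    "- get_weather(location): Get the weather in a given location\n"
    "- translate_text(text, target_language): Translate a text\n"
    "\n"
    'Respond ONLY with JSON: {"tool": "<name>", "arguments": {...}}.\n'
    'If no tool is needed, respond with {"tool": null}.\n'
    "\n"
)


def build_prompt(user_query: str) -> str:
    return TOOL_SELECTION_INSTRUCTIONS + "User query: " + user_query


def parse_tool_choice(model_text: str) -> dict:
    # The model's reply is expected to be a single JSON object
    return json.loads(model_text)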
Repository structure
This is the structure of our repo:
genai-agent-tool-selection-testing/
├── main.py                  # Main entry point and test orchestration
├── models.py                # Model implementations (OpenAI, Gemini)
├── evaluator.py             # Evaluation logic and metrics
├── model_tester.py          # Test execution engine
├── utils.py                 # Utility functions for processing responses
├── tools/
│   ├── functions.py         # Function definitions for tool calling
│   └── function_registry.py # Function registry and model-specific formatting
├── datasets/
│   └── test_dataset.json    # Combined test dataset
├── prompts/
│   ├── semantic_judge_tool_selection.txt
│   ├── semantic_judge_error.txt
│   ├── semantic_judge_clarifying.txt
│   ├── semantic_judge_no_tool.txt
│   └── semantic_judge_not_supported.txt
├── results/                 # Test run outputs
├── requirements.txt
└── README.md
It contains a tool registry for collecting and providing the tools for the LLMs, a dataset folder, and several main components, such as the model_tester and the evaluator.
Let’s dive deeper into the main components!
The tools
The tools are stored in a model-independent format. As mentioned before, the idea is to support several models in this framework, and to that end it is important to ensure that all of them can use these tools. Because every provider has a slightly different way of equipping models with tools, we need to register the tools from the tool repository with the model.
The tools look like so:
registry.register(Function(
    name="get_weather",
    description="Get the weather in a given location",
    parameters=[
        FunctionParameter(
            name="location",
            type="string",
            description="The city name of the location for which to get the weather."
        )
    ]
))

registry.register(Function(
    name="get_current_time",
    description="Get the current time in a specified timezone.",
    parameters=[
        FunctionParameter(
            name="timezone",
            type="string",
            description="The timezone to get the current time for, e.g., 'America/New_York'."
        )
    ]
))

registry.register(Function(
    name="translate_text",
    description="Translate a given text to a target language.",
    parameters=[
        FunctionParameter(
            name="text",
            type="string",
            description="The text to translate."
        ),
        FunctionParameter(
            name="target_language",
            type="string",
            description="The language to translate the text into, e.g., 'Spanish'."
        )
    ]
))
Depending on which model is used, these functions are then registered with the LLM at runtime:
from typing import Dict, List, Union

from vertexai.generative_models import Tool


class FunctionRegistry:
    def __init__(self):
        self.functions: Dict[str, Function] = {}

    def register(self, function: Function):
        self.functions[function.name] = function

    def get_functions_for_model(self, model_type: str) -> Union[List[dict], Tool]:
        if model_type == "openai":
            return [f.to_openai_format() for f in self.functions.values()]
        elif model_type == "gemini":
            declarations = [f.to_gemini_format() for f in self.functions.values()]
            return Tool(function_declarations=declarations)
        else:
            raise ValueError(f"Unsupported model type: {model_type}")
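The to_openai_format and to_gemini_format methods are referenced above but not shown. Here is a plausible sketch of what they might produce, based on each provider's published function-calling schema – treat this as an approximation, not the repo's actual code:

from dataclasses import dataclass
from typing import List


@dataclass
class FunctionParameter:
    name: str
    type: str
    description: str


@dataclass
class Function:
    name: str
    description: str
    parameters: List[FunctionParameter]

    def _schema(self) -> dict:
        # JSON-Schema-style parameter object shared by both providers
        return {
            "type": "object",
            "properties": {
                p.name: {"type": p.type, "description": p.description}
                for p in self.parameters
            },
            "required": [p.name for p in self.parameters],
        }

    def to_openai_format(self) -> dict:
        # OpenAI's function-calling API takes name/description/parameters
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self._schema(),
        }

    def to_gemini_format(self) -> dict:
        # Gemini's FunctionDeclaration accepts an equivalent OpenAPI-style
        # schema; shown here as a plain dict for simplicity
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self._schema(),
        }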
In total there are 15 tools in this repo, but those could easily be amended.
The dataset
For this framework I created a dataset consisting of five different subsets that will test different expected behaviours from the LLMs. Each subset is structured the same way and contains:
- Test case ID
- User Query: The input that triggers the agent’s action. This could be a natural language question, a command, or any other form of input relevant to the agent’s domain. For example: "What’s the weather in London?" or "Translate ‘Hello’ into Spanish."
- Ground Truth: The correct tool or function the LLM should choose to handle the query (along with the correct arguments and values for the tool) or, in cases where no tool should be selected, the response the LLM should provide.
1) Tool Selection: The purpose of this dataset is to test the LLM’s capability to select the appropriate tool based on the user’s query.
Example entry:
{
    "id": "A001",
    "user_query": "What's the weather like in New York?",
    "ground_truth": {
        "function_call": {
            "name": "get_weather",
            "arguments": {
                "location": "New York"
            }
        }
    }
}
2) No tools: Ensures the agent responds directly from its internal knowledge without using any tools when appropriate.
Example entry:
{
    "id": "B002",
    "user_query": "Who wrote Romeo and Juliet?",
    "ground_truth": {
        "text": "Romeo and Juliet was written by William Shakespeare.",
        "no_function_call": true
    }
}
3) Clarifying: Checks if the agent appropriately asks for missing information when the user’s query is incomplete or ambiguous.
Example entry:
{
    "id": "C007",
    "user_query": "Convert this measurement.",
    "ground_truth": {
        "text": "Sure, could you please specify the value and the units you'd like to convert from and to?",
        "no_function_call": true
    }
}
4) Error handling: Assesses the agent's ability to handle invalid inputs gracefully.
Example entry:
{
    "id": "D002",
    "user_query": "Set a reminder to attend the meeting on April 31st, 2024.",
    "ground_truth": {
        "text": "Apologies, but April has only 30 days. Could you provide a valid date for the reminder?",
        "no_function_call": true
    }
}
5) Not supported: Verifies that the agent gracefully informs the user when it cannot fulfill a request due to limitations.
Example entry:
{
    "id": "E012",
    "user_query": "Control the thermostat and set it to 72 degrees.",
    "ground_truth": {
        "text": "I'm sorry, but I can't control home devices like thermostats.",
        "no_function_call": true
    }
}
These datasets are used to simulate interactions between the user and an agent. During testing, the framework iterates through each test case, presenting the user query to the agent and capturing its response. The LLM’s output is then compared against the ground truth specified in the dataset to determine whether the test case is passed or failed.
By analyzing the agent’s performance across all test cases, we can identify patterns, understand the model’s decision-making process, and pinpoint areas that require improvement.
The Flow of the Application
The flow of the application is relatively straightforward. First, let's have a look at the diagram that illustrates the logic:

[Diagram: the application flow – load dataset and tools, send test cases to the model, convert responses to a model-independent format, evaluate against the ground truth, and aggregate the results]
When running the framework, we first specify a few parameters: the model we want to run (Gemini 1.5 Flash/Pro, GPT-4o mini, etc.), the dataset we want to use, and the prompt and model for the semantic judge – more on that later. Finally, we can specify whether we only want to create the model responses, only run the evaluation (bringing our own responses), or run the entire pipeline.
python main.py \
    --model-type gemini \
    --dataset datasets/test_dataset.json \
    --semantic-judge-model gemini-1.5-pro-002
Once the process kicks off, it loads the dataset and the tools. As mentioned earlier, the tools are stored in a model-independent format, ensuring that they can be used with either the Gemini or the OpenAI models; the tool registration takes care of that.
Then the test cases are sent to the chosen LLM. This happens asynchronously, so all the test cases are processed in parallel, which speeds up the process.
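In outline, this parallel fan-out could look like the following minimal asyncio sketch, where get_model_response stands in for the framework's actual model call:

import asyncio


async def get_model_response(test_case: dict) -> dict:
    # Stand-in for the framework's actual async call to the chosen LLM
    ...


async def run_all(test_cases: list[dict]) -> list[dict]:
    # Create one coroutine per test case and await them all concurrently
    tasks = [get_model_response(tc) for tc in test_cases]
    return await asyncio.gather(*tasks)

# responses = asyncio.run(run_all(test_cases))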
Once the responses have been recorded, they are converted into a model-independent format. This is done so that the evaluation logic doesn't have to deal with model-specific responses from different models.
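A hedged sketch of what this normalisation might look like, reducing each provider's response object to a common dict – the attribute paths approximate each SDK and may differ from the repo's utils.py:

import json


def normalize_response(model_type: str, response) -> dict:
    # Convert provider-specific response objects into one common shape:
    # {"function_call": {...}} or {"text": "..."}. Attribute paths
    # approximate each SDK and may vary by version.
    if model_type == "openai":
        message = response.choices[0].message
        if message.function_call is not None:
            return {
                "function_call": {
                    "name": message.function_call.name,
                    # OpenAI returns the arguments as a JSON string
                    "arguments": json.loads(message.function_call.arguments),
                }
            }
        return {"text": message.content}

    if model_type == "gemini":
        part = response.candidates[0].content.parts[0]
        if part.function_call:
            return {
                "function_call": {
                    "name": part.function_call.name,
                    "arguments": dict(part.function_call.args),
                }
            }
        return {"text": part.text}

    raise ValueError(f"Unsupported model type: {model_type}")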
The evaluator then compares the model responses with the ground truth from the dataset (again asynchronously and in parallel). If a response matches the ground truth exactly, the evaluation for that test case is complete. If they do not match exactly, they are sent to the semantic judge. This is a separate LLM instance that compares the model's response to the ground truth and decides whether they mean the same thing. If they do, the test case is passed; otherwise it is marked as a miss.
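In pseudocode form, the evaluation step boils down to something like this – a simplification of the repo's evaluator, with ask_semantic_judge as a hypothetical stand-in for the judge call:

async def evaluate_case(model_response: dict, ground_truth: dict) -> bool:
    # Exact match: the evaluation for this test case is complete
    if model_response == ground_truth:
        return True
    # No exact match: defer to the semantic judge, which decides whether
    # the two responses mean the same thing
    return await ask_semantic_judge(model_response, ground_truth)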
Here is an example of the semantic judge deciding on two different responses:
{
    "test_case": "B004",
    "user_query": "Who was the first person to walk on the moon?",
    "expected_text": "Neil Armstrong was the first person to walk on the moon on July 20, 1969.",
    "model_text": "The first person to walk on the moon was astronaut Neil Armstrong. He took his historic first step on the lunar surface on July 20, 1969, during NASA's Apollo 11 mission. Armstrong famously said, \"That's one small step for [a] man, one giant leap for mankind,\" as he stepped onto the moon.",
    "is_semantically_equivalent": true,
    "judge_explanation": "equivalent\nBoth responses correctly identify Neil Armstrong as the first person to walk on the moon on July 20, 1969. Response 2 provides additional context about the Apollo 11 mission and Armstrong's famous quote, but the core information remains the same.\n",
    "timestamp": "2024-11-18T16:12:49.442638"
}
As we can see, the responses were not identical, but they mean the same thing. The judge even acknowledges the differences and explains why they are still equivalent ("Response 2 provides additional context about the Apollo 11 mission and Armstrong's famous quote, but the core information remains the same.")
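The actual judge prompts live in the prompts/ folder of the repo; here is a sketch of the general shape such a prompt might take – the real wording will differ:

# Illustrative shape of a semantic judge prompt; the real prompts in the
# repo's prompts/ folder are worded differently.
SEMANTIC_JUDGE_TEMPLATE = """You are comparing two responses to the same user query.

User query: {query}
Response 1 (ground truth): {expected}
Response 2 (model output): {actual}

Decide whether both responses convey the same core information. Start your
answer with "equivalent" or "not equivalent", then briefly explain why."""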
The framework then aggregates all the results, reporting an overall accuracy as well as detailed reports that make it easy to debug and iterate on the setup:

[Screenshot: the aggregated test results with overall accuracy]
In particular, the test results enable us to quickly identify where the model responded incorrectly, for example by using a function call when it shouldn't have:

[Screenshot: a failed test case in which the model made a function call instead of responding directly]
Examples
Example 1: Testing Tool Selection from a User Query and a List of Tools
Let’s bring the framework to life with a concrete example. Imagine we’re building a GenAI agent designed to assist users with various tasks, and we want to test its ability to select the appropriate tool from a predefined list. Suppose the user asks: "What’s the weather forecast for San Francisco tomorrow?"
Our agent has access to the following tools:
- get_weather(location, date): Retrieves the weather forecast for a given location and date.
- get_news(topic): Retrieves news articles related to a specific topic.
- set_reminder(time, message): Sets a reminder for a specific time and message.
In this scenario, the correct tool selection is obviously get_weather(location="San Francisco", date="tomorrow"). Let’s see how our framework would evaluate the agent’s performance.
The framework presents the user query ("What’s the weather forecast for San Francisco tomorrow?") to the agent. The agent, based on its internal logic, should then select and execute the get_weather tool, providing the necessary arguments: location="San Francisco" and date="tomorrow". The framework captures this tool selection and compares it against the expected selection. If the agent correctly chooses get_weather with the correct arguments, the test case is marked as a success.
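Expressed in the dataset format introduced earlier, this test case could look like so – the ID is made up, and this get_weather variant takes a date parameter, matching the tool list above:

# Hypothetical test case entry following the dataset format shown earlier
test_case = {
    "id": "A042",
    "user_query": "What's the weather forecast for San Francisco tomorrow?",
    "ground_truth": {
        "function_call": {
            "name": "get_weather",
            "arguments": {
                "location": "San Francisco",
                "date": "tomorrow"
            }
        }
    }
}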
However, let’s consider a few alternative scenarios and how the framework handles them:
- Incorrect Tool Selection: If the agent selects get_news or set_reminder, the framework logs this as a failure, indicating a flaw in the agent’s understanding of the user’s intent.
- Correct Tool, Incorrect Arguments: If the agent selects get_weather but provides incorrect arguments (e.g., location="London"), the framework also logs this as a failure, highlighting the need for more precise argument extraction.
- Alternative Phrasing: If the agent generates a slightly different function call, such as retrieve_weather_forecast(city="San Francisco", date="tomorrow"), the semantic judge steps in. If the judge determines that this alternative call is semantically equivalent to the expected get_weather call (i.e., it would achieve the same outcome), the test case might still be considered a success, demonstrating the framework’s flexibility in handling variations in tool representation.
This example illustrates how the framework systematically evaluates the agent’s tool selection capabilities, providing valuable insights into its strengths and weaknesses across diverse scenarios.
Example 2: The Model Needs to Realize It Can Answer Right Away (No Tool Needed)
Sometimes, the smartest tool an agent can use is its own knowledge. This is true for information that is (relatively) static and likely to be in the model’s training data. Let’s consider a scenario where the agent possesses the information required to answer a user’s query directly, without needing to call any external tools. This tests the agent’s ability to recognize when action is unnecessary and to provide a direct response.
Suppose the user asks: "What’s the capital of France?" And let’s assume our agent’s internal knowledge base already contains this information.
The available tools for this scenario might include:
- query_database(query): Queries a database for information.
- search_web(query): Searches the web for information.
However, the optimal approach in this case is for the agent to not select any tool and instead directly respond with "Paris."
Our framework handles this scenario by checking whether the agent attempts to call a tool. If the agent correctly refrains from using any tools and provides the correct answer ("Paris"), the test case is marked as a success. This validates the agent's ability to discern when a direct response is appropriate, demonstrating a higher level of understanding and efficiency.
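A simplified sketch of that check, in the spirit of the dataset's no_function_call flag (the repo's evaluator handles more cases than this):

def passed_no_tool_case(model_response: dict, ground_truth: dict) -> bool:
    # Simplified check: any tool call at all fails a no-tool test case
    if "function_call" in model_response:
        # e.g. the agent called search_web instead of answering directly
        return False
    # Compare the direct answer against the ground truth text; in the real
    # framework, non-identical text falls back to the semantic judge
    return model_response.get("text") == ground_truth.get("text")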
Conversely, if the agent incorrectly selects a tool like query_database or search_web, the framework flags this as a failure. This indicates a potential flaw in the agent’s decision-making process, suggesting it might be overly reliant on external tools even when the answer is readily available internally. This type of error can lead to unnecessary computational overhead and slower response times, highlighting the importance of testing for scenarios where no tool selection is the optimal strategy.
Conclusion
We’ve explored the essential concepts behind GenAI agents and their unique capabilities that differentiate them from standard LLM applications. By exploring the importance of tool selection, we’ve seen how this capability forms the foundation for building agentic systems capable of taking action.
We covered the challenges of deploying GenAI agents in dynamic environments and why continuous testing is vital to ensure their reliability. And, most importantly, we introduced the Tool Selection Testing Framework as a structured, scalable way to evaluate and refine agents’ ability to choose the right tools under diverse scenarios. Through practical examples, we demonstrated how the framework can identify strengths, highlight weaknesses, and help teams iteratively improve their GenAI agents.
Of course, there’s always room for growth. Expanding the framework to support more models, integrate multi-agent setups, or add parallel tool selection are just a few of the opportunities for future development.
I encourage you to explore the repository, adapt it to your needs, and contribute to its evolution. Whether you’re building GenAI agents for e-commerce, healthcare, or customer support, this framework equips you to deploy reliable, adaptable systems that can thrive in dynamic environments.
Heiko Hotz
👋 Follow me on Medium and LinkedIn to read more about Generative AI, Machine Learning, and Natural Language Processing.
👥 If you’re based in London join one of our NLP London Meetups.