How many tools/functions can an AI Agent have?
In the previous article, we discussed how “AI Agents” can improve the quality and interactivity of content-based solutions. In the next few articles, we will explore the different aspects of building a good AI Agent.
As LLMs become more capable, developers are exploring their ability to interact with external tools (or function calls). However, how many tools can an AI Agent handle successfully? The answer depends on several factors, including token limits, API constraints, and the model’s architecture.
Let’s dive deeper into the technical considerations, including insights from real-world benchmarks and known limits. In this article, we’ll start with a look at the impact of token limits.
[This article is part of my AI-first Enterprise Automation with GenAI and AI Agents series]
Understanding Tool Calling in LLMs
LLMs that support function calling allow integration with external APIs, databases, and computation engines. While this enables powerful applications, there are limitations to how many tool choices can be included and handled efficiently in a single request. This is because the LLM must evaluate all the information provided for each tool on every request, and this information directly affects the tokens used.
Factors that affect how many tools a single request can handle include:
- Token limitations (input + output constraints)
- The number of tools per agent
- The description of each tool to help the model’s evaluation
- The size of each tool’s signature (i.e., the parameters it requires)
- Model-specific constraints (some models enforce hard limits)
Let’s break these down in detail.
Factors That Affect the Number of Tool Calls in an LLM Request
1. Token Budgeting: Every Tool Definition Uses Tokens
Each tool’s definition (including its name, description, parameters, and expected output) takes up input tokens. This means that:
- The more tools you define, the more input tokens are used.
- The more parameters a tool has, the more input tokens are used.
- The more complex the tool response, the more tokens are used in the conversation history fed into the next request.
Real-World Token Usage Examples (see reference)
- A simple tool with just 1 parameter → 96 tokens
- A tool with 28 parameters (from the BFCL) → 1633 tokens
- A toolset of 37 tools in BFCL → 6218 tokens
This means that on smaller models with a limited context window (such as 4K tokens), as few as 10 tools with large signatures may already exceed the total limit.
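To make the relationship between tool definitions and token usage concrete, here is a minimal sketch that estimates the footprint of a tool definition. It assumes a rough heuristic of about four characters per token, which is only an approximation; the tokenizer for your model will give different numbers than the ones quoted above. The `estimate_tokens` and `make_tool` helpers and the sample tools are hypothetical and exist only for illustration.

```python
import json

# Assumption: a rough average of 4 characters per token, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(tool_definition: dict) -> int:
    """Roughly estimate how many input tokens a tool definition consumes."""
    serialized = json.dumps(tool_definition)
    return max(1, len(serialized) // CHARS_PER_TOKEN)


def make_tool(name: str, param_count: int) -> dict:
    """Build a hypothetical tool definition with `param_count` string parameters."""
    properties = {
        f"param_{i}": {
            "type": "string",
            "description": f"Illustrative parameter number {i}.",
        }
        for i in range(param_count)
    }
    return {
        "name": name,
        "description": "Hypothetical tool used only to illustrate token growth.",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }


small_tool = make_tool("small_tool", 1)
large_tool = make_tool("large_tool", 28)
toolset = [make_tool(f"tool_{i}", 3) for i in range(37)]

print("1 parameter:", estimate_tokens(small_tool))
print("28 parameters:", estimate_tokens(large_tool))
print("37 tools:", sum(estimate_tokens(tool) for tool in toolset))
```

The absolute numbers depend entirely on the heuristic, but the trend is the point: the estimate grows with both the number of parameters and the number of tools.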
2. Context Window: The Model’s Token Limit Matters
Early LLMs had small context windows, but more recently most providers seem to have standardized on at least 128K tokens.
- granite-13b-instruct-v2: 8K tokens
- granite-3-2b-instruct: 128K tokens
- llama-3-3-70b-instruct: 128K tokens
- deepseek-r1: 128K tokens
- gpt-4o: 128K tokens
- gemini-v2-pro: 2M tokens
🔹 Example: If you’re using a model with an 8K token limit and each tool definition consumes 500 tokens, you can define only about 8 tools before you hit 50% of the limit. Note that the token limit covers both input and output, so you should not use up all the tokens just specifying the agent definition.
🔹 If you have a model with a 128K context window, you can theoretically add many more tools (100?), but sending that many tools in every interaction will reduce reasoning quality. Note that in the Berkeley Function Calling Leaderboard (BFCL), the highest number of functions is 37. Suffice it to say that nobody seems to have systematically tested sending a large number of tools in an agent definition.
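The arithmetic behind the 8K example can be sketched as a small helper. The 500 tokens per tool and the 50% budget are the illustrative figures from the example above, not measured values, and the function name `max_tools_within_budget` is hypothetical.

```python
def max_tools_within_budget(
    context_window: int,
    tokens_per_tool: int,
    budget_fraction: float = 0.5,
) -> int:
    """Return how many tool definitions fit in a share of the context window.

    `budget_fraction` is the portion of the window reserved for tool
    definitions; the rest is left for the conversation and the output.
    """
    if tokens_per_tool <= 0:
        raise ValueError("tokens_per_tool must be positive")
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    budget = int(context_window * budget_fraction)
    return budget // tokens_per_tool


# 8K window: 8000 * 0.5 = 4000 tokens of budget, 4000 // 500 = 8 tools
print(max_tools_within_budget(8_000, 500))    # prints 8

# 128K window: 128000 * 0.5 = 64000 tokens of budget, 64000 // 500 = 128 tools
print(max_tools_within_budget(128_000, 500))  # prints 128
```

This only answers whether the definitions fit. It says nothing about whether the model still picks the right tool when that many are offered.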
3. Hard Limits Imposed by Providers
- OpenAI has a hard limit of 128 tools per agent (but performance degradation likely starts much sooner).
- Llama 3 and Granite have not stated any tool limits, but practical constraints (context length, token usage, accuracy, and memory consumption) will still apply.
Even if an LLM theoretically supports a large number of tools, sending all tool definitions on every interaction increases cost, decreases inference accuracy and increases latency.
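A provider's hard limit can be checked before a request is ever sent. The following is a minimal sketch, assuming the 128-tool figure cited above still applies (check the provider's current documentation); `validate_tool_count` and `TooManyToolsError` are hypothetical names, not part of any SDK.

```python
from typing import Optional, Sequence

# Assumption: the limit cited in this article; confirm against current provider docs.
OPENAI_MAX_TOOLS = 128


class TooManyToolsError(ValueError):
    """Raised when an agent defines more tools than the provider accepts."""


def validate_tool_count(tool_names: Sequence[str], hard_limit: Optional[int]) -> int:
    """Return the tool count, or raise if it exceeds a known hard limit.

    Pass `hard_limit=None` for providers that publish no limit; practical
    constraints such as context length still apply in that case.
    """
    count = len(tool_names)
    if hard_limit is not None and count > hard_limit:
        raise TooManyToolsError(
            f"{count} tools defined, but the provider accepts at most {hard_limit}"
        )
    return count


tools = [f"tool_{i}" for i in range(40)]
print(validate_tool_count(tools, OPENAI_MAX_TOOLS))  # prints 40
print(validate_tool_count(tools, None))              # prints 40
```

Passing this check only means the request will be accepted; accuracy and latency may degrade well below the limit.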
How Many Tool Choices Are Too Many?
A general guideline based on token constraints and API behavior:
- 1–3 tools per agent → Safe and efficient.
- 4–10 tools per agent → Doable, but may slow down execution and consume more tokens.
- 10+ tools per agent → Risks reaching token limits on models with smaller context windows; execution may slow down even further, inference accuracy will degrade, and cost will increase.
If you’re working with a model that has a small context window (4K or 8K tokens), avoid defining more than a few tools at once. Even for larger models like Llama 3 (128K context), accuracy will still degrade with excessive tool definitions.
Another consideration is that in the Berkeley Function Calling Leaderboard benchmark, the average number of function choices per test is 3, so the behavior of LLMs on large tool sets has not really been explored.
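The tiers above can be encoded as a simple check to run when an agent is assembled. This is a sketch of the guideline only, not a measured threshold. The tiers overlap at exactly 10 tools, so the code assumes 10 belongs to the middle tier; the function name and tier labels are hypothetical.

```python
def classify_tool_count(tool_count: int) -> str:
    """Map a tool count onto the rough guideline tiers.

    Assumption: exactly 10 tools counts as "doable" rather than "risky".
    """
    if tool_count < 1:
        raise ValueError("an agent needs at least one tool to classify")
    if tool_count <= 3:
        return "safe"
    if tool_count <= 10:
        return "doable"
    return "risky"


for count in (2, 7, 10, 25):
    print(count, classify_tool_count(count))
# prints:
# 2 safe
# 7 doable
# 10 doable
# 25 risky
```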
Best Practices for Managing Tool Calls in LLMs
There are several techniques one can apply to reduce the number of tools when defining AI Agents. We will go into more detail on some of these in future articles.
✅ Prioritize essential tools: Only define tools that are critical for the request.
✅ Batch tool calls: If multiple calls fetch similar data, combine them.
✅ Optimize tool signatures: Remove unnecessary parameters, and use short parameter names and descriptions to minimize token usage.
✅ Cache tool definitions: If possible, avoid resending full tool definitions every turn. This likely requires a fairly sophisticated agent.
✅ Use hierarchical tool selection: Define meta-tools that sub-select functions in the relevant domain. This leads into the concept of sub-agents, which we will discuss in more detail in a later article.
✅ Use dynamic tool activation/selection: Depending on the framework used, you can filter the tools evaluated for a given context. For example, narrow down the set of tools ahead of time for a given interaction when creating the AI Agent.
✅ Monitor LLM performance: Keep track of API behavior as the number of tools grows.
The above will help not only with managing token limits, but also with other aspects such as the operating cost, inference performance, and accuracy of your application.
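As one illustration of dynamic tool selection, the sketch below filters a tool registry down to the tools relevant to a request and stops adding tools once a token budget is used up. The keyword-overlap scoring is a deliberately naive stand-in for whatever relevance mechanism your framework provides, and the `Tool` class, the `select_tools` function, the registry entries, and the token estimates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Tool:
    name: str
    keywords: Set[str]      # hypothetical tags describing when the tool applies
    estimated_tokens: int   # assumed size of the tool definition in tokens


def select_tools(request: str, tools: List[Tool], token_budget: int) -> List[Tool]:
    """Pick the tools whose keywords overlap the request, within a token budget."""
    words = set(request.lower().split())
    scored = [(len(tool.keywords & words), tool) for tool in tools]
    relevant = [pair for pair in scored if pair[0] > 0]
    # Highest overlap first; the stable sort keeps registry order on ties.
    relevant.sort(key=lambda pair: pair[0], reverse=True)

    selected: List[Tool] = []
    used = 0
    for _, tool in relevant:
        if used + tool.estimated_tokens > token_budget:
            continue  # skip tools that would exceed the budget
        selected.append(tool)
        used += tool.estimated_tokens
    return selected


registry = [
    Tool("get_weather", {"weather", "forecast", "temperature"}, 120),
    Tool("create_reminder", {"reminder", "remind", "schedule"}, 150),
    Tool("search_flights", {"flight", "flights", "airline"}, 400),
]

chosen = select_tools("what is the weather forecast for tomorrow", registry, 1_000)
print([tool.name for tool in chosen])  # prints ['get_weather']
```

Only the selected definitions would then be sent with the request, so the per-turn token cost scales with the tools that matter rather than with the size of the whole registry.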
Final Thoughts
The number of tool/function calls an LLM can handle depends on token limits, model constraints, and API rules. While larger models like Llama 3, Granite 3 or GPT-4o can technically support many tools, real-world performance considerations (cost, reasoning degradation, and latency) should guide your implementation.
This article uncovered the critical impact of token limits. In upcoming pieces, we’ll dive into how the number and complexity of tools influence both model performance and accuracy. Stay tuned!
References
1. Simple function: 96 tokens based on a Token Calculator.
   {
     "function": [
       {
         "name": "reminders_complete",
         "description": "Marks specified reminders as completed and returns the status of the operation.",
         "parameters": {
           "type": "dict",
           "required": ["token"],
           "properties": {
             "token": {
               "type": "string",
               "description": "Authentication token to verify the user's identity."
             }
           }
         }
       }
     ]
   }
2. Function with many parameters (28): 1633 tokens. You can see the example in BFCL.
3. Many functions (37): 6218 tokens. You can see the example in BFCL.