LLM function calls don't scale; code orchestration is simpler, more effective

TL;DR: Feeding LLMs the full output of tool calls is costly and slow. Output schemas will let us get structured data back, so we can have the LLM orchestrate processing with generated code. Calling tools from code is simpler and more effective.
One common practice for working with MCP tool calls is to put the output from a tool back into the LLM as a message and ask the LLM for the next step. The hope is that the model figures out how to interpret the data and identifies the correct next action to take.
This can work beautifully when the amount of data is small, but when we tried MCP servers with real-world data, it quickly broke down.
MCP in the real world
We use Linear and Intercom at our company. We connected to their latest official MCP servers released last week to understand how they were returning tool calls.
It turns out that both servers returned large JSON blobs in their text content. These appeared to be similar to their APIs, with the exception that the text content did not come with any pre-defined schemas. This meant that the only reasonable way to parse them was to have the LLM interpret the data.
These JSON blobs are huge! When we asked Linear's MCP to list issues in our project, the tool call defaulted to returning only 50 issues, ~70k characters corresponding to ~25k tokens.
The JSON contains lots of id fields that take up many tokens and are not semantically meaningful.
When using Claude with MCPs, it seems that the entire JSON blob gets sent back to the model verbatim.
This approach quickly runs into issues. For example, if we wanted to get the AI to sort all the issues by due date and display them, it would need to reproduce all the issues verbatim as output tokens! It'd be slow, costly, and could potentially miss data.
The data in our issues also often contained a lot of distracting information: steps to reproduce an issue, errors, maybe even prompts a user had used, or instructions to follow up with a user. The model could fail to emit some of this data accurately or, even worse, deviate from its original instructions.
Data vs Orchestration
The core problem here is that we're conflating orchestration and data processing in the same chat thread.
The "multi-agent" approach tries to address this by spinning up another chat thread ("agent") to focus only on the data processing piece. It performs better when carefully tuned, but it's still awkward when our data is already well structured.
If the MCP servers are already returning data in JSON format, it seems much more natural to parse it and operate on the structured data directly. Back to our sorting example: rather than asking the LLM to reproduce the outputs directly, we could instead run a sort operation on the data and return the new array. No hallucinations, and this scales to any size of input.
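To make this concrete, here is a minimal sketch in Python. The payload is a hypothetical, trimmed-down stand-in for what an issue-tracker MCP tool might return as text content; the point is that the sort happens in code, not in the model's output.

```python
import json

# Hypothetical, trimmed-down stand-in for the JSON an issue-tracker MCP tool
# returns in its text content (real payloads are far larger).
raw = """[
  {"id": "a1f3", "title": "Fix login crash", "dueDate": "2025-06-02"},
  {"id": "b7c9", "title": "Update billing copy", "dueDate": null},
  {"id": "c2d4", "title": "Migrate search index", "dueDate": "2025-05-28"}
]"""

issues = json.loads(raw)  # parse once in code instead of re-reading the blob with the LLM

# Sort with ordinary code: no output tokens spent reproducing issues, no dropped rows.
issues.sort(key=lambda issue: issue["dueDate"] or "9999-12-31")  # nulls sort last

for issue in issues:
    print(issue["dueDate"], issue["title"])
```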
Code execution as data processing
This sounds oddly familiar: we already have code interpreters for AI. Bringing in code execution as a fundamental way to process data from MCP tools (Code Act, Smol Agents) opens the door to scalable ways for AI models to work.
Variables as memory. Instead of having an external memory system, the LLM can use variables (system memory) to store any data. Storing a memory is assigning a value to a variable, peeking at it is printing it, and the model can pass the variable as an argument when calling another function. Even better, if the language used is well-typed, the model can also leverage the schema.
Tool chaining. Code can orchestrate multiple function calls: performing them in parallel or taking the outputs from one or more calls and using them as inputs into another. The dependencies between the function calls are implicitly represented via the computation graph the code represents. Importantly, the LLM is not required to regurgitate the data, and we have guarantees of completeness.
Scalable processing. Transforming large amounts of data is naturally possible with code. The model can choose to use loops, or lean on libraries such as NumPy or pandas for large data transformations. A sketch of these patterns follows below.
Code can also call other LLMs under the hood: you can have the LLM write code that calls LLMs for unstructured data processing (LLM-inception).
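To make these patterns concrete, here is a minimal sketch. The linear and intercom helpers, the my_runtime module, and summarize_with_llm are hypothetical stand-ins for whatever tool wrappers a given runtime exposes; what matters is the shape of the orchestration, not the exact API.

```python
import pandas as pd

# Hypothetical tool wrappers an AI runtime might expose; in practice these would
# proxy MCP tool calls. None of these names come from a real library.
from my_runtime import linear, intercom, summarize_with_llm

# Variables as memory: tool outputs live in ordinary variables the model can
# inspect, print, and pass into later calls.
issues = linear.list_issues(project="backend")  # e.g. a list of dicts

# Tool chaining: the output of one call feeds the next; the code itself is the
# dependency graph, and no data is regurgitated through the model.
linked = [
    (issue, intercom.get_conversation(issue["conversationId"]))
    for issue in issues
    if issue.get("conversationId")
]

# Scalable processing: lean on pandas for transformations over any number of rows.
df = pd.DataFrame(issues)
df["dueDate"] = pd.to_datetime(df["dueDate"], errors="coerce")
overdue = df[df["dueDate"] < "2025-06-01"].sort_values("dueDate")
print(overdue[["title", "dueDate"]].to_string(index=False))

# LLM-inception: only the unstructured pieces go back through a model.
for issue, conversation in linked:
    issue["summary"] = summarize_with_llm(conversation["transcript"])
```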
Is MCP ready?
MCP specs already define input schemas, and they’ve just introduced output schemas.
Once output schemas are widespread, we expect them to unlock use cases on large datasets: building custom dashboards, creating weekly reports on tickets completed, or having autonomous agents monitor stalled issues and nudge them forward.
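For illustration, a tool declaration with an output schema might look roughly like the sketch below. The field names mirror the JSON-Schema style MCP already uses for input schemas; treat the exact shape as an assumption rather than a quotation from the spec.

```python
# Sketch of a tool declaration advertising both an input and an output schema,
# written as a Python dict. The exact field names are an assumption, not spec text.
list_issues_tool = {
    "name": "list_issues",
    "description": "List issues in a project",
    "inputSchema": {
        "type": "object",
        "properties": {"project": {"type": "string"}},
        "required": ["project"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {
            "issues": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "title": {"type": "string"},
                        "dueDate": {"type": "string", "format": "date"},
                    },
                },
            }
        },
    },
}
```

With a schema like this available, the client can parse tool results into typed structures up front, and generated code can rely on those fields instead of asking the model to re-read the blob.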
What makes code execution hard?
The challenge now shifts to the MCP client side. Most execution environments today run in a tightly controlled sandbox; security is paramount as we're dealing with user-/AI-generated code.
Allowing an execution environment to also access MCPs, tools, and user data requires careful design around where API keys are stored and how tools are exposed.
In our designs, we created sandboxed environments keyed with specific API access; the models are given documentation on how to call these APIs so that they can send and retrieve information without ever seeing secrets.
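As one way this can be arranged (a sketch of the general shape, with hypothetical endpoint and function names, not our exact implementation): the sandbox only sees a thin, documented client, while the real API key lives in a proxy running outside the sandbox.

```python
import requests

# Inside the sandbox: a thin, documented client the model-generated code can call.
# It talks to a credentialing proxy running outside the sandbox; the proxy holds
# the real API key and attaches it to outbound requests, so generated code never
# handles secrets. The proxy URL and endpoint are hypothetical.
PROXY_URL = "http://tool-proxy.internal"  # reachable only from inside the sandbox

def list_issues(project: str) -> list[dict]:
    """List issues in a project. Credentials are added by the proxy, not here."""
    resp = requests.get(f"{PROXY_URL}/linear/issues", params={"project": project})
    resp.raise_for_status()
    return resp.json()
```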
Most execution environments are stateful (e.g., they may rely on a running Jupyter kernel for each user session). This is hard to manage and expensive if users expect to come back to AI task sessions later. A stateless-but-persistent execution environment is essential for long-running (multi-day) task sessions.
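One way to approximate "stateless but persistent", sketched under the assumption that session state is just a dictionary of named variables: snapshot the variables to durable storage after each code step and rehydrate them when the session resumes, so no kernel needs to stay alive in between. Paths and function names here are illustrative.

```python
import pickle
from pathlib import Path

SESSIONS = Path("/var/lib/ai-runtime/sessions")  # hypothetical durable store

def save_session(session_id: str, variables: dict) -> None:
    """Snapshot the session's variables after a code step finishes."""
    SESSIONS.mkdir(parents=True, exist_ok=True)
    (SESSIONS / f"{session_id}.pkl").write_bytes(pickle.dumps(variables))

def load_session(session_id: str) -> dict:
    """Rehydrate the variables when the user returns, possibly days later."""
    path = SESSIONS / f"{session_id}.pkl"
    return pickle.loads(path.read_bytes()) if path.exists() else {}
```

In practice, the awkward part is state that doesn't serialize cleanly, such as open connections and client objects, which is part of what makes this design hard.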
These constraints are creating what we think is a new category of runtimes - "AI runtimes" that use LLMs to orchestrate and perform tasks. We're still in the early phases of working out all the details for this code-execution approach, and we'd love feedback from anyone tackling similar problems. If you're interested in our approach, you can head to Lutra to give it a try.