Salesforce study finds LLM agents flunk CRM and confidentiality tests

A new benchmark developed by Salesforce researchers shows that LLM-based AI agents perform below par on standard CRM tests and fail to understand the need for customer confidentiality.
A team led by Kung-Hsiang Huang, a Salesforce AI researcher, showed that on a new benchmark relying on synthetic data, LLM agents achieve a success rate of around 58 percent on tasks that can be completed in a single step, without follow-up actions or further information.
Using the benchmark tool CRMArena-Pro, the team also showed that the success rate of LLM agents drops to 35 percent when a task requires multiple steps.
Another cause for concern is highlighted in the LLM agents' handling of confidential information. "Agents demonstrate low confidentiality awareness, which, while improvable through targeted prompting, often negatively impacts task performance," a paper published at the end of last month said.
The Salesforce AI Research team argued that existing benchmarks failed to rigorously measure the capabilities or limitations of AI agents, and largely ignored an assessment of their ability to recognize sensitive information and adhere to appropriate data handling protocols.
The research unit's CRMArena-Pro tool is fed by a pipeline of realistic synthetic data that populates a Salesforce organization, which serves as the sandbox environment. The agent takes user queries and decides whether to make an API call or to respond to the user, either to ask for clarification or to provide an answer.
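The decision loop described above can be pictured roughly as follows. This is only an illustrative sketch, not code from the paper or from CRMArena-Pro: the names `ApiCall`, `Respond`, `run_turn`, `toy_policy` and `toy_sandbox` are all hypothetical, and the "sandbox" is a stub standing in for the populated Salesforce organization.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class ApiCall:
    query: str  # hypothetical: a request to run against the sandbox


@dataclass
class Respond:
    text: str  # a clarifying question or a final answer for the user


Action = Union[ApiCall, Respond]
History = List[Tuple[str, str]]  # (role, content) pairs


def run_turn(policy: Callable[[History], Action],
             sandbox: Callable[[str], str],
             user_query: str,
             max_steps: int = 5) -> str:
    """Let the agent alternate API calls until it chooses to respond."""
    history: History = [("user", user_query)]
    for _ in range(max_steps):
        action = policy(history)
        if isinstance(action, Respond):
            return action.text  # agent answers or asks for clarification
        # Otherwise the agent chose an API call; record what came back.
        history.append(("api", sandbox(action.query)))
    return "step limit reached"


# Toy stand-ins (assumptions, for illustration only).
def toy_policy(history: History) -> Action:
    role, content = history[-1]
    if role == "user":
        return ApiCall("lookup open cases")
    return Respond(f"Found: {content}")


def toy_sandbox(query: str) -> str:
    return "3 open cases"


answer = run_turn(toy_policy, toy_sandbox, "How many open cases?")
```

In this toy setup a single-step task ends after one API call, while a multi-step task would keep the loop going for several calls before the agent responds, which is where the benchmark reports the larger drop in success.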
"These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios," the paper said.
The findings might worry both developers and users of LLM-powered AI agents. Salesforce co-founder and CEO Marc Benioff told investors last year that AI agents represented "a very high margin opportunity" for the SaaS CRM vendor as it takes a share in efficiency savings accrued by customers using AI agents to help get more work out of each employee.
Elsewhere, the UK government has said it would target savings of £13.8 billion ($18.7 billion) by 2029 with a digitization and efficiency drive that relies, in part, on the adoption of AI agents.
AI agents might well be useful. However, organizations should be wary of banking on any benefits before they are proven. ®