Trust in agentic AI: Why evaluation infrastructure must come first


As AI agents enter real-world deployment, organizations are under pressure to define where they belong, how to build them effectively, and how to operationalize them at scale. At VentureBeat Transform 2025, technology leaders gathered to talk about how they are transforming their businesses with agents: Joanne Chen, general partner at Foundation Capital; Shailesh Nalawadi, VP of product management at Sendbird; Thys Waanders, SVP of AI transformation at Cognigy; and Shawn Malhotra, CTO of Rocket Companies.

https://www.youtube.com/watch?v=dchzgcf1poo

A few top agentic AI applications

“The initial attraction of any of these deployments of AI agents is about saving human capital – the math is pretty straightforward,” said Nalawadi. “However, that undersells the transformational capability you get with AI agents.”

At Rocket, AI agents have proven to be powerful tools for increasing website conversion.

“We’ve found that with our agent-based experience – the conversational experience on the website – visitors convert three times more often when they come through that channel,” said Malhotra.

But that only scratches the surface. For instance, a Rocket engineer built an agent in just two days to automate a highly specialized task: calculating transfer tax during mortgage underwriting.

“Those two days of effort saved us a million dollars a year in expense,” said Malhotra. “In 2024 we saved more than a million team member hours, mostly off the back of our AI solutions. And it’s not just about the savings. It also lets our team members focus their time on people making what is often the largest financial transaction of their lives.”

Agents are essentially supercharging individual team members. Those millions of saved hours aren’t one person’s entire job replicated many times over. They’re fractions of jobs – the parts employees don’t enjoy doing, or that add no value for the client. And those millions of saved hours give Rocket the capacity to handle more business.

“Some of our team members were able to handle 50% more clients last year than the year before,” Malhotra added. “That means we have higher throughput, can drive more business, and again we see higher conversion rates, because they’re spending their time understanding the customer’s needs rather than doing rote work that AI can now handle.”

Tackling the complexity of agentic AI

“Part of the journey for our engineering teams is the shift away from the traditional software-engineering mindset – write it once, test it, and it runs and gives the same answer 1,000 times – toward the more probabilistic approach, where you ask the same thing of an LLM and it gives different answers with some probability,” said Nalawadi. “A lot of it has been about bringing people along. Not just software engineers, but product managers and UX designers too.”

What has helped is that LLMs have come a long way, said Waanders. Eighteen months or two years ago, if you were building something, you really had to pick the right model or the agent would not perform as expected. Now, he says, we are at a stage where most mainstream models behave very well and are more predictable. Today the challenge is combining models, ensuring responsiveness, orchestrating the right models in the right sequence and weaving in the right data.

“We have customers who push tens of millions of conversations a year,” said Waanders. “If you’re automating, say, 30 million conversations in a year, how does that scale in the LLM world?”

A layer above the LLMs orchestrates a network of agents, said Malhotra. A conversational experience has a network of agents under the hood, and the orchestrator decides which of the available agents to hand each request to.

“If you play that forward and think about having hundreds or thousands of agents that are capable of different things, you get some really interesting technical problems,” he said. “It becomes a bigger problem because latency and time matter. That agent routing is going to be a very interesting problem over the coming years.”
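The routing pattern Malhotra describes can be sketched roughly as follows. This is a hypothetical illustration, not Rocket's implementation: the agent names, the registry, and the keyword-overlap routing are all assumptions standing in for what would in practice be LLM-backed classification.

```python
# Hypothetical sketch of an orchestrator routing requests across a
# network of specialized agents. All names and the keyword-based
# intent matching are illustrative placeholders.
from typing import Callable

# Each "agent" is just a callable here; in practice each would wrap
# its own LLM calls, tools, and data sources.
def transfer_tax_agent(request: str) -> str:
    return f"[transfer-tax agent] handled: {request}"

def refinance_agent(request: str) -> str:
    return f"[refinance agent] handled: {request}"

def fallback_agent(request: str) -> str:
    return f"[general agent] handled: {request}"

class Orchestrator:
    """Decides which agent gets a request from those available."""

    def __init__(self) -> None:
        self.registry: list[tuple[set[str], Callable[[str], str]]] = []

    def register(self, keywords: set[str], agent: Callable[[str], str]) -> None:
        self.registry.append((keywords, agent))

    def route(self, request: str) -> str:
        words = set(request.lower().split())
        # Pick the agent whose keywords overlap the request most; a real
        # system would use an LLM or trained classifier for this step.
        best = max(self.registry, key=lambda entry: len(entry[0] & words))
        agent = best[1] if best[0] & words else fallback_agent
        return agent(request)

orchestrator = Orchestrator()
orchestrator.register({"transfer", "tax"}, transfer_tax_agent)
orchestrator.register({"refinance", "rate"}, refinance_agent)
print(orchestrator.route("What transfer tax applies to this mortgage?"))
```

The latency concern in the quote above is why the routing step itself is a design choice: a cheap classifier in front keeps the expensive models out of the hot path.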

Tapping into vendor relationships

Up to this point, the first step for most companies launching agentic AI has been to build it in-house, because specialized tools did not yet exist. But you cannot differentiate and create value by building generic LLM infrastructure or AI infrastructure.

“We often find that the most successful conversations we have with prospective customers tend to be with someone who has already built something in-house,” said Nalawadi. “They quickly realize that getting to a 1.0 is okay, but as the world evolves and the infrastructure evolves, and as they need to swap out the technology for something new, they don’t have the ability to orchestrate all of these things.”

Preparing for the complexity of agentic AI

In theory, agentic AI will only grow in complexity – the number of agents in an organization will rise, they will learn from each other, and the number of applications will explode. How can organizations prepare for the challenge?

“It means that the checks and balances in your system will get stressed more,” said Malhotra. “For anything with a regulatory process, you have a human in the loop to ensure that someone signs off on it. Do you have the right alerting and monitoring in place so that you know when an agent needs to hand off to a human?”

How can you be confident that an AI agent will behave reliably as it evolves?

“That part is really difficult if you haven’t thought about it at the beginning,” said Nalawadi. “The short answer is: before you even start building, you should have an eval infrastructure in place. Make sure you have a rigorous environment in which you know what good looks like from an AI agent, and that you have this test set. Keep referring back to it as you make improvements. A very simplistic way of thinking about evals is that they’re the unit tests for your agentic system.”
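The "unit tests for your agentic system" idea can be made concrete with a small sketch: a fixed test set of prompts, each paired with a check defining what good looks like, re-run after every change. The agent function and the checks below are illustrative assumptions, not anything from the panel.

```python
# Hypothetical sketch of an eval harness for an agentic system:
# a fixed test set with pass/fail checks, re-run as the agent evolves.
from typing import Callable

# A test case pairs an input with a predicate defining "what good looks like".
EvalCase = tuple[str, Callable[[str], bool]]

def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the agent over the eval set and return the pass rate."""
    passed = 0
    for prompt, check in cases:
        try:
            if check(agent(prompt)):
                passed += 1
        except Exception:
            pass  # a crash counts as a failure
    return passed / len(cases)

# Illustrative eval set; real checks might use regexes or an LLM judge.
cases: list[EvalCase] = [
    ("What is the transfer tax on $300,000?", lambda out: "$" in out),
    ("Do you offer refinancing?", lambda out: "yes" in out.lower()),
]

def toy_agent(prompt: str) -> str:
    # Stand-in for a real LLM-backed agent.
    if "tax" in prompt:
        return "The transfer tax would be about $1,200."
    return "Yes, we offer refinancing options."

print(f"pass rate: {run_evals(toy_agent, cases):.0%}")  # pass rate: 100%
```

The point of keeping the test set fixed is regression detection: if a prompt change or model swap drops the pass rate, you know before your customers do.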

The problem is that it is non-deterministic, Waanders added. Unit tests are critical, but the biggest challenge is that you don’t know what you don’t know – what incorrect behaviors an agent could possibly display, or how it might react in any given situation.

“The only way to find that out is by simulating conversations at scale – pushing it through thousands of different scenarios and then analyzing how it holds up and how it reacts,” said Waanders.
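A minimal sketch of that scale-simulation idea: generate many scenario variations from a few templates, run each through the agent, and aggregate the failure rate. The persona/mood/topic templates and the toy failure mode are invented for illustration; a real harness would drive actual conversations and use richer failure checks.

```python
# Hypothetical sketch of simulating conversations at scale: enumerate
# scenario variations, run each through the agent, aggregate failures.
# Templates and the toy failure mode are illustrative assumptions.
import itertools
import random

random.seed(0)  # reproducible runs

personas = ["first-time buyer", "investor", "refinancer"]
moods = ["calm", "frustrated", "confused"]
topics = ["closing costs", "transfer tax", "rate lock"]

def make_scenarios(n: int) -> list[str]:
    combos = list(itertools.product(personas, moods, topics))
    return [
        f"A {p} who is {m} asks about {t}."
        for p, m, t in random.choices(combos, k=n)
    ]

def toy_agent(scenario: str) -> str:
    # Stand-in for a real conversational agent; imagine it
    # occasionally stumbles on frustrated users.
    if "frustrated" in scenario and random.random() < 0.2:
        return "I don't understand."
    return f"Here is guidance on your question: {scenario}"

def simulate(n: int) -> float:
    """Run n simulated conversations and return the failure rate."""
    failures = sum(
        1 for s in make_scenarios(n)
        if "don't understand" in toy_agent(s)
    )
    return failures / n

print(f"failure rate over 1,000 simulated conversations: {simulate(1000):.1%}")
```

Slicing the failures by scenario dimension (persona, mood, topic) is what surfaces the "unknown unknowns" the panelists describe: failure modes that only show up in specific combinations.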
