
Research from Databricks shows that building better AI judges is not just a technical problem, but a human problem



The intelligence of AI models is not what blocks enterprise deployments. It is the inability to define and measure quality in the first place.

AI judges now play an increasingly important role here. In AI evaluation, a "judge" is an AI system that assesses the output of another AI system.

Judge Builder is Databricks' framework for building judges; it first shipped as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since launch in response to direct user feedback and real-world implementations.

Early versions focused on technical implementation, but customer feedback showed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three key challenges: getting stakeholders to agree on quality criteria, capturing expertise from domain experts whose time is limited, and deploying evaluation systems at scale.

"The intelligence of the model is usually not the bottleneck; the models are really intelligent," Databricks chief AI scientist Jonathan Frankle told VentureBeat in an exclusive briefing. "Instead, the real question is: How do we get the models to do what we want, and how do we know if they did what we wanted?"

The “Ouroboros Problem” of AI Assessment

Judge Builder addresses what Pallavi Koppol, the Databricks research scientist who led its development, calls the "ouroboros problem." An ouroboros is an ancient symbol depicting a snake eating its own tail.

Using AI systems to evaluate AI systems presents a circular validation challenge.

"You want a judge to check whether your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're asking yourself: How do I know this judge is good?"

The solution lies in using "distance from human expert ground truth" as the primary evaluation function. By minimizing the gap between how an AI judge scores outputs and how domain experts would score them, companies can trust these judges as scalable proxies for human evaluation.
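The idea can be sketched in a few lines. This is an illustrative example only, not Databricks' implementation: `distance_from_experts`, the rating scale, and the sample scores are all assumptions made up for the demonstration.

```python
# Hypothetical sketch: score a candidate judge by its distance from expert
# ground truth. Lower distance means the judge is a better proxy for humans.

def distance_from_experts(judge_scores, expert_scores):
    """Mean absolute gap between a judge's ratings and expert ratings."""
    assert len(judge_scores) == len(expert_scores)
    return sum(abs(j - e) for j, e in zip(judge_scores, expert_scores)) / len(judge_scores)

experts = [5, 4, 1, 3, 5]   # expert ratings on a 1-5 scale
judge_a = [5, 4, 2, 3, 5]   # candidate judge A tracks the experts closely
judge_b = [3, 3, 3, 3, 3]   # candidate judge B always hedges to the middle

print(distance_from_experts(judge_a, experts))  # 0.2
print(distance_from_experts(judge_b, experts))  # 1.4
```

Judge A's small distance is what would justify trusting it as a stand-in for the experts; judge B looks "safe" but is useless as a proxy.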

This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Instead of asking whether an AI output passed a general quality check, Judge Builder creates highly specific evaluation criteria tailored to each company's expertise and business needs.

The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow prompt optimization tools and can work with any underlying model. Teams can version-control their judges, track performance over time, and deploy multiple judges simultaneously across different quality dimensions.

Lessons Learned: Building Judges That Actually Work

Databricks’ work with enterprise customers revealed three key insights that apply to anyone building AI judges.

Lesson one: your experts don't agree as much as you think. When quality is subjective, companies find that even their own subject matter experts disagree about what constitutes an acceptable result. A customer service response may be factually accurate but use an inappropriate tone. A financial overview may be comprehensive but too technical for the target audience.

"One of the biggest lessons of this entire process is that all problems become people problems," Frankle said. "The hardest thing is turning an idea from a person's brain into something concrete. And the even harder part is that companies are not made of one brain, but of many brains."

The fix is batch annotation with inter-rater reliability checks. Teams annotate examples in small batches, then measure agreement scores before moving on, so misalignment is caught early. In one case, three experts gave the same output ratings of 1, 5 and neutral before discussion revealed they were interpreting the evaluation criteria differently.
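A minimal version of such a gate might look like the following. This is a sketch, not Databricks' method: `pairwise_agreement`, the exact-match agreement measure, and the 0.6 threshold as a gate are simplifying assumptions (real inter-rater reliability metrics such as Cohen's kappa correct for chance agreement).

```python
from itertools import combinations

# Illustrative sketch: a simple pairwise agreement score over a batch of
# annotations, checked before the team moves on to the next batch.

def pairwise_agreement(annotations):
    """annotations: one list of expert ratings per example.
    Returns the fraction of rater pairs that agree exactly, over all examples."""
    total, agree = 0, 0
    for ratings in annotations:
        for a, b in combinations(ratings, 2):
            total += 1
            agree += (a == b)
    return agree / total

batch = [
    [5, 5, 5],  # all three experts agree
    [1, 5, 3],  # the "1 / 5 / neutral" case from the article: no agreement
    [4, 4, 3],  # partial agreement
]

score = pairwise_agreement(batch)
print(round(score, 2))  # 0.44
if score < 0.6:
    print("Align on the rubric before annotating the next batch")
```

The point of the check is the branch at the end: low agreement triggers a discussion about the rubric rather than more noisy labels.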

Companies using this approach achieve inter-rater reliability scores of up to 0.6, compared with typical scores around 0.3 from external annotation services. Higher agreement translates directly into better judge performance because the training data contains less noise.

Lesson two: break vague criteria into specific judges. Instead of one judge assessing whether an answer is "relevant, factual and concise," create three separate judges, each targeting a single quality dimension. This granularity matters because a failing "overall quality" score shows that something is wrong, but not what needs to be fixed.
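Structurally, the split might look like this. Everything here is a made-up sketch: the prompts, the `evaluate` helper, and the stub model stand in for whatever judge prompts and LLM endpoint a team actually uses.

```python
# Sketch: one targeted judge per quality dimension instead of a single
# "overall quality" judge. call_llm is a placeholder for a real model call.

JUDGE_PROMPTS = {
    "relevance": "Does the answer address the user's question? Reply 1-5.",
    "factuality": "Is every claim in the answer supported? Reply 1-5.",
    "conciseness": "Is the answer free of padding? Reply 1-5.",
}

def evaluate(answer, question, call_llm):
    # One verdict per dimension: a failure now points at what to fix.
    return {
        name: call_llm(f"{prompt}\nQuestion: {question}\nAnswer: {answer}")
        for name, prompt in JUDGE_PROMPTS.items()
    }

# Stub model for demonstration; a real deployment would call an actual LLM.
fake_llm = lambda prompt: 2 if "padding" in prompt else 5

print(evaluate("A very long answer...", "What is a judge?", fake_llm))
# {'relevance': 5, 'factuality': 5, 'conciseness': 2}
```

With one blended score, the same output might get a middling 4 with no explanation; here the low conciseness score identifies the dimension to fix.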

The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down correctness judge, then discovered through data analysis that correct answers almost always cited the top two retrieval results. That insight became a new production-friendly judge that could check correctness without requiring ground-truth labels.
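A proxy judge of that kind could be as simple as the sketch below. The function name, the document-id convention, and the check itself are illustrative assumptions, not the customer's actual implementation.

```python
# Hedged sketch of a bottom-up proxy judge: since correct answers were
# observed to almost always cite the top two retrieved documents, citing
# them becomes a cheap production signal for correctness.

def cites_top_retrievals(answer, retrieved_ids, top_k=2):
    """True if the answer mentions every one of the top-k retrieved doc ids."""
    return all(doc_id in answer for doc_id in retrieved_ids[:top_k])

retrieved = ["doc-17", "doc-42", "doc-99"]
print(cites_top_retrievals("Per doc-17 and doc-42, the rate is 3%.", retrieved))  # True
print(cites_top_retrievals("The rate is 3%.", retrieved))                          # False
```

The appeal is that this runs on every production request with no human labels, at the cost of being a correlate of correctness rather than correctness itself.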

Lesson three: you need fewer examples than you think. Teams can build solid judges from just 20 to 30 well-chosen examples. The key is to pick edge cases that surface disagreement, rather than obvious examples everyone already agrees on.
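One plausible way to select those examples is to rank candidates by how much the expert ratings spread, then keep the most contested ones. This is an assumed heuristic for illustration, not a described Databricks procedure.

```python
from statistics import pstdev

# Sketch: prefer annotation examples where experts disagree most, since
# edge cases teach a judge more than examples everyone already agrees on.

def pick_edge_cases(candidates, k):
    """candidates: list of (example_id, ratings). Returns the k ids with
    the highest rating spread (population standard deviation)."""
    ranked = sorted(candidates, key=lambda c: pstdev(c[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]

candidates = [
    ("obvious-good", [5, 5, 5]),
    ("tone-dispute", [1, 5, 3]),   # experts split: exactly what we want to discuss
    ("mostly-fine", [4, 4, 3]),
]

print(pick_edge_cases(candidates, 2))  # ['tone-dispute', 'mostly-fine']
```

The unanimous example drops out: it would cost annotation time without revealing anything about where the rubric is ambiguous.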

"We are able to complete this process in as little as three hours for some teams, so it doesn't really take that long to build a good judge," Koppol said.

Production results: From pilots to seven-figure deployments

Frankle shared three metrics that Databricks uses to measure the success of Judge Builder: whether customers want to use it again, whether they increase their AI spending, and whether they move further along their AI journey.

For the first metric, one customer built more than a dozen judges after their first workshop. "This customer created over a dozen judges after we rigorously walked them through the process for the first time using this framework," Frankle said. "They really put a lot of effort into the judges and now measure everything."

For the second metric, the business impact is clear. "There are several customers who have gone through this workshop and become seven-figure spenders on GenAI at Databricks in a way they weren't before," Frankle said.

The third metric illustrates the strategic value of Judge Builder. Customers who were previously hesitant to use advanced techniques like reinforcement learning now feel confident using them, because they can measure whether improvements have actually occurred.

"There are clients who have done very advanced things after having these judges, things they were previously reluctant to do," Frankle said. "They went from a little prompt engineering to reinforcement learning with us. Why spend the money and energy on reinforcement learning if you don't know whether it actually made a difference?"

What companies should do now

The teams that successfully move AI from pilot to production treat judges not as one-off artifacts but as evolving assets that grow with their systems.

Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement and one observed failure mode; these become your first judge portfolio.

Second, run lean workflows with subject matter experts. A few hours reviewing 20 to 30 edge cases is enough to calibrate most judges. Use batch annotation and inter-rater reliability checks to denoise your data.

Third, schedule regular reviews of your judges using production data. As your system evolves, new failure modes will emerge, and your judge portfolio should evolve with it.

"A judge is a way to evaluate a model; it's also a way to create guardrails, a metric for prompt optimization, and a metric for reinforcement learning," Frankle said. "Once you have a judge that you know reflects your human tastes in an empirical form you can query as often as you like, you can use it in 10,000 different ways to measure or improve your agents."
