
EAGLET boosts the performance of AI agents on longer-term tasks by creating custom plans



2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and others in the AI industry. And in many ways it has been: leading AI model providers such as OpenAI and Google, as well as Chinese competitors like Alibaba, have released fine-tuned AI models or applications focused on a limited number of tasks, such as web search and report generation.

However, a major hurdle to a future of high-performing, reliable AI agents remains: keeping them on task when the task spans multiple steps. Third-party benchmark testing shows that even the most powerful AI models have higher failure rates the more steps a task requires and the longer they spend on it (hours or more).

A new academic framework called EAGLET proposes a practical and efficient method to improve the long-horizon task performance of LLM-based agents – without the need for manual data labeling or retraining.

Developed by researchers at Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

EAGLET is a fine-tuned language model that interprets task instructions – typically provided as prompts by the user or the agent's operating environment – and generates a high-level plan for the executor agent (powered by its own LLM). It does not intervene during execution, but its pre-execution plan helps reduce planning errors and improve task completion rates.

Solving the planning problem for long-horizon agents

Many LLM-based agents struggle with tasks over extended periods of time because they rely on reactive, step-by-step thinking. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.

EAGLET addresses this limitation by introducing a global planning module that works alongside the executor agent.

Instead of merging planning and action generation into a single model, EAGLET separates them, enabling more coherent task-level strategies.
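To make that separation concrete, here is a minimal sketch of a pre-execution global planner wrapped around an executor loop. The function names, prompts, and environment interface are illustrative assumptions, not the authors' actual API.

```python
# Minimal sketch of EAGLET-style planner/executor separation. All names,
# prompts, and the env interface are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API call."""
    raise NotImplementedError

def global_plan(task_instruction: str) -> str:
    # The fine-tuned planner turns the task instruction into a high-level
    # plan once, before execution; it never sees intermediate steps.
    return call_llm(f"Write a concise high-level plan for this task:\n{task_instruction}")

def run_agent(task_instruction: str, env, max_steps: int = 30) -> None:
    plan = global_plan(task_instruction)   # one-shot, pre-execution planning
    observation = env.reset(task_instruction)
    for _ in range(max_steps):
        # The executor conditions every step on the fixed global plan plus
        # the latest observation, instead of replanning reactively.
        action = call_llm(
            f"Task: {task_instruction}\nPlan: {plan}\n"
            f"Observation: {observation}\nNext action:"
        )
        observation, done = env.step(action)
        if done:
            break
```

The key design choice is that the planner runs exactly once: the executor never calls back into it mid-task.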

A two-stage training pipeline without human annotations

EAGLET's planner is trained using a two-stage process that requires no human-written plans or annotations.

In the first phase, synthetic plans are created using high-performance LLMs such as GPT-5 and DeepSeek-V3.1-Think.

These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both experienced and novice execution agents.
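The article does not spell out the filtering algorithm, but the behavior it describes – keeping only plans that help both an expert and a novice executor – can be sketched roughly as follows (the scoring interface is an assumption):

```python
# Rough sketch of homologous consensus filtering as described above: keep a
# synthetic plan only if it improves task performance for BOTH an expert and
# a novice executor. The scoring interface is an assumption.

def task_score(executor, task, plan=None) -> float:
    """Run one executor on a task (optionally guided by a plan); return a score."""
    raise NotImplementedError

def consensus_filter(plans_by_task, expert, novice):
    kept = []
    for task, plan in plans_by_task:
        # Baseline: each executor attempts the task with no plan at all.
        expert_base = task_score(expert, task)
        novice_base = task_score(novice, task)
        # With the candidate plan prepended to the executor's context.
        expert_gain = task_score(expert, task, plan) > expert_base
        novice_gain = task_score(novice, task, plan) > novice_base
        # Consensus: the plan must help both, not just the stronger model.
        if expert_gain and novice_gain:
            kept.append((task, plan))
    return kept
```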

In the second phase, a rule-based reinforcement learning process further refines the planner by assessing the extent to which each plan helps multiple agents succeed using a tailored reward function.

Introducing the Executor Capability Gain Reward (ECGR)

One of EAGLET’s most important innovations is the Executor Capability Gain Reward (ECGR).

This reward measures the value of a generated plan by checking whether it helps both high- and low-performing agents complete tasks more successfully and with fewer steps.

It also includes a decay factor to encourage shorter, more efficient task progressions. This approach avoids over-rewarding plans that are only useful to already competent agents and promotes more general planning guidance.
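The exact reward is defined in the paper, but its described properties – a capability gain averaged across executors of different strength, discounted by trajectory length – suggest a shape like the following hedged sketch:

```python
# Illustrative sketch of an ECGR-style reward, built only from the properties
# described above. The paper defines the exact formula; this is an assumption
# for intuition, not the authors' implementation.

def rollout(executor, task, plan):
    """Run the executor on the task; return (success in {0, 1}, step count)."""
    raise NotImplementedError

def ecgr(plan, task, executors, gamma: float = 0.9) -> float:
    gains = []
    for ex in executors:  # a mix of high- and low-performing executors
        base_success, _ = rollout(ex, task, plan=None)
        plan_success, steps = rollout(ex, task, plan=plan)
        # Decay factor: shorter successful trajectories earn more reward.
        gains.append((plan_success - base_success) * (gamma ** steps))
    # Averaging over executors avoids over-rewarding plans that only help
    # already-competent agents.
    return sum(gains) / len(gains)
```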

Compatible with existing agents and models

The EAGLET planner is modular and "plug and play," meaning it can be inserted into existing agent pipelines without retraining the executor.

In evaluations, the planner improved performance across a variety of base models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches such as reflection.
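In practice, this kind of plug-and-play integration can be as simple as prepending the generated plan to the executor's existing prompt. A minimal sketch, with prompt wording that is an assumption rather than the paper's template:

```python
# The global plan is simply prepended to a standard ReAct-style prompt, so
# the executor model needs no retraining. Prompt wording is an assumption.

def build_react_prompt(task: str, history: str, plan: str | None = None) -> str:
    plan_block = f"Global plan (high-level guidance):\n{plan}\n\n" if plan else ""
    return (
        f"{plan_block}Task: {task}\n\n"
        "Solve the task using interleaved Thought / Action / Observation steps.\n"
        f"{history}\nThought:"
    )
```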

Top performance across all three benchmarks

EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based laboratory environment; ALFWorld, which tasks agents with completing household activities using natural language in a simulated home environment; and WebShop, which evaluates targeted behavior in a realistic online shopping interface.

In all three cases, the EAGLET-equipped executor agents outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

In experiments with the open source model Llama-3.1-8B-Instruct, EAGLET increased average performance from 39.5 to 59.4, an increase of +19.9 points across all tasks.

On unseen ScienceWorld scenarios, it lifted performance from 42.2 to 61.6.

On seen ALFWorld scenarios, EAGLET improved results from 22.9 to 54.3, a more than 2.3x gain.

Gains also held for more powerful models, which start from higher baselines.

For example, GPT-4.1 with EAGLET improved its average score from 75.5 to 82.2, and GPT-5 rose from 84.5 to 88.1 despite already performing strongly.

In some benchmarks, gains reached up to +11.8 points, for example when combining EAGLET with the ETO executor method on unseen ALFWorld tasks.

Compared to other planning frameworks such as MPO, EAGLET consistently delivered higher task completion rates. For example, in ALFWorld’s unseen tasks with GPT-4.1, MPO scored 79.1, while EAGLET scored 83.6 – a lead of +4.5 points.

Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as the executor, the average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5 it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.

Efficiency gains in training and execution

Compared to RL-based methods such as GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with about one-eighth the training effort.

This efficiency also carries over to execution: agents using EAGLET typically required fewer steps to complete tasks. This leads to a reduction in inference time and computational cost in production scenarios.

No public code yet

As of the version submitted to arXiv, the authors had not published an open-source implementation of EAGLET. It is unclear if and when the code will be released, under what license, or how it will be maintained, which could limit the framework's near-term usefulness for enterprise use.

VentureBeat has reached out to the authors to clarify these points and will update this article when we hear back.

Questions still remain about enterprise deployment

Although the planner is described as plug-and-play, it remains unclear whether EAGLET can easily integrate with popular enterprise agent frameworks such as LangChain or AutoGen, or whether a custom stack is required to separate planning from execution.

Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat asked the researchers whether the homologous consensus filtering method could be adapted for teams that only have access to a single executor model or limited computing resources.

The authors of EAGLET report success with all model types and sizes, but it is not yet known what the minimum feasible model scale is for practical use. For example, can enterprise teams effectively use the planner with open models under 10 billion parameters in latency-sensitive environments? Additionally, the framework may provide industry-specific value in areas such as customer support or IT automation. However, it remains to be seen how easily the planner can be fine-tuned or customized for such industries.

Real-time vs. pre-generated planning

Another open question is how best to use EAGLET in practice. Should the planner run in real time alongside the executor within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat posed this question to the authors and will report on any findings.
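For teams leaning toward the offline route, a simple cache keyed by task type illustrates the trade-off: planning cost is paid once per task type rather than once per request. Everything here (the cache key and planner interface) is an assumption for illustration.

```python
# Sketch of the offline option: pre-generate plans for known task types and
# cache them, so the planner adds no latency at request time.

import hashlib

PLAN_CACHE: dict[str, str] = {}

def get_plan(task_type: str, planner) -> str:
    key = hashlib.sha256(task_type.encode()).hexdigest()
    if key not in PLAN_CACHE:
        # Cache miss: one offline planner call per task type.
        PLAN_CACHE[key] = planner.generate(task_type)
    # Cache hit: zero extra planning latency for the executor.
    return PLAN_CACHE[key]
```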

Strategic trade-offs for enterprise teams

For technical leaders at midsize to large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Organizations must weigh the potential gains in task completion and efficiency against the cost of reproducing or approximating the training process in-house.

Possible use cases in corporate environments

For companies developing agentic AI systems – especially in environments that require multi-step planning, such as IT automation, customer support, or online interactions – EAGLET offers a template for integrating planning without retraining. Its ability to guide both open and closed source models, along with its efficient training method, could make it an attractive starting point for teams looking to improve agent performance with minimal overhead.
