A new framework from researchers at The University of Hong Kong (HKU) and collaborating institutions offers an open-source foundation for building robust AI agents that can operate computers. The framework, called OpenCUA, comprises the tools, data, and recipes for scaling the development of computer-use agents (CUAs).
Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open-source models and closely rivaling closed agents from leading AI labs such as OpenAI and Anthropic.
Computer-use agents are designed to carry out tasks on a computer autonomously, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with critical details about their training data, architectures, and development processes kept private.
“As the lack of transparency limits technical progress and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations, and risks,” the researchers write in their paper.
Open-source efforts, meanwhile, face hurdles of their own. There has been no scalable infrastructure for collecting the diverse, large-scale data required to train these agents. Existing open-source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects provide insufficient detail about their methods, making it difficult for others to replicate their work.
According to the paper, these limitations “hinder overall progress and restrict meaningful investigation into their scalability, generalizability, and potential learning approaches.”
OpenCUA is an open-source framework designed to address these challenges by scaling both data collection and the models themselves. At its core is the AgentNet Tool, which records human demonstrations of computer tasks across different operating systems.
The tool streamlines data collection by running on an annotator’s personal computer, capturing screen video, mouse and keyboard input, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories,” in which a screenshot of the computer (the state) is paired with the user’s corresponding action (a click, keystroke, etc.). Annotators can then review, edit, and submit these demonstrations.
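To make the format concrete, here is a minimal sketch in Python of how such a state-action trajectory might be represented. The field names and action syntax are illustrative assumptions, not OpenCUA’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One state-action pair: what the screen looked like, and what the user did."""
    screenshot_path: str  # the state: a capture of the screen at this moment
    a11y_tree: dict       # accessibility tree: structured info on screen elements
    action: str           # the action, e.g. 'click(x=412, y=88)' or 'type("hello")'

@dataclass
class Trajectory:
    """A full demonstration: the task description plus the ordered steps."""
    task: str
    os: str               # e.g. 'windows', 'macos', 'ubuntu'
    steps: list[Step] = field(default_factory=list)

# Example: a two-step demonstration of renaming a file
demo = Trajectory(
    task="Rename report.txt to final_report.txt",
    os="ubuntu",
    steps=[
        Step("frames/000.png", {"role": "icon", "name": "report.txt"},
             "right_click(x=204, y=310)"),
        Step("frames/001.png", {"role": "menuitem", "name": "Rename"},
             "click(x=236, y=355)"),
    ],
)
```

Each step pairs one screen capture with exactly one user action, which is what makes the recordings directly usable as supervised training data.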
Using this tool, the researchers collected the AgentNet dataset, which contains over 22,600 task demonstrations across Windows, macOS, and Ubuntu, spanning more than 200 applications and websites. “This dataset authentically captures the complexity of human behavior and environmental dynamics from users’ personal computing environments,” the paper states.
The researchers recognized that screen-recording tools raise significant privacy concerns for enterprises, and designed the AgentNet Tool with security in mind. Xinyuan Wang, co-author of the paper and a PhD student at HKU, explained that they implemented a multi-layer privacy-protection framework. “First, annotators can fully observe the data they generate … before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual review for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments handling sensitive customer or financial data,” Wang added.
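The final automated pass could look something like the following sketch; `llm_classify` is a hypothetical stand-in for a call to a large model, and the regex pre-filter is our own simplification rather than the paper’s described method:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def llm_classify(text: str) -> bool:
    # Stand-in for a large-model call that flags sensitive content (hypothetical).
    return False

def review_frame(ocr_text: str) -> str:
    """Layered check on text extracted from one recorded frame:
    cheap pattern matching first, then a model-based scan."""
    if EMAIL_RE.search(ocr_text):
        return "flag: possible email address"
    if llm_classify(ocr_text):
        return "flag: model detected sensitive content"
    return "ok"

print(review_frame("Contact: jane.doe@example.com"))  # flag: possible email address
```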
To speed up evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
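Conceptually, the step-level scoring works like the sketch below (our own simplification, not AgentNetBench’s actual code): each step accepts any one of several gold actions, so an agent can be graded offline without replaying the task in a live environment.

```python
def score_trajectory(predicted: list[str], gold: list[set[str]]) -> float:
    """Fraction of steps where the predicted action matches any accepted gold action."""
    correct = sum(1 for pred, accepted in zip(predicted, gold) if pred in accepted)
    return correct / len(gold)

# Step 2 accepts two equivalent ways of confirming the action.
gold = [
    {"click(x=204, y=310)"},
    {"double_click(x=204, y=310)", "press('enter')"},
]
print(score_trajectory(["click(x=204, y=310)", "press('enter')"], gold))  # 1.0
```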
The OpenCUA framework introduces a new pipeline for processing the data and training computer-use agents. The first step converts the raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yielded limited performance gains, even with large amounts of data.
The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, covering planning, memory, and reflection. This structured reasoning is organized into three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally the concise, executable action. This approach helps the agent develop a deeper understanding of the tasks.
“We find natural-language reasoning to be of crucial importance for generalizable computer-use models, helping CUAs internalize cognitive capabilities,” the researchers write.
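A CoT-augmented training example following those three levels might look roughly like this; the field names and wording are illustrative assumptions, not the paper’s exact format:

```python
# One training example after CoT augmentation. The three levels below
# (observation -> thought -> action) mirror the structure described in the
# paper; the exact fields and phrasing here are illustrative assumptions.
cot_sample = {
    "instruction": "Export the spreadsheet as a PDF",
    "observation": "The spreadsheet is open; the File menu is visible top-left.",
    "thought": (
        "No menus are open yet. Export options usually live under File, "
        "so I should open the File menu and look for 'Export as PDF'."
    ),
    "action": "click(x=32, y=14)",  # the concise, executable step
}
```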
This data-synthesis pipeline is a general framework that enterprises can adapt to train agents on their own unique internal tools. According to Wang, a company can record demonstrations of its proprietary workflows and use the same reflector-and-generator pipeline to create the necessary training data. “This enables them to bootstrap a powerful agent tailored to their internal tools without manually annotating reasoning traces,” he explained.
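Under those assumptions, the adaptation loop might be wired up roughly as follows; `generate_reasoning` and `reflect` are hypothetical stand-ins for model calls, not OpenCUA’s actual API:

```python
def generate_reasoning(step) -> str:
    # Stand-in for a generator model that drafts an inner monologue for a step.
    return f"Plan: perform {step.action} to progress toward the task goal."

def reflect(step, reasoning: str) -> bool:
    # Stand-in for a reflector model that checks the draft against the recording.
    return step.action in reasoning

def augment(trajectory, max_retries: int = 3) -> None:
    """Attach validated reasoning to each recorded step of an internal workflow demo."""
    for step in trajectory.steps:
        for _ in range(max_retries):
            reasoning = generate_reasoning(step)
            if reflect(step, reasoning):  # keep only drafts the reflector accepts
                step.reasoning = reasoning
                break
```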
The researchers applied the OpenCUA framework to train a range of open-source VLMs, including variants of Qwen and Kimi-VL, with parameter sizes ranging from 3 to 32 billion. The models were evaluated on a range of online and offline benchmarks that test their ability to carry out tasks and understand GUIs.
The 32-billion-parameter model, OpenCUA-32B, set a new state-of-the-art success rate among open-source models on the OSWorld-Verified benchmark. It also surpassed OpenAI’s GPT-4o-based CUA and considerably narrowed the performance gap with Anthropic’s leading proprietary models.
For enterprise developers and product leaders, the research offers several important findings. The OpenCUA method is broadly applicable, improving performance across models of different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a variety of tasks and operating systems.
According to Wang, the framework is especially well suited to automating repetitive, labor-intensive enterprise workflows. “In the AgentNet dataset, for example, we captured some demonstrations of launching EC2 instances on Amazon AWS and configuring annotation parameters on MTurk,” he told VentureBeat. “These tasks involve many sequential steps but follow repeatable patterns.”
However, Wang noted that bridging the gap to live deployment raises important challenges around safety and reliability. “The biggest challenge in real-world deployment is safety and reliability: the agent must avoid mistakes that accidentally change system settings or trigger harmful side effects beyond the intended task,” he said.
The researchers have released the code, dataset, and weights for their models.
If open-source agents built on frameworks like OpenCUA continue to advance, they could reshape the relationship between knowledge workers and their computers. Wang envisions a future in which proficiency with complex software matters less than the ability to clearly articulate goals to an AI agent.
He described two main modes of working: “offline automation, in which the agent uses its broader software knowledge to carry out a task end-to-end,” and “online collaboration, in which the agent reacts in real time and works alongside humans, much like a colleague.” In essence, humans will supply the strategic “what” while increasingly sophisticated AI agents handle the operational “how.”