The rise of deep research features and other AI-powered analysis tools has spurred more models and services that aim to simplify that process and read more of the documents enterprises actually use.
Canadian AI company Cohere is banking on its models, including a newly released vision model, to power deep research features for enterprise use cases.
The company has released Command A Vision, a visual model aimed at enterprise use cases and built on the back of its Command A model. The 112-billion-parameter model can "unlock valuable insights from visual data and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis," the company said.
"Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges," the company said in a blog post.
This means Command A Vision can read and analyze the most common types of images companies need: graphs, charts, diagrams, scanned documents and PDFs.
Because it is built on Command A's architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities, allowing it to read words within images, and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces total cost of ownership for enterprises and is fully optimized for enterprise use cases.
Cohere said it followed a LLaVA architecture to build its Command A models, including the vision model. This architecture converts visual features into soft vision tokens, which can be split into different tiles.
These tiles are passed into Command A's text tower, "a dense, 111B-parameter textual LLM," the company said. "In this way, a single image consumes up to 3,328 tokens."
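As a rough illustration of how that per-image budget could decompose, the sketch below assumes a tiling scheme of up to 12 high-resolution tiles plus one global thumbnail, with 256 soft vision tokens per tile. Those tile and per-tile token counts are assumptions chosen for illustration because they multiply out to the stated 3,328-token ceiling; Cohere has not published the exact tiling.

```python
# Hypothetical sketch of a per-image vision-token budget.
# ASSUMPTIONS (not from Cohere): up to 12 high-res tiles plus 1 global
# thumbnail, 256 soft vision tokens per tile. 13 * 256 = 3,328 matches
# the stated per-image maximum, but the real scheme may differ.

TOKENS_PER_TILE = 256   # assumed soft vision tokens per tile
MAX_TILES = 12          # assumed maximum high-res tiles per image
GLOBAL_THUMBNAIL = 1    # assumed low-res overview tile

def vision_token_budget(num_tiles: int) -> int:
    """Tokens an image would consume if split into num_tiles tiles."""
    tiles = min(num_tiles, MAX_TILES) + GLOBAL_THUMBNAIL
    return tiles * TOKENS_PER_TILE

print(vision_token_budget(12))  # maximum budget: 3328
```

Under these assumptions, a small image encoded as a single tile would still consume 512 tokens (one tile plus the thumbnail), while a large, detail-rich document page would hit the full 3,328-token ceiling.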
Cohere said it trained the vision model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training with reinforcement learning from human feedback (RLHF).
"This approach enables the mapping of image encoder features to the language model embedding space," the company said. "In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of multimodal tasks."
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI's GPT-4.1, Meta's Llama 4 Maverick, Mistral's Pixtral Large and Mistral Medium 3 across nine benchmark tests. The company did not mention whether it tested the model against Mistral's OCR-focused API, Mistral OCR.
Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared with 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most large language models (LLMs) these days are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally work with more graphical documents, such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.
With deep research on the rise, the need to bring in models capable of reading, analyzing and even downloading unstructured data has grown.
Cohere also said it is offering Command A Vision as an open-weights model, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.