

This weekend, Andrej Karpathy, the former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he didn’t want to read it alone. He wanted to read it in the company of a council of artificial intelligences, each of which would offer its own perspective, critique the others, and finally, under the guidance of a "Chairman," distill a final answer.
To achieve this, Karpathy wrote what he called a "vibe code" project – software written quickly, mostly by AI assistants, built more for fun than for function. He published the result, a repository called "LLM Council," to GitHub with a clear disclaimer: "I won’t support it in any way… Code is now ephemeral and libraries no longer exist."
But for technical decision-makers across the enterprise landscape, looking past the casual disclaimer reveals something far more meaningful than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy has sketched a reference architecture for the most critical, still-undefined layer of the modern software stack: the orchestration middleware between enterprise applications and the volatile market of AI models.
As companies finalize their platform investments for 2026, LLM Council offers a distilled view of the "build vs. buy" reality of AI infrastructure. It shows that while the logic of routing and aggregating AI models is surprisingly simple, the real complexity lies in the operational wrapper required to make it enterprise-ready.
To the casual observer, the LLM Council web application looks almost identical to ChatGPT. A user types a query into a chat box. But behind the scenes, the application triggers a sophisticated three-step workflow that mirrors how human decision-making bodies work.
First, the system routes the user’s query to a group of frontier models. In Karpathy’s default configuration, this includes OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. These models generate their first responses in parallel.
In the second step, the software conducts a peer review. Each model is fed the anonymized responses of its peers and asked to rank them for accuracy and insight. This step turns each AI from a generator into a critic and enforces a level of quality control rarely seen in standard chatbot interactions.
Finally, a designated "Chairman LLM" – currently configured as Google’s Gemini 3 – receives the original query, the individual responses, and the peer rankings. It distills this mass of context into a single, authoritative answer for the user.
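The three-stage workflow described above can be sketched in a few lines. This is a minimal illustration, not Karpathy’s actual code: `ask_model` is a stub standing in for a real API call, and the prompts are invented for clarity.

```python
def ask_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return f"[{model}] answer to: {prompt}"

def run_council(query: str, council: list[str], chairman: str) -> str:
    # Stage 1: every council model answers the query (in the real app,
    # these calls are dispatched in parallel).
    answers = {m: ask_model(m, query) for m in council}

    # Stage 2: each model ranks the anonymized responses of its peers.
    anonymized = {f"Response {i + 1}": a for i, a in enumerate(answers.values())}
    review_prompt = "Rank these responses by accuracy and insight:\n" + \
        "\n".join(f"{k}: {v}" for k, v in anonymized.items())
    rankings = {m: ask_model(m, review_prompt) for m in council}

    # Stage 3: the chairman sees the query, all answers, and all rankings,
    # and produces the final synthesized reply.
    final_prompt = (
        f"Query: {query}\nAnswers: {answers}\nRankings: {rankings}\n"
        "Synthesize a single authoritative answer."
    )
    return ask_model(chairman, final_prompt)
```

The structure makes clear why the orchestration itself is simple: it is three sequential prompt-building steps, with the models doing all the heavy lifting.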
Karpathy noted that the results are often surprising. "Very often, models are surprisingly willing to select another LLM’s response as superior to their own," he wrote on X (formerly Twitter). He described using the tool to read book chapters and found that the models consistently praised GPT-5.1 as the most insightful but rated Claude the worst. Karpathy’s own qualitative assessment, however, diverged from his digital council’s: he found GPT-5.1 "too wordy" and preferred the "condensed and processed" output of Gemini.
For CTOs and platform architects, the value of LLM Council lies not in its literary criticism but in its construction. The repository serves as a primary document showing exactly what a modern, minimal AI stack looks like at the end of 2025.
The application is built on a "thin" architecture. The backend uses FastAPI, a modern Python framework, while the frontend is a standard React application created with Vite. Data storage is handled not by a complex database but by simple JSON files written to the local disk.
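The JSON-on-disk approach is worth pausing on, because it is the clearest expression of the "thin" philosophy. A sketch of what such storage looks like, assuming a hypothetical conversation schema (the field names here are illustrative, not taken from the repository):

```python
import json
import time
from pathlib import Path

def save_conversation(base_dir: Path, conv_id: str, messages: list[dict]) -> Path:
    # Persist the whole conversation as one JSON file on local disk --
    # no database, no migrations, mirroring the project's thin storage.
    base_dir.mkdir(parents=True, exist_ok=True)
    path = base_dir / f"{conv_id}.json"
    path.write_text(json.dumps(
        {"id": conv_id, "updated": time.time(), "messages": messages}, indent=2))
    return path

def load_conversation(base_dir: Path, conv_id: str) -> list[dict]:
    # Read the file back; there is no locking or concurrency control.
    return json.loads((base_dir / f"{conv_id}.json").read_text())["messages"]
```

The trade-off is obvious: zero operational overhead for a single user, but no concurrency, indexing, or access control once more than one person touches the data.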
The linchpin of the entire operation is OpenRouter, an API aggregator that normalizes the differences between model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, and Anthropic. The application doesn’t care which company provides the intelligence; it simply sends a prompt and waits for a response.
This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped out by editing a single line in a configuration file – specifically the COUNCIL_MODELS list in the backend code – the architecture protects the application from vendor lock-in. If a new model from Meta or Mistral tops the leaderboards next week, it can be added to the council in seconds.
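The single-broker pattern is easy to visualize: every model, regardless of vendor, receives the same chat-completions payload, so "swapping a vendor" really is a one-line list edit. The model identifiers below follow OpenRouter’s provider/model slug convention but are illustrative, not a copy of the repository’s configuration.

```python
# Editing this list is the entire integration effort for adding or
# removing a vendor -- the broker normalizes everything else.
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

def build_request(model: str, prompt: str) -> dict:
    # One payload shape for every provider, OpenAI chat-completions style.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# The same prompt fanned out to the whole council:
requests_batch = [build_request(m, "Summarize chapter 3") for m in COUNCIL_MODELS]
```

In a live system each payload would be POSTed to the broker’s chat-completions endpoint; here only the vendor-agnostic request shape is shown.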
While the core logic of LLM Council is elegant, it also serves as a clear illustration of the gulf between a weekend hack and a production system. For an enterprise platform team, cloning the Karpathy repository is just the first step in a marathon.
A technical review of the code reveals the missing "boring" infrastructure that commercial vendors sell at a premium. The system lacks authentication; anyone with access to the web interface can query the models. There is no concept of user roles, meaning a junior developer has the same access rights as the CIO.
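Even the most basic hardening step, a shared API key, is absent. A minimal sketch of what that first layer might look like, assuming a key stored in an environment variable (the variable name COUNCIL_API_KEY is invented for illustration):

```python
import hmac
import os

def check_api_key(provided: str) -> bool:
    # Compare against a key from the environment in constant time to
    # avoid timing side channels. COUNCIL_API_KEY is an illustrative name.
    expected = os.environ.get("COUNCIL_API_KEY", "")
    return bool(expected) and hmac.compare_digest(provided, expected)
```

In a FastAPI backend this check would typically be wired in as a dependency on every route; role-based access control and audit logging would still remain to be built on top.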
Furthermore, the governance layer is nonexistent. In an enterprise environment, sending data to four different external AI providers simultaneously raises immediate compliance concerns. There is no mechanism to redact personally identifiable information (PII) before it leaves the local network, nor an audit trail to track who asked what.
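A PII-redaction hook would sit between the chat box and the outbound API calls. The sketch below uses deliberately naive regexes to show where such a hook belongs; production systems rely on dedicated PII-detection services rather than a handful of patterns.

```python
import re

# Naive, illustrative patterns -- real deployments need far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each match with its label before the prompt leaves the network.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Every user prompt would pass through `redact` before being fanned out to the four external providers, and the original-to-redacted mapping would feed the audit trail.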
Reliability is another open question. The system assumes that the OpenRouter API is always up and that the models respond promptly. It lacks the circuit breakers, fallback strategies, and retry logic that keep mission-critical applications running when a provider fails.
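The simplest version of that missing resilience layer is retry-with-backoff plus provider fallback. A sketch, assuming each provider is represented as a callable (a simplification of how a real gateway would model endpoints):

```python
import time

def call_with_fallback(prompt, providers, attempts=3, base_delay=0.05):
    # Try each provider in order; retry transient failures with exponential
    # backoff before falling through to the next provider in the list.
    last_error = None
    for call in providers:
        for attempt in range(attempts):
            try:
                return call(prompt)
            except Exception as exc:
                last_error = exc
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```

A production gateway would add circuit breakers (stop calling a provider that keeps failing) and per-request timeouts on top of this basic loop.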
These absences are not flaws in Karpathy’s code – he explicitly stated that he has no intention of supporting or improving the project – but they define the value proposition for the commercial AI infrastructure market.
Companies like LangChain, AWS Bedrock, and various AI gateway startups are essentially selling the "hardening" around the core logic Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable enterprise platform.
Perhaps the most provocative aspect of the project is the philosophy under which it was built. Karpathy described the development process as "99% vibe-coded," meaning he relied heavily on AI assistants to generate the code rather than writing it line by line himself.
"Code is now ephemeral and libraries are over. Ask your LLM to change it to your liking," he wrote in the repository’s documentation.
This statement marks a radical shift in software engineering practice. Traditionally, companies build internal libraries and abstractions to manage complexity, maintaining them over years. Karpathy suggests a future where code is treated as ephemeral scaffolding – disposable, easily rewritten by AI, and not intended to last.
This poses a difficult strategic question for enterprise decision-makers. If internal tools can be "vibe coded" in a weekend, does it still make sense to buy expensive, rigid software suites for internal workflows? Or should platform teams empower their engineers to build custom, disposable tools that meet their exact needs at a fraction of the cost?
Beyond its architecture, the LLM Council project inadvertently sheds light on a specific risk in automated AI deployment: the divergence between human and machine judgment.
Karpathy’s observation that his models preferred GPT-5.1 while he preferred Gemini suggests that AI models may share common biases. They may favor verbosity, specific formatting, or rhetorical confidence – traits that don’t necessarily align with human business needs for brevity and precision.
As companies increasingly rely on "LLM-as-judge" systems to assess the quality of their customer-facing bots, this discrepancy matters. If the automated rater consistently rewards wordy and rambling answers while human customers want succinct solutions, the metrics will show success even as customer satisfaction drops. Karpathy’s experiment suggests that relying solely on AI to evaluate AI is a strategy fraught with hidden alignment problems.
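One way a platform team could quantify this judge/human divergence is Spearman rank correlation between the two orderings of the same set of answers. The rankings below are invented for illustration (1 = best of four answers); the formula assumes no ties.

```python
def spearman(rank_a, rank_b):
    # Spearman rank correlation for two rankings without ties:
    # 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the per-item rank gap.
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d2) / (n * (n * n - 1))

llm_judge_ranks = [1, 2, 3, 4]   # how the "judge" model ordered four answers
human_ranks = [3, 1, 2, 4]       # how a human editor ordered the same answers

divergence = spearman(llm_judge_ranks, human_ranks)  # 1.0 = perfect agreement
```

Tracking this number over a stream of evaluations would reveal exactly the kind of systematic disagreement Karpathy stumbled on while reading his book.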
Ultimately, LLM Council acts as a Rorschach test for the AI industry. To the hobbyist, it is a fun way to read books. To the vendor, it is a threat, proof that the core functionality of their products can be replicated in a few hundred lines of code.
But to the enterprise technology leader, it is a reference architecture. It demystifies the orchestration layer, showing that the technical challenge lies not in routing the prompts but in managing the data.
As platform teams prepare for 2026, many will likely be studying Karpathy’s code – not to deploy it, but to understand it. It proves that a multi-model strategy is no longer technically out of reach. The question is whether companies will build the governance layer themselves or pay someone else to wrap the "vibe code" in enterprise-grade armor.