I built Andrej Karpathy’s “LLM Council” on my own hardware, and now no single model gets the last word

I built Andrej Karpathy’s “LLM Council” on my own hardware, and now no single model gets the last word


I have been in an on-and-off relationship with local LLMs. Initially, I tried 1st-gen or 2nd-gen 8B models, but they were not always up to my expectations. I gave up on them and moved back to cloud models. A couple of weeks ago, I gave local LLMs another shot. I even tested three local LLMs side by side on my RTX 4070 Ti and decided to keep one of them as the default. Not because that one single model beat the other two, but because running all three side by side for every single prompt wasn’t practical. Every model had its own strengths, though, so I kept the other two around and reached them whenever the prompt really mattered. That got me thinking: what if I could use all three to get better output? That’s when I came across Karpathy’s LLM Council, initially designed to work across cloud APIs. I liked the idea and decided to rebuild it around local models running on my own hardware.


I built Andrej Karpathy’s “LLM Council” on my own hardware, and now no single model gets the last word


I tested 3 local LLMs on my RTX 4070 Ti for real work — only one earned a permanent spot

12GB VRAM, one model worth keeping.

One answer wasn’t enough

Picking a winner didn’t end the argument

Let’s talk a bit about the comparison I did a couple of weeks ago. My setup was straightforward. I used my RTX 4070 Ti as the base for the local LLMs. Three local LLMs — DeepSeek-R1 8B, Qwen 3.5 9B, and Gemma 4 E4B — were running via Ollama. And for a familiar interface, I used Open WebUI. I used similar sets of instructions for each model and tried to find the best of the three. I ran four different categories of prompts, and each model performed differently.

DeepSeek was good at reasoning, but it hallucinated under pressure. Qwen felt more knowledgeable than the other two, but it was often verbose, burying good answers in length. Finally, Gemma 4 was good at organizing and synthesizing information, but it still wasn’t always correct. So, in the end, I settled with Gemma 4, but whenever a prompt actually mattered, I still navigated to different models for a final decision. I wasn’t looking for more answers for the same question; I was looking for confidence in the one I’d get. And I kept finding that where DeepSeek spotted something Gemma didn’t, and sometimes Qwen mentioned an edge case the others ignored. At some point, I started to act like a judge, comparing all three answers and making a decision.

The friction of manually comparing three long responses each time pushed me toward LLM Council. Karpathy, in his LLM Council, proposed that instead of asking every cloud model the same question, you could group them into a council and let each model defend its answer and critique everyone else’s before one model wrote the final response. The idea instantly clicked, as I was doing the same thing: I was asking all three models the same questions and manually judging each long response. That’s when I decided to at least try the idea; if I could replicate it for my local LLMs, it would be a jackpot — privacy, no API costs, and my own hardware. But making software designed for cloud interfaces work on a single 12GB GPU turned out to be more interesting and more challenging than it looked.

Three models, one GPU, and no idea if this would even work

Built for the cloud, running on one GPU

Karpathy’s idea was a simple three-stage process. Fed the same prompt independently to each model and generated output from each in the first stage simultaneously. Then, in the second stage, each model anonymously reviewed and ranked the others’ answers. Finally, in the last stage, the designated chairman model reviewed everything, produced the final answer, and presented it to the end-user. At first glance, it seemed like the perfect solution — no tab-switching between conversations, no reading three long responses separately, and no manually working out which parts to trust.

But the implementation wasn’t a simple git clone and run — it didn’t just work out of the box. Karpathy’s implementation expected cloud APIs either via OpenRouter or OpenAI-compatible endpoints, and in the first stage, it called every model in parallel. But in my case, since I was working with local models, I swapped those API calls for Ollama’s OpenAI-compatible endpoint. And instead of cloud models, I pointed my three local LLMs at it. In the original implementation, the design already had the chairman double as a council member, so I kept that structure. Gemma 4 became both a councilman and the chairman.

The overall adaptation was straightforward; the code didn’t require major changes because Ollama exposed a compatible API. But the problem first appeared when I ran it. The original version was designed to work with cloud inference simultaneously, meaning it was for cloud providers with dedicated hardware. And when I sent the first request, it ran all three models simultaneously, each with 8B-9B parameters, on my RTX 4070 Ti with 12GB VRAM. That was never going to work. One model silently dropped out — no error, just missing response. Then I tweaked the workflow to process each model sequentially rather than firing requests in parallel.

It took more time to complete each stage, obviously, but at least I could now expect a response from each model. Once it was running reliably, the real question wasn’t whether it worked; it was whether debating three models was actually better than simply asking Gemma directly.

The winner wasn’t a model

Three votes, three winners, zero consensus

Choosing Gemma as a chairman wasn’t a coincidence. I chose it because Gemma had already proved to be the strongest synthesizer. In Karpathy’s original implementation, the chairman model was also one of the council members, so Gemma stayed a full contender and did double duty as chairman. Gemma first generated its own response independently in stage 1, then compared all three in stage 2, and finally gave an answer in stage 3.

I also overhauled the UI to feel like a familiar, modern chat app, with features such as stage color-coding, a street cred leaderboard, a clear separation of stages, and a few UX changes. These small UI-UX changes made it feel like a real workflow; instead of staring at a wall of raw outputs, I could actually follow how the council reached its conclusion. It started to feel more like a polished cloud AI app than a forked GitHub project on my own hardware.

Okay, coming back to the real test. One of the tests involved the Cloudflare Tunnel vs. Pangolin prompt. And the street cred leaderboard produced an unexpected result: a perfect three-way tie. It wasn’t because they agreed. Each model ranked itself first while disagreeing on how to order the others; that’s the real answer.

Technically, they all failed, but that failure meant nothing in the third stage. The chairman simply didn’t care about the rankings; it was only reading the review behind them. DeepSeek produced the most detailed practical deployment details, while Qwen framed the trade-offs in a more approachable way. And Gemma added the structural clarity that tied everything together. The council’s biggest strength wasn’t the leaderboard at all; it was getting them to read each other’s self-assessments and produce something better.


A MacBook air connected to a monitor running DeepSeek-R1 locally


7 things I wish I knew when I started self-hosting LLMs

I’ve been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start.

Synthesis turned out to be the real skill

I learned a few lessons from the whole experiment. Synthesis is a different skill from generation and definitely different from self-grading. I am not saying that the council is smarter than any single local model. But for most tasks, I would still open Gemma. For anything complex enough to need a second opinion, I would reach out to the council. It’s slower, and it’s heavier on a single 12GB GPU than asking one model ever was. Because instead of manually comparing three different responses myself, the evaluation happens even before I see the final answer.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *