I tested a local LLM against a frontier cloud model, and the gap was smaller than I expected

A couple of months ago I wrote about Qwen 3.6 27B running locally on my home server PC and how it was quietly closing in on the cloud. It wasn’t winning across the board, but on everyday work it was close enough that I kept reaching for it instead. So I set up a rematch, this time with questions built to be hard.

The setup was identical on both sides: Qwen 3.6 27B on my home server (llama.cpp, plus SearXNG for web search) against GPT-5.5, same prompts, same tools, all run through Pi. Five sets of questions across different categories, each answer scored on whether it actually solved the problem in front of it.

The gap is smaller than I expected. In one test, with 90,000 tokens of context and zero contamination, the local model gave the better answer. This isn’t local models beating the cloud, because they don’t overall. It’s messier than that, but Qwen gives it some fantastic competition.

Qwen’s long-context retrieval surprised me right out of the gate

It did a stellar job

I started with the test I thought would break it. I exported a git log from a large private repo, nothing either model could have seen in training, pasted the same 90,000-ish tokens into both, and asked:

Does the sitemap system have a cap on paginated items, and does the magazines sitemap added earlier respect it?

Then a follow-up:

What happens to the sitemap system when the limit is reached?

The private repo is the important part of the test, as a public one might sit in either model’s training data, which turns a correct answer into a memory test. This log can’t be memorized. To answer it, a model has to hold distant pieces together: MAX_PAGINATED_ITEMS is defined in one file, the safeFetch pagination loop that uses it lives in another, and the magazines route that calls safeFetch sits in a third, separated by thousands of lines, with none of the relationships spelled out.

Both got the core chain right. Magazines goes through safeFetch, safeFetch is capped by MAX_PAGINATED_ITEMS at 10,000, so magazines inherits the cap. Hit the cap and the loop stops, excess items vanish silently, and a console.warn fires. Clearing that baseline was already impressive. Then Qwen kept going.
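
For readers who want to picture that chain, here is a minimal Python sketch of a capped pagination loop. It is not the repo's code, which is private and evidently JavaScript. Only the constant name, the 10,000 cap, and the stop-and-warn behavior come from the description above; the `safe_fetch` signature, the `fake_magazines_page` helper, the page size, and the item count are hypothetical.

```python
import warnings
from typing import Callable, List

MAX_PAGINATED_ITEMS = 10_000  # cap value as described in the article


def safe_fetch(fetch_page: Callable[[int], List[str]]) -> List[str]:
    """Collect pages until they run out or the cap is reached.

    Hypothetical stand-in for the repo's safeFetch; the signature is assumed.
    """
    items: List[str] = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break  # no more pages
        items.extend(batch)
        if len(items) >= MAX_PAGINATED_ITEMS:
            # Stand-in for console.warn: the caller still gets a normal result.
            warnings.warn("pagination cap reached; sitemap truncated")
            del items[MAX_PAGINATED_ITEMS:]  # excess items are dropped
            break
        page += 1
    return items


def fake_magazines_page(page: int, page_size: int = 500, total: int = 12_345) -> List[str]:
    """Hypothetical data source with more items than the cap allows."""
    start = (page - 1) * page_size
    end = min(start + page_size, total)
    return [f"/magazines/{i}" for i in range(start, end)]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    urls = safe_fetch(fake_magazines_page)

print(len(urls), len(caught))
```

The point of the sketch is that the caller gets an ordinary list back either way, so nothing downstream can tell a truncated result from a complete one unless it reads the warning.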

It described what actually happens past the cap: the system returns a valid 200 with well-formed XML, so search engines see a clean sitemap that’s quietly truncated, with no automatic split into sitemap-magazines-1.xml or anything else. Then it surfaced something I hadn’t asked about but that was very relevant: the podcast page reuses the same MAX_PAGINATED_ITEMS constant in its own loop, with its own warning, instead of going through safeFetch. Qwen didn’t hallucinate that. It simply read enough of its context to spot a relationship I never asked about.

GPT-5.5, meanwhile, gave me my favorite moment of the whole test. With the same bash and file tools Qwen had, it didn’t read the pasted log at all. It ran grep against my actual filesystem, found nothing in the working directory, and returned zero matches. I stopped it and restarted in a clean session. Both had the same tools, but the opposite instinct: the cloud model tried to read the filesystem, whereas the local one just read what was put in front of it.

Both models could research

But one didn’t tunnel vision

The two hardware questions turned out to be mirror images. First:

What GPU should I buy to start running local LLMs, on a tight budget? Be sure to research.

I was intentionally vague with this question by specifying a “tight budget,” as I wanted to see what each model came back with. Both landed on the same card: a used Nvidia RTX 3060 12GB, which, to be fair, is the community’s gold standard for cheap inference. The tie ended there, though. GPT-5.5 named the card, then padded its answer with a stretch list of pricier, CUDA-only options and a closing note about system RAM I hadn’t asked for. Qwen sorted its answer into tiers: under $250, mid-range, and high end, explaining what each VRAM bracket actually buys you in model sizes and quant types. It even surfaced the cheapest path that GPT-5.5’s CUDA-only framing skipped.

In a new session, I changed the line of questioning:

I want to start self-hosting. What’s the cheapest mini PC or SBC I should buy to run a few Docker containers and Home Assistant? Be sure to research.

Here, “cheapest” was a firmer constraint than the vague “tight budget” in the first question. GPT-5.5 took it literally, found the lowest sale price (a Dell Wyse 5070), and threw in an unrequested software-stack walkthrough. Qwen, meanwhile, took “cheapest” to mean “cheapest that will actually run things,” and recommended an Intel N100-based mini PC alongside several other options, including a Raspberry Pi 5. It cited a real source for the N100 pick, and offered to plan the stack based on my choice rather than dumping it on me right away.

To be fair, neither model was really wrong in this test. I personally value Qwen’s answer more than GPT-5.5’s, as someone asking about the cheapest mini PC or SBC for running Docker containers and Home Assistant wants something that will actually, well, run those things. The Wyse 5070 with its Intel Celeron J4105 is significantly outclassed by the N100, but with that said, GPT-5.5 did find the “cheapest” option per the literal meaning of the request.

Neither model invented a fake quant format

But one found where my “error” may have come from

I wrote a hallucination trap to try to catch both models out, and both handled it exceptionally well.

Which Qwen 3.6 27B quant uses the new INT3_K_XL format? Be sure to research.

To be clear, INT3_K_XL doesn’t exist. Qwen 3.6 27B has three official options at that level of quantization: Q3_K_S, Q3_K_M, and Q3_K_L. There’s no XL, and the prefix is Q, not INT. I built the question to bait the models into hallucinating, but neither did.

As hoped, both searched, found nothing, and suggested that I had meant Q3_K_XL. GPT-5.5 stopped there, but Qwen went looking for where my confusion came from and found some rather niche work that I might have seen: DASH-Q INT3 builds of this exact model on Hugging Face, plus Qwen’s own INT3 compression work with custom fused Metal kernels for Apple Silicon. Qwen showed me the real things it suspected my memory had mashed together and explained how I’d glued the wrong prefix to the wrong suffix.

If I’d been genuinely confused, that’s the answer that would have actually fixed it.

The four-sentence test

A simple constraints check

This last one was also all about constraints, to see how both models would handle it.

Explain why dense models like Qwen 3.6 27B can beat larger MoE models on some tasks. Exactly 4 sentences, no sentence over 15 words, don’t use the word ‘parameter.’

Both managed to stay within the constraints perfectly, and GPT-5.5 was tighter on word count, with its longest sentence at 10 words. Qwen was more mechanistic, explaining how dense models activate every weight per token, building richer representations. It then said that MoE’s sparse routing leaves most of the network idle, while unified gradient flow gives smoother training than fragmented expert specialization.
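
Constraints like these are mechanical enough to check with a script rather than by eye. Below is a rough Python sketch, assuming a sentence ends at a period, exclamation mark, or question mark followed by whitespace, and that words are whitespace-separated. The `check_constraints` function and the sample answer are made up for illustration; the sample is not either model's output.

```python
import re
from typing import Dict


def check_constraints(text: str, sentences: int = 4, max_words: int = 15,
                      banned: str = "parameter") -> Dict[str, bool]:
    # Naive split: a sentence ends at ., !, or ? followed by whitespace.
    parts = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "sentence_count": len(parts) == sentences,
        "sentence_length": all(len(s.split()) <= max_words for s in parts),
        # Substring match, so plurals such as "parameters" are caught too.
        "banned_word": banned not in text.lower(),
    }


# Invented sample answer, used only to exercise the checker.
sample = (
    "Dense models use every weight for each token. "
    "Sparse routing leaves most experts idle. "
    "Routing mistakes can send tokens to the wrong expert. "
    "That makes dense models steadier on some tasks."
)
result = check_constraints(sample)
print(result)
```

A checker like this only confirms the format. Whether the four sentences actually explain anything still takes a human read.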

While I would lean towards the answers on this one being a tie, I think Qwen takes the edge here as well. Saying “Activate every weight, creating richer representations” tells you why full activation helps. GPT-5.5 had one strong causal sentence, routing failing and wasting specialized knowledge, but paired it with one that just lists outcomes (“improves coherence, reasoning, and instruction following”) with no mechanism behind them. For “explain why,” Qwen’s answer was just… better.

Where Qwen struggled

It’s not perfect

While you may come out of this thinking Qwen 3.6 27B might actually be the stronger model, it wasn’t a clean victory. For example, Qwen’s GPU pricing was directionally right but not perfectly current, and it overclaimed on which model sizes fit a given VRAM budget at specific quants. Those are the cracks a local model with bolt-on web search still shows, but it manages to stay incredibly competitive.
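
The VRAM-fit claims are the easiest of these to sanity-check, because the core of it is arithmetic: weight size is roughly the parameter count times bits per weight, divided by 8. Here is a back-of-the-envelope Python sketch. The bits-per-weight values and the 2 GB allowance for KV cache and runtime overhead are illustrative assumptions, not measurements; real quantized files vary with the quant mix, and memory use grows with context length.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed overhead must fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# Illustrative bit widths only; actual quant formats land between these.
for bits in (3.0, 4.0, 8.0):
    size = weights_gb(27, bits)
    print(f"27B at {bits} bits/weight: ~{size:.1f} GB, "
          f"fits in 12 GB: {fits(27, bits, 12)}")
```

By this estimate, a 27B model at 4 bits per weight needs about 13.5 GB for weights alone, which is why any claim about what fits on a given card has to name both the quant and the context length.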

A year ago, a local model on consumer hardware competing with a frontier cloud model on anything past a canned benchmark was wishful thinking. When I first tested this a while ago, the surprise was that Qwen was close at all. Now you have to build a question carefully to find the actual gap between them.

It’s worth noting that tool use only helps if a model knows it needs a tool, knows what to look for, and senses when its own knowledge falls short. That instinct follows from breadth of knowledge: you can only feel the edge of what you know if you know enough to find it. My questions made the need to search obvious, which suits Qwen. On more niche problems, where recognizing that you need to look something up takes knowing the field already, GPT-5.5’s broader knowledge will almost certainly give it the sharper instinct, but those are the exception. Most of what people actually ask sits in the day-to-day questions where Qwen already holds its own.

There’s a threshold where a local model becomes good enough that you can stop using the cloud, and more local models are closer to it than people think. For the real work people do with these models, the gap is no longer as wide as it used to be.


