I feel that “narrowing” or “widening” isn’t quite accurate—it’s more like the field is stratifying. But if I absolutely had to pick one, I’d say the gap hasn’t narrowed; if anything, it may have widened.
While the surface-level “benchmark scores” appear to be converging, the deeper “system-level” gap is likely widening out of sight.
Put differently: the gap in static “knowledge capacity” is indeed narrowing, but in dynamic “reasoning ability” and systems engineering, the gap is quietly expanding.
First, we must acknowledge that open-source models (I won’t name specific ones to avoid seeming promotional) now post impressive benchmark results—sometimes even trading blows with GPT, Gemini, and Claude. But there’s a subtle logical trap here that’s easy to overlook. Academia calls this the “illusion of imitation learning.”
Why do I say that? Think about it: when you deploy an open-source model into your own chat application, does it really perform on par with those “Big Three”? Anyone who’s actually tried it knows the answer: no, it doesn’t. Why is that?
Let’s start with funding, because training these models burns cash like nothing else; money is paramount.
According to CB Insights, from 2020 to now, closed-source model developers have raised $37.5 billion in funding, while open-source efforts have only secured $14.9 billion. And that’s just funding—the actual burn rate is even more extreme. OpenAI reported a $5 billion loss in 2024 but projects $11.6 billion in revenue for 2025. This “burn cash first to dominate the market” strategy is simply unaffordable for the open-source ecosystem.
Second, inference details.
The closed-source giants—especially OpenAI with its o1 series—have already changed the game. They’re no longer just competing on “how big the model is,” but on inference-time compute. Take o1, for example: it’s not a single model, but a complex system combining a base model, search strategies, and reinforcement learning verifiers. Before giving you an answer, it might run dozens of invisible “try-fail-correct” cycles in the background. This is impossible for open-source models to replicate, because it’s not just one forward pass of a model—it’s a multi-stage reasoning pipeline.
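To make the “multi-stage pipeline” idea concrete, here is a minimal sketch of inference-time search with a verifier. Everything in it (the generate/verify/revise stubs, the sample counts, the threshold) is a hypothetical illustration of the general technique, not anything OpenAI has disclosed about o1:

```python
# Toy sketch of an inference-time "try-fail-correct" loop with a verifier.
# generate/verify/revise are random stand-ins for a base model, a learned
# verifier, and a correction step; only the control flow is the point.
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    reasoning: str
    answer: str
    score: float = 0.0

def generate(prompt: str, temperature: float) -> Candidate:
    # Stand-in for one forward pass of a base model.
    return Candidate(reasoning=f"chain of thought for {prompt!r}", answer="draft")

def verify(prompt: str, cand: Candidate) -> float:
    # Stand-in for a verifier / process reward model scoring the reasoning.
    return random.random()

def revise(prompt: str, cand: Candidate, critique: str) -> Candidate:
    # Stand-in for asking the model to repair its own reasoning.
    return Candidate(reasoning=cand.reasoning + " (revised)", answer=cand.answer)

def answer_with_search(prompt: str, n_samples: int = 8,
                       max_rounds: int = 3, threshold: float = 0.9) -> Candidate:
    # Sample several independent reasoning chains in parallel.
    candidates = [generate(prompt, temperature=0.8) for _ in range(n_samples)]
    best = None
    for _ in range(max_rounds):
        # Score every chain and keep the strongest one seen so far.
        for c in candidates:
            c.score = verify(prompt, c)
        round_best = max(candidates, key=lambda c: c.score)
        if best is None or round_best.score > best.score:
            best = round_best
        if best.score >= threshold:   # good enough: stop spending compute
            break
        # Otherwise spend more inference-time compute repairing the best chain.
        candidates = [revise(prompt, best, critique="verifier flagged weak steps")
                      for _ in range(max(1, n_samples // 2))]
    return best

print(answer_with_search("prove the claim").score)
```

The point is that answer quality now scales with how much compute is spent per query and how good the verifier is, not just with the weights you can download.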
Look at Google too: aside from preview versions, their current APIs no longer expose the full chain-of-thought (CoT) reasoning traces. And even some preview models have been taken offline. That means these internal mechanics are no longer public—you can’t copy what you can’t see.
Third, data.
You can imitate architectures, reproduce algorithms, and even replicate papers—but how do you get the data?
Closed-source companies hold massive troves of real user interaction data: millions of daily conversations, with explicit signals like thumbs-up or thumbs-down. This creates a powerful data flywheel. I know for a fact that Google and Anthropic’s terms of service include clauses allowing them to use user dialogues for training—meaning they’re already using live conversations for RLHF (Reinforcement Learning from Human Feedback). This advantage is nearly impossible for open-source ecosystems to match.
I haven’t scrutinized OpenAI’s terms closely, so I won’t claim certainty—but haven’t you seen ChatGPT occasionally ask you to choose between two answers? What do you think that’s for?
Every time you say, “No, this code throws an error—you should look in this direction…” you’re essentially doing high-quality, free RLHF for closed-source companies. This stream of real human intent, correction logic, and contextual nuance is something open-source models simply cannot access. As a result, open-source models often score high on “IQ” (e.g., acing exams) but lack “EQ”—they struggle with implied meaning or handling ambiguous, real-world instructions like a human would.
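To illustrate what that flywheel looks like mechanically, here is a rough sketch of turning logged feedback (side-by-side choices, thumbs, user corrections) into preference pairs for reward-model or DPO-style training. The event schema is invented purely for illustration; no vendor’s actual pipeline is this simple:

```python
# Illustrative only: converting logged user feedback into preference pairs.
# The event schema is made up; real pipelines add deduplication, PII scrubbing,
# quality filtering, rater adjudication, and much more.
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    prompt: str
    response_a: str
    response_b: str | None          # present when the UI showed two answers
    signal: str                     # "prefer_a", "prefer_b", "thumbs_up", "thumbs_down"
    user_correction: str | None = None  # e.g. "No, this code throws an error..."

def to_preference_pairs(events: list[FeedbackEvent]) -> list[dict]:
    pairs = []
    for e in events:
        if e.signal in ("prefer_a", "prefer_b") and e.response_b is not None:
            chosen, rejected = ((e.response_a, e.response_b)
                                if e.signal == "prefer_a"
                                else (e.response_b, e.response_a))
            pairs.append({"prompt": e.prompt, "chosen": chosen, "rejected": rejected})
        elif e.signal == "thumbs_down" and e.user_correction:
            # A correction implicitly says "the fixed behaviour beats the original".
            pairs.append({"prompt": e.prompt,
                          "chosen": e.user_correction,
                          "rejected": e.response_a})
    return pairs

events = [
    FeedbackEvent("write a sort", "resp A", "resp B", "prefer_b"),
    FeedbackEvent("fix my query", "resp A", None, "thumbs_down",
                  user_correction="use a LEFT JOIN here instead"),
]
print(to_preference_pairs(events))
```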
Finally—and I think this is the most critical point—engineering barriers.
Today’s large models are no longer just neural network weights; they’re evolving into complex software systems.
Claude 4 uses a hybrid reasoning approach. On certain benchmarks, it enables an “extended thinking” mode and is explicitly encouraged to write out its reasoning process across multiple dialogue turns.
GPT-5 uses a routing system that automatically switches between a “fast, non-reasoning mode” and a “deep reasoning mode.” Users see only one GPT-5 option, but behind the scenes, different sub-models are dynamically allocated.
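The routing idea itself is easy to sketch; the hard part is the trained router and the serving capacity behind it. In the toy version below, the heuristics and model names are placeholders, since nothing about the real router is public:

```python
# Toy request router picking a fast model or a deep-reasoning model per request.
# The difficulty heuristic and model names are placeholders for illustration;
# the real system reportedly uses a trained router whose details are not public.
def estimate_difficulty(prompt: str) -> float:
    # Placeholder heuristic: long prompts and "reasoning-ish" keywords
    # push the request toward the deep-reasoning path.
    keywords = ("prove", "debug", "step by step", "optimize", "why")
    score = min(len(prompt) / 2000, 1.0)
    score += 0.2 * sum(k in prompt.lower() for k in keywords)
    return min(score, 1.0)

def route(prompt: str, latency_budget_s: float) -> str:
    difficulty = estimate_difficulty(prompt)
    # Tight latency budgets or easy prompts go to the cheap, fast path.
    if latency_budget_s < 2.0 or difficulty < 0.4:
        return "fast-non-reasoning-model"
    return "deep-reasoning-model"

print(route("Summarize this paragraph.", latency_budget_s=1.0))
print(route("Prove this invariant and debug the failing case step by step.",
            latency_budget_s=30.0))
```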
Google’s Gemini 1.5 Pro handles 2 million—even up to 10 million—tokens. This isn’t just about model architecture; it relies on engineering marvels like Ring Attention, which enables distributed computation across thousands of TPUs with ultra-low latency interconnects. And remember—that’s just Gemini 1.5.
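The core math behind that kind of long-context engineering is blockwise attention with an online softmax: keys and values are split into blocks that rotate around a ring of devices, and each device folds incoming blocks into running statistics, so the full attention matrix never has to exist anywhere. The NumPy simulation below runs that math on a single machine as a sketch; the distributed part (sharding across TPUs, overlapping compute with communication) is exactly what is not shown and is the hard part:

```python
# Single-machine NumPy simulation of the blockwise / online-softmax math behind
# Ring Attention. In the real system each block lives on a different device and
# K/V blocks rotate around a ring while compute overlaps communication; here we
# simply loop over blocks to show why the full attention matrix is never built.
import numpy as np

def ring_attention_sim(q, k, v, n_blocks=4):
    d = q.shape[-1]
    k_blocks = np.array_split(k, n_blocks)
    v_blocks = np.array_split(v, n_blocks)

    out = np.zeros_like(q)                      # running weighted sum of values
    running_max = np.full(q.shape[0], -np.inf)  # running max of logits (stability)
    denom = np.zeros(q.shape[0])                # running softmax denominator

    for kb, vb in zip(k_blocks, v_blocks):      # one "hop" around the ring per block
        logits = q @ kb.T / np.sqrt(d)          # scores against this block only
        new_max = np.maximum(running_max, logits.max(axis=-1))
        # Rescale previous accumulators to the new max, then fold in this block.
        scale = np.exp(running_max - new_max)
        p = np.exp(logits - new_max[:, None])
        denom = denom * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vb
        running_max = new_max

    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 64, 16))          # 64 tokens, head dim 16

# Reference: ordinary full attention, for comparison.
scores = q @ k.T / np.sqrt(16)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(np.allclose(ring_attention_sim(q, k, v), weights @ v))   # True: same result
```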
Even if you were given the full model code, without access to a massive GPU/TPU cluster with specialized low-latency networking, you simply couldn’t run it. This physical-layer gap objectively raises the ceiling for cutting-edge capabilities. Can the open-source community replicate these system-level innovations? Clearly not.
So I’d argue we’re seeing a two-tier stratification: frontier models are dominated by well-funded closed-source players (OpenAI, Anthropic, Google—the Big Three), while smaller, use-case-specific or edge-deployable models are supported by a growing open-source ecosystem.
In conclusion:
Evaluating new LLMs is becoming increasingly difficult, and no clear new methodology has emerged. Look at how the LLM Leaderboard has already shut down because it lost relevance. The “catch-up” by open-source models may be an illusion: even with strong benchmark performance, their real-world shortcomings are likely underestimated.
What we’re witnessing might be surface convergence but deep divergence: benchmark gaps are narrowing, but gaps in systems engineering, data flywheels, and productization depth are likely widening.
And due to the iceberg effect, the parity we observe between open and closed models may just be the tip. The true capabilities of closed-source systems—multi-model coordination, real-time learning, complex reasoning pipelines—remain largely undisclosed. (Anthropic has even stated they hold back their largest models, releasing only distilled, slightly smaller versions of their strongest systems.)
This situation reminds me of the early smartphone era: open platforms quickly democratized the baseline experience, while tightly integrated closed ecosystems kept defining the high end.
If your needs are “above baseline”—writing an email, summarizing a document, generating routine code—the gap between open and closed models is negligible, and open-source often offers far better cost efficiency.
But if you require extremely complex logical reasoning, long-chain scientific exploration, or pinpoint accuracy in million-token contexts, those undisclosed reasoning strategies, data flywheels, and engineering architectures mean closed-source models are pulling away—fast.
The “narrowing gap” we see is mostly the democratization of baseline capabilities. But at the true frontier—the “no man’s land” of AI—walls are being built higher, not lower. |