I feel that “narrowing” or “widening” isn’t quite accurate—it’s more like the field is stratifying. But if I absolutely had to pick one, I’d say the gap hasn’t narrowed; if anything, it may have widened.
While the surface-level “benchmark scores” appear to be converging, the deeper “system-level” gap is likely widening out of sight.
Put differently: the gap in static “knowledge capacity” is indeed narrowing, but in dynamic “reasoning ability” and systems engineering, the gap is quietly expanding.
First, we must acknowledge that open-source models (I won’t name specific ones to avoid seeming promotional) now post impressive benchmark results—sometimes even trading blows with GPT, Gemini, and Claude. But there’s a subtle logical trap here that’s easy to overlook. Academia calls this the “illusion of imitation learning.”
Why do I say that? Think about it: when you deploy an open-source model into your own chat application, does it really perform on par with those “Big Three”? Anyone who’s actually tried it knows the answer: no, it doesn’t. Why is that?
Let’s start with funding, because training these models burns cash like nothing else; money is paramount.
According to CB Insights, from 2020 to now, closed-source model developers have raised $37.5 billion in funding, while open-source efforts have only secured $14.9 billion. And that’s just funding—the actual burn rate is even more extreme. OpenAI reported a $5 billion loss in 2024 but projects $11.6 billion in revenue for 2025. This “burn cash first to dominate the market” strategy is simply unaffordable for the open-source ecosystem.
Second, inference details.
The closed-source giants—especially OpenAI with its o1 series—have already changed the game. They’re no longer just competing on “how big the model is,” but on inference-time compute. Take o1, for example: it’s not a single model, but a complex system combining a base model, search strategies, and reinforcement learning verifiers. Before giving you an answer, it might run dozens of invisible “try-fail-correct” cycles in the background. This is impossible for open-source models to replicate, because it’s not just one forward pass of a model—it’s a multi-stage reasoning pipeline.
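To make the “multi-stage pipeline” idea concrete, here is a minimal sketch of inference-time search with a verifier. Everything in it (the generate/verify/revise stubs, the sample counts, the threshold) is a hypothetical illustration of the general technique, not anything OpenAI has disclosed about o1:

```python
# Toy sketch of an inference-time "try-fail-correct" loop with a verifier.
# generate/verify/revise are random stand-ins for a base model, a learned
# verifier, and a correction step; only the control flow is the point.
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    reasoning: str
    answer: str
    score: float = 0.0

def generate(prompt: str, temperature: float) -> Candidate:
    # Stand-in for one forward pass of a base model.
    return Candidate(reasoning=f"chain of thought for {prompt!r}", answer="draft")

def verify(prompt: str, cand: Candidate) -> float:
    # Stand-in for a verifier / process reward model scoring the reasoning.
    return random.random()

def revise(prompt: str, cand: Candidate, critique: str) -> Candidate:
    # Stand-in for asking the model to repair its own reasoning.
    return Candidate(reasoning=cand.reasoning + " (revised)", answer=cand.answer)

def answer_with_search(prompt: str, n_samples: int = 8,
                       max_rounds: int = 3, threshold: float = 0.9) -> Candidate:
    # Sample several independent reasoning chains in parallel.
    candidates = [generate(prompt, temperature=0.8) for _ in range(n_samples)]
    best = None
    for _ in range(max_rounds):
        # Score every chain and keep the strongest one seen so far.
        for c in candidates:
            c.score = verify(prompt, c)
        round_best = max(candidates, key=lambda c: c.score)
        if best is None or round_best.score > best.score:
            best = round_best
        if best.score >= threshold:   # good enough: stop spending compute
            break
        # Otherwise spend more inference-time compute repairing the best chain.
        candidates = [revise(prompt, best, critique="verifier flagged weak steps")
                      for _ in range(max(1, n_samples // 2))]
    return best

print(answer_with_search("prove the claim").score)
```

The point is that answer quality now scales with how much compute is spent per query and how good the verifier is, not just with the weights you can download.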
Look at Google too: aside from preview versions, their current APIs no longer expose the full chain-of-thought (CoT) reasoning traces. And even some preview models have been taken offline. That means these internal mechanics are no longer public—you can’t copy what you can’t see.
Third, data.
You can imitate architectures, reproduce algorithms, and even replicate papers—but how do you get the data?
Closed-source companies hold massive troves of real user interaction data: millions of daily conversations, with explicit signals like thumbs-up or thumbs-down. This creates a powerful data flywheel. I know for a fact that Google and Anthropic’s terms of service include clauses allowing them to use user dialogues for training—meaning they’re already using live conversations for RLHF (Reinforcement Learning from Human Feedback). This advantage is nearly impossible for open-source ecosystems to match.
I haven’t scrutinized OpenAI’s terms closely, so I won’t claim certainty—but haven’t you seen ChatGPT occasionally ask you to choose between two answers? What do you think that’s for?
Every time you say, “No, this code throws an error—you should look in this direction…” you’re essentially doing high-quality, free RLHF for closed-source companies. This stream of real human intent, correction logic, and contextual nuance is something open-source models simply cannot access. As a result, open-source models often score high on “IQ” (e.g., acing exams) but lack “EQ”—they struggle with implied meaning or handling ambiguous, real-world instructions like a human would.
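To illustrate what that flywheel looks like mechanically, here is a rough sketch of turning logged feedback (side-by-side choices, thumbs, user corrections) into preference pairs for reward-model or DPO-style training. The event schema is invented purely for illustration; no vendor’s actual pipeline is this simple:

```python
# Illustrative only: converting logged user feedback into preference pairs.
# The event schema is made up; real pipelines add deduplication, PII scrubbing,
# quality filtering, rater adjudication, and much more.
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    prompt: str
    response_a: str
    response_b: str | None          # present when the UI showed two answers
    signal: str                     # "prefer_a", "prefer_b", "thumbs_up", "thumbs_down"
    user_correction: str | None = None  # e.g. "No, this code throws an error..."

def to_preference_pairs(events: list[FeedbackEvent]) -> list[dict]:
    pairs = []
    for e in events:
        if e.signal in ("prefer_a", "prefer_b") and e.response_b is not None:
            chosen, rejected = ((e.response_a, e.response_b)
                                if e.signal == "prefer_a"
                                else (e.response_b, e.response_a))
            pairs.append({"prompt": e.prompt, "chosen": chosen, "rejected": rejected})
        elif e.signal == "thumbs_down" and e.user_correction:
            # A correction implicitly says "the fixed behaviour beats the original".
            pairs.append({"prompt": e.prompt,
                          "chosen": e.user_correction,
                          "rejected": e.response_a})
    return pairs

events = [
    FeedbackEvent("write a sort", "resp A", "resp B", "prefer_b"),
    FeedbackEvent("fix my query", "resp A", None, "thumbs_down",
                  user_correction="use a LEFT JOIN here instead"),
]
print(to_preference_pairs(events))
```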
Finally—and I think this is the most critical point—engineering barriers.
Today’s large models are no longer just neural network weights; they’re evolving into complex software systems.
Claude 4 uses a hybrid reasoning approach. On certain benchmarks, it enables an “extended thinking” mode and is explicitly encouraged to write out its reasoning process across multiple dialogue turns.
GPT-5 uses a routing system that automatically switches between a “fast, non-reasoning mode” and a “deep reasoning mode.” Users see only one GPT-5 option, but behind the scenes, different sub-models are dynamically allocated.
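The routing idea itself is easy to sketch; the hard part is the trained router and the serving capacity behind it. In the toy version below, the heuristics and model names are placeholders, since nothing about the real router is public:

```python
# Toy request router picking a fast model or a deep-reasoning model per request.
# The difficulty heuristic and model names are placeholders for illustration;
# the real system reportedly uses a trained router whose details are not public.
def estimate_difficulty(prompt: str) -> float:
    # Placeholder heuristic: long prompts and "reasoning-ish" keywords
    # push the request toward the deep-reasoning path.
    keywords = ("prove", "debug", "step by step", "optimize", "why")
    score = min(len(prompt) / 2000, 1.0)
    score += 0.2 * sum(k in prompt.lower() for k in keywords)
    return min(score, 1.0)

def route(prompt: str, latency_budget_s: float) -> str:
    difficulty = estimate_difficulty(prompt)
    # Tight latency budgets or easy prompts go to the cheap, fast path.
    if latency_budget_s < 2.0 or difficulty < 0.4:
        return "fast-non-reasoning-model"
    return "deep-reasoning-model"

print(route("Summarize this paragraph.", latency_budget_s=1.0))
print(route("Prove this invariant and debug the failing case step by step.",
            latency_budget_s=30.0))
```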
Google’s Gemini 1.5 Pro handles 2 million—even up to 10 million—tokens. This isn’t just about model architecture; it relies on engineering marvels like Ring Attention, which enables distributed computation across thousands of TPUs with ultra-low latency interconnects. And remember—that’s just Gemini 1.5.
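The core math behind that kind of long-context engineering is blockwise attention with an online softmax: keys and values are split into blocks that rotate around a ring of devices, and each device folds incoming blocks into running statistics, so the full attention matrix never has to exist anywhere. The NumPy simulation below runs that math on a single machine as a sketch; the distributed part (sharding across TPUs, overlapping compute with communication) is exactly what is not shown and is the hard part:

```python
# Single-machine NumPy simulation of the blockwise / online-softmax math behind
# Ring Attention. In the real system each block lives on a different device and
# K/V blocks rotate around a ring while compute overlaps communication; here we
# simply loop over blocks to show why the full attention matrix is never built.
import numpy as np

def ring_attention_sim(q, k, v, n_blocks=4):
    d = q.shape[-1]
    k_blocks = np.array_split(k, n_blocks)
    v_blocks = np.array_split(v, n_blocks)

    out = np.zeros_like(q)                      # running weighted sum of values
    running_max = np.full(q.shape[0], -np.inf)  # running max of logits (stability)
    denom = np.zeros(q.shape[0])                # running softmax denominator

    for kb, vb in zip(k_blocks, v_blocks):      # one "hop" around the ring per block
        logits = q @ kb.T / np.sqrt(d)          # scores against this block only
        new_max = np.maximum(running_max, logits.max(axis=-1))
        # Rescale previous accumulators to the new max, then fold in this block.
        scale = np.exp(running_max - new_max)
        p = np.exp(logits - new_max[:, None])
        denom = denom * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vb
        running_max = new_max

    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 64, 16))          # 64 tokens, head dim 16

# Reference: ordinary full attention, for comparison.
scores = q @ k.T / np.sqrt(16)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(np.allclose(ring_attention_sim(q, k, v), weights @ v))   # True: same result
```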
Even if you were given the full model code, without access to a massive GPU/TPU cluster with specialized low-latency networking, you simply couldn’t run it. This physical-layer gap objectively raises the ceiling for cutting-edge capabilities. Can the open-source community replicate these system-level innovations? Clearly not.
So I’d argue we’re seeing a two-tier stratification: frontier models are dominated by well-funded closed-source players (OpenAI, Anthropic, Google—the Big Three), while smaller, use-case-specific or edge-deployable models are supported by a growing open-source ecosystem.
In conclusion:
Evaluating new LLMs is becoming increasingly difficult, and no clear new methodology has emerged. Look at how the LLM Leaderboard has already shut down because it lost relevance. The “catch-up” by open-source models may be an illusion: even with strong benchmark performance, their real-world shortcomings are likely underestimated.
What we’re witnessing might be surface convergence but deep divergence: benchmark gaps are narrowing, but gaps in systems engineering, data flywheels, and productization depth are likely widening.
And due to the iceberg effect, the parity we observe between open and closed models may just be the tip. The true capabilities of closed-source systems—multi-model coordination, real-time learning, complex reasoning pipelines—remain largely undisclosed. (Anthropic has even stated they hold back their largest models, releasing only distilled, slightly smaller versions of their strongest systems.)
This situation reminds me of the early smartphone era: open platforms quickly democratized the baseline experience, while tightly integrated closed ecosystems kept defining the high end.
If your needs are “above baseline”—writing an email, summarizing a document, generating routine code—the gap between open and closed models is negligible, and open-source often offers far better cost efficiency.
But if you require extremely complex logical reasoning, long-chain scientific exploration, or pinpoint accuracy in million-token contexts, those undisclosed reasoning strategies, data flywheels, and engineering architectures mean closed-source models are pulling away—fast.
The “narrowing gap” we see is mostly the democratization of baseline capabilities. But at the true frontier—the “no man’s land” of AI—walls are being built higher, not lower. |