AI Coding at a Crossroads: Spec Is Eating Human Coding, Agents Reinventing the Wheel Are Dragging Down Efficiency, and With Token Costs Spiraling Out of Control, Context Engineering Becomes the Decisive Battleground
The AI coding ecosystem in 2025 is already defining a new role for programmers in 2026. The answer may be hidden in a pile of smoking Markdown files.
Over the past six months, Spec-driven development has exploded in popularity. Repositories are rapidly filling with layer upon layer of “Markdown scaffolding” designed specifically for Agents. It’s hailed as the cutting-edge solution to AI coding: using a contract to force Agents to actually deliver.
But here’s the question: can this “contract” truly handle the decades of accumulated complexity in software engineering? Or will the ultimate value of programmers shift from “writing code” to “defining rules”—using natural language that AI can understand to tame this technological revolution?
The Ceiling of Code Completion and the Inevitable Rise of Agents
The evolution of AI coding has clearly split into two eras.
The first wave was pioneered by Copilot and Cursor: a human-led programming paradigm where AI’s role was to predict the “next token” or “next edit location,” boosting speed and fluency within a local scope.
The boundaries of this paradigm are actually quite clear. For completion to feel seamless and not disrupt flow, end-to-end latency must be tightly constrained to a few hundred milliseconds. This inherently limits model size and context length—models can’t be too large, and full-project context is simply impractical.
Meanwhile, the scope of completion keeps expanding—from single-line predictions to cross-function, cross-file generation, rewriting, and even local refactoring. While the user experience still has room for improvement, expecting an AI to grasp global intent, project constraints, and dependencies within such tight timeframes is already pushing engineering limits. It places extreme demands on post-training, context selection, inference strategies, and the entire engineering pipeline.
The second wave—especially over the past 6–12 months—has brought a true paradigm shift: the rise of Agents. No longer limited to predicting the “next token,” Agents now take full ownership of tasks—from requirement analysis and code generation to tool invocation and result validation.
Compared to Agents, the completion paradigm suffers from narrow modification scope and high cognitive load on developers. As models and toolchains mature, Agents will increasingly cover more stages from requirement to delivery, becoming the dominant workflow. In Agent-centric scenarios, completion may recede into the background, serving as a low-level capability that supports fine-grained Agent execution.
TRAE core developer Tianzhu (a.k.a. “Tian Pig”) points out that this doesn’t mean the completion paradigm has hit a technical ceiling. On one hand, many developers still enjoy the act of “writing code themselves,” and in those contexts, the completion experience still has significant room for refinement. More importantly, from a capability perspective, completion always solves the same problem: given a context, predict the most reasonable next editing action. Historically, this assisted human coding; in an Agent system, it can equally assist AI’s own execution. For example, generating parameters for tool calls or populating Agent chat panels are essentially different forms of “completion scenarios.”
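To make that point concrete, here is a minimal Python sketch. The `complete()` helper and its request shape are hypothetical, not any product's API; the point is only that the same next-edit prediction primitive can serve a human in the editor and an Agent filling in tool-call arguments.

```python
from dataclasses import dataclass

@dataclass
class CompletionRequest:
    prefix: str          # text before the cursor, or before the hole to fill
    suffix: str = ""     # text after it, for fill-in-the-middle prediction
    max_tokens: int = 64

def complete(req: CompletionRequest) -> str:
    # Placeholder: a real implementation would call the tool's completion model here.
    return "<model-predicted text>"

# 1) Human-facing use: predict the next edit in the editor buffer.
next_edit = complete(CompletionRequest(prefix="def parse_config(path):\n    "))

# 2) Agent-facing use: fill in the arguments of a tool call the planner has already chosen.
tool_args = complete(CompletionRequest(
    prefix='{"tool": "run_tests", "args": {"path": "',
    suffix='"}}',
))
```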
Another interesting trend this year: nearly all leading coding tools are evolving toward a tripartite capability model—IDE, CLI, and Cloud. Many products start with one form but quickly expand into the other two, because users don’t need a specific interface—they need a complete workflow that delivers results across contexts. This helps explain the “origin” and character of representative tools: Claude Code began as CLI-native and thus excels there; OpenAI Codex originated in the cloud; Cursor started in the IDE and remains one of the biggest players in that space.
Notably, CLI and Cloud Agents have been Agent-first from day one. They demand less UI sophistication—operating either in terminals or simplified web interfaces—and rely on GitHub PRs for collaboration and delivery.
Yet Tianzhu believes the IDE will remain the most widely used entry point, simply because it aligns best with developers’ long-established workflows. Early in his team’s practice, he realized that disruptive innovation in professional productivity tools often requires a full reshaping of developer cognition and work patterns. In his view, the IDE itself may undergo fundamental transformation within three years—no longer centered around the editor. TRAE’s SOLO mode and Cursor’s Agent mode are early industry experiments along this path.
In plain terms: the IDE is shifting from a “toolbox for humans” to a “shared workspace for humans and AI.” Many human-centric features in traditional IDEs are now being decomposed into smaller, clearer, AI-friendly tools that Agents can invoke on demand. Thus, the IDE is evolving into a capability container and execution environment for human-Agent collaboration.
All three paths—IDE, CLI, and Cloud—are converging toward agentic behavior.
The IDE evolves the collaborative experience between human and Agent; CLI strengthens Agent capabilities in engineering automation and pipelines; Cloud Agents extend the temporal and spatial boundaries of development collaboration.
Different forms, same goal: an Agent-dominant paradigm. Under this model, everyone’s core requirements converge: correct tool usage, long-horizon task stability, and continuous correction based on feedback. Therefore, Coding Agent capability is fundamentally about long-horizon stability and tool orchestration.
As execution shifts from human to Agent, the decades of implicit complexity in software engineering—previously held together by experience and tacit understanding—must now be made explicit upfront. And it is at this moment that Spec is urgently recalled.
Can Spec Really Solve AI Coding’s Problems?
From the moment the term “Spec” went viral to today, only a few months have passed—but an awkward yet unavoidable reality has surfaced: when people say “Spec,” they’re no longer talking about the same thing.
Some equate Spec with better prompts; others see it as more detailed product requirement documents or architecture designs. But for many engineering teams, Spec simply means “adding a few more Markdown files while coding.”
Thus, repositories quickly fill with gemini.md, claude.md, agent.md, cursor-rules, various Skills, plus GitHub config files. Over recent months, major platforms have rolled out “reusable context frameworks”: Claude Skills, Cursor Team Rules, GitHub Copilot Spaces, alongside third-party systems like Tessl and BMad Method (BMM). The toolchain has exploded in just one year, spawning a new class of infrastructure primitives.
Many teams report an intuitive insight: the issue isn’t a lack of Spec—it’s a lack of Context. Some have even conflated the two, declaring “Spec is Context Engineering” or “Spec-Driven Development equals Context Engineering.”
But leading Chinese tooling teams tend to disagree. In their view, Spec is a critical and stable subset of context—a “directive context” that clearly articulates goals, constraints, and acceptance criteria, effectively giving the Agent an executable contract: what to build and what “done” looks like.
Under this division, Spec answers “what are we building?” while Context Engineering ensures “does the model have enough information right now?” Spec doesn’t automatically become effective context, but it often serves as the long-term source of high-quality context. The two are deeply coupled but not interchangeable.
Therefore, Spec shouldn’t be confined to fixed document formats. More accurately, Spec is the sum of all contracts guiding code generation: product docs, design mockups, API definitions, boundary conditions, acceptance criteria, and execution plans—all can be part of a Spec system, just at different stages and granularities.
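As a minimal sketch of that idea, the hypothetical `Spec` class below (not any tool's actual format) captures the three pieces such a contract keeps stable: the goal, the constraints, and the acceptance criteria, rendered into the long-lived slice of an Agent's context.

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    """A minimal 'directive context' contract: what to build, and what 'done' means."""
    goal: str                                             # what are we building?
    constraints: list[str] = field(default_factory=list)  # boundaries the Agent must respect
    acceptance: list[str] = field(default_factory=list)   # checks that define "done"

    def as_context(self) -> str:
        # Render the stable, long-lived part of the context handed to the Agent.
        return "\n".join([
            "# Goal", self.goal,
            "# Constraints", *self.constraints,
            "# Acceptance criteria", *self.acceptance,
        ])

spec = Spec(
    goal="Add rate limiting to the public /search endpoint",
    constraints=["Reuse the existing middleware stack; do not add new dependencies"],
    acceptance=["429 returned above 100 requests/minute per API key", "Existing tests still pass"],
)
```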
Yet precisely because Spec is “broad in scope, diverse in form, and long in lifecycle,” it resists standardization.
In this wave of Spec-driven development, Kiro is often cited as a key driver. Al Harris, Kiro’s tech lead, shared publicly that his team experimented with roughly seven different Spec implementations—from ephemeral spec and layered spec to TDD-based spec—essentially “adding ‘spec’ as a suffix to everything.” Ultimately, they were answering three questions: when should a Spec be finalized, how detailed should it be, and how can it stay aligned through iterations?
He emphasized that their Spec-driven approach is still evolving, with the ultimate goal of covering the entire SDLC—integrating requirements, design, task breakdown, and verification mechanisms to restore the rigor of traditional software engineering to AI development.
This brings us to a core question: can Spec absorb the complexity accumulated over decades of software engineering?
According to Huang Guangmin, product lead at CodeBuddy, Spec standards are essentially the materialization of software engineering theory within AI programming tools.
But the problem is this: software engineering theory, despite decades of development, has never converged on a universal standard in practice. Thus, different Spec variants represent different trade-offs (e.g., flexibility vs. rigor), and their optimal granularity varies by task.
He argues that Spec effectiveness is inherently scenario-dependent, because a Spec uses documentation and structure to trade off three things: correctness, efficiency, and maintenance cost. Since different scenarios weight these differently, no single standard will emerge; instead, multiple widely accepted forms will coexist.
Is Spec a Return to the “Deprecated” Waterfall Model?
Software engineering is inherently a complex, uncertain system. In long-horizon tasks, large models hallucinate, forget, and drift from goals. Without corrective mechanisms, Agents easily veer off course, causing rework costs to balloon. This is why Spec feels appealing again—it attempts to fix key goals, constraints, and acceptance criteria upfront.
But controversy follows. One agile practitioner bluntly stated that Spec-Driven Development (SDD) is heading in the wrong direction. To him, it tries to solve a problem already proven unsolvable: removing developers from the software process. In this vision, programming Agents replace developers, guided by meticulous planning—mirroring the waterfall model’s demand for exhaustive documentation before coding begins.
Yet as the classic paper “No Silver Bullet” argued, planning cannot eliminate uncertainty in software development. Agile methods long ago abandoned heavy upfront documentation. So, is AI coding ushering in a “waterfall regression”?
From an engineering perspective, the real question isn’t “should we write everything down?” but “which parts deserve structuring?” Huang Guangmin clarifies that Spec Coding doesn’t aim to structure a developer’s entire thought process, but rather the parts most prone to error in long-horizon tasks—and most worth verifying and preserving.
The industry is still exploring Spec. A more reasonable form is a “living contract”—not a static, one-time document, but a key intermediate state in a Plan-Execute loop. Good Spec-driven practice isn’t “write a perfect Spec first, then code,” but rather using Spec to clarify correctness criteria, then continuously aligning Spec and code through inference-execution-feedback cycles. “This actually reflects engineering reality better than traditional development: requirements change, constraints evolve, implementations shift—the key is making these changes traceable, verifiable, and reversible.”
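A rough sketch of that loop, continuing the hypothetical `Spec` class from the earlier example: `plan`, `execute`, and `verify` are stand-ins for whatever planner, Agent run, and acceptance checks a real tool would wire in. The point is that the Spec is amended from feedback rather than frozen upfront.

```python
def plan(spec: Spec) -> list[str]:
    # Hypothetical planner: a real one would ask the model to break the Spec into tasks.
    return [f"Implement: {spec.goal}"]

def execute(task: str) -> str:
    # Hypothetical executor: a real one would run the Agent and return a diff summary.
    return f"diff for {task!r}"

def verify(spec: Spec, result: str) -> tuple[bool, str]:
    # Hypothetical verifier: a real one would run the acceptance checks (tests, linters).
    return True, ""

def run(spec: Spec, max_rounds: int = 5) -> None:
    # The Spec is not a one-shot document: every round may amend it based on what
    # execution and verification revealed, keeping changes traceable and reversible.
    for _ in range(max_rounds):
        for task in plan(spec):
            result = execute(task)
            ok, feedback = verify(spec, result)
            if not ok:
                spec.acceptance.append(f"Regression check: {feedback}")
                break
        else:
            return  # every task verified against the current Spec
```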
Zooming out further, this discussion naturally leads to a longstanding theme in software engineering. Tianzhu noted that some AI coding explorations reminded him of an early, unfulfilled ambition: is there a sufficiently complete, verifiable description that can clearly define how a system operates and be reliably reproduced?
In this vision, Spec transcends being mere pre-code documentation or process logs—it’s elevated to a higher plane, potentially evolving into an engineering artifact that’s more abstract and stable than code itself.
Viewed historically, software engineering has progressed from binary to assembly, high-level languages, DSLs, and declarative configurations—each step raising the abstraction level for expressing human intent. Along this trajectory, Spec represents the next attempted leap: abstraction at the natural language layer.
Of course, this path is arduous. The ambiguity of natural language makes direct engineering difficult, ensuring this remains a challenging, uncharted exploration—not a proven or disproven conclusion, but a direction the industry is actively probing.
Software Abstraction: Why Agents Keep Reinventing the Wheel
A long-standing, widely criticized issue in AI coding is this: Coding Agents strongly prefer “implementing features from scratch” rather than reusing mature libraries.
On one hand, the field explores higher-level abstractions; on the other, it bypasses decades of software engineering’s reuse ecosystem—especially battle-tested, optimized libraries and interfaces. Real-world development rarely starts greenfield; it usually involves modifying or extending existing applications using open-source libraries.
But for models, “writing a working version from scratch” is often the lowest-risk path. When uncertain about a library’s version, usage, or boundaries, falling back to self-implementation is almost inevitable. First, pretraining corpora unevenly cover libraries; second, alignment-stage rewards favor “runs correctly” over “prioritizes reuse of existing abstractions.”
Add to this the timeliness problem: libraries update rapidly, APIs change frequently, and documentation is often missing or contradictory. Even if users specify a library, unless critical details are accurately included in context, models may still misuse it.
But this isn’t unsolvable. The key isn’t constant manual correction or micromanagement of Agents, but enriching their reliable information sources. As Huang Guangmin emphasizes, instead of “reminding” at the output stage, prepare information at execution time: use MCP tools like Context7 to supply versions, usage examples, and best practices, then inject correct usage via “progressive disclosure.” For internal component libraries, systematically document examples, boundaries, and patterns so Agents can stably reuse them—instead of constantly reinventing the wheel.
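A sketch of that "prepare at execution time" idea in Python. `fetch_library_docs` is a hypothetical stand-in for whatever a docs-oriented MCP tool (Context7 or an internal equivalent) would return; the only assumption is that version-accurate usage notes can be pulled per topic and injected just before the Agent acts.

```python
def fetch_library_docs(package: str, version: str, topic: str) -> str:
    # Hypothetical stand-in for a docs MCP tool: returns version-accurate
    # usage notes scoped to one narrow topic, not the whole manual.
    return f"{package}=={version}: usage notes for {topic}"

def build_task_context(task: str, deps: dict[str, str]) -> str:
    """Progressive disclosure: inject only the docs this task needs, at execution
    time, instead of 'reminding' the Agent after it has already answered."""
    relevant = [fetch_library_docs(pkg, ver, topic=task) for pkg, ver in deps.items()]
    return "\n\n".join(["## Task", task, "## Library usage (pinned versions)", *relevant])

ctx = build_task_context(
    task="add retry with exponential backoff to the HTTP client",
    deps={"httpx": "0.27.0"},  # versions read from the lockfile, not guessed by the model
)
```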
When new abstractions aren’t mature, old ones are bypassed, and context keeps growing, all these issues converge at one point: runtime cost.
The Rise of Token Engineering: Why Costs Suddenly Spiraled Out of Control
When did Tokens start hurting? Not when you exhausted your free tier, but when you realized they had ceased to be mere "conversation consumption" and had become a core variable, dictating pricing and product strategy and even forcing platforms to change the rules.
Two events this year brought this to the forefront almost simultaneously.
First, Cursor: users on relatively cheap plans found ways to generate far more usage than expected, breaching cost boundaries. Within a year, Cursor raised prices five times and cut features repeatedly to stem losses.
Second, Claude Code’s Token leaderboard: the global top user burned 7.7 billion tokens in 30 days on a $200 plan, the equivalent of a $50,000 bill. This “usage ranking” forced Anthropic to impose rate limits on paying users.
Both incidents reveal the same truth: Tokens have shifted from a “billing unit” to a lifeline.
Why did the Token problem suddenly become an order of magnitude more complex in 2025?
The root cause isn’t “models got more expensive,” but paradigm shift: LLM applications are moving from “Q&A” to “Agents getting things done.” When models are tasked with “completing a job,” Token cost is no longer per input-output—it’s the full lifecycle cost across reasoning-execution-feedback loops.
The biggest change: hidden costs from tool calls now dominate.
Completing a task often requires multi-turn dialogues, each involving dozens to hundreds of tool invocations. This means the context contains a large amount of reusable, stable information—creating opportunities for caching and reuse. But pricing rules vary across platforms: different models charge differently for input, output, cache hits, and hidden tool-call overhead, amplifying the complexity of Token economics.
Early on, costs came mainly from dialogue itself. Now, retrieval results, file diffs, terminal logs—much of this tool output is meant for the next model turn, repeatedly fed back into context, becoming a major new cost source.
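A back-of-the-envelope sketch of that lifecycle cost. The per-million-token rates below are made-up placeholders, not any vendor's pricing; the point is that uncached input, cache-hit input, and output are priced differently, and that tool output re-entering the context on every turn is what dominates.

```python
# Illustrative only: the rates are placeholders, not any vendor's actual pricing.
RATES = {"input": 3.00, "cached_input": 0.30, "output": 15.00}  # USD per 1M tokens

def turn_cost(uncached_in: int, cached_in: int, out: int) -> float:
    return (uncached_in * RATES["input"]
            + cached_in * RATES["cached_input"]
            + out * RATES["output"]) / 1_000_000

# A single "task" is many turns: each tool call feeds its output back as new input.
turns = [
    # (uncached input, cache-hit input, output) per turn
    (6_000,      0, 800),   # first turn: nothing cached yet
    (1_500, 6_500, 600),    # tool result appended; the earlier prefix hits the cache
    (2_000, 8_200, 700),
]
total = sum(turn_cost(*t) for t in turns)
print(f"full-loop cost ≈ ${total:.4f}")
```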
That’s why engineering teams now intensely focus on log filtering, diff slicing, and output summarization—all aimed at controlling “Tokens shown to the model,” stripping irrelevant data from context.
Spec Coding and multi-Agent collaboration further inflate cost structures: with Spec Coding, Agent outputs include not just code, but Specs, Plans, ToDos, changelogs, and checklists—intermediate artifacts repeatedly generated, referenced, and iterated, becoming persistent context. Multi-Agent setups turn Tokens into a communication efficiency problem—what to transmit, at what fidelity, how to compress, and how to avoid distortion—all trade-offs between cost and information fidelity.
The Real Battlefield of Token Engineering: Context Management
Developers still care about three things: effectiveness, efficiency, and cost. But what determines these is no longer how well a single prompt is written—it’s how Agents organize context, invoke tools, and retry/correct behind the scenes.
This means Token complexity now falls squarely on AI coding tools themselves.
On most model platforms, context execution relies on KV cache. In theory, unchanged context hits cache, avoiding recomputation. But in practice, cache hits are far from ideal. On a miss, you effectively pay again for the same context: API users re-incur prefill costs; subscription users hit rate limits earlier and more frequently—and rate limits are highly correlated with cache usage.
That’s why context engineering at the platform level sounds so “utilitarian”: maximize the KV cache hit rate. Not to save pennies, but to prevent long-horizon Agent tasks from being dragged down by redundant, meaningless context refreshes that cripple throughput and stability.
What information should persist long-term versus what’s temporary? What should be cleaned after phase completion? These choices directly determine cost structure, rate ceilings, and whether Agents can sustain execution.
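One common way to make those choices cache-friendly, sketched below under the assumption of prefix-based KV caching (an identical leading run of tokens is reused across calls): keep the stable part first and never rewrite it, append history rather than editing it in place, and put per-turn material at the tail.

```python
def build_prompt(stable_prefix: str, history: list[str], dynamic_tail: str) -> str:
    """KV-cache-friendly assembly: never edit or reorder anything before the tail.

    Prefix caching matches on identical leading tokens, so the stable part
    (system prompt, Spec, project rules) goes first and is never rewritten;
    per-turn material (fresh tool output, the current step) goes last.
    """
    return "\n".join([stable_prefix, *history, dynamic_tail])

stable = "SYSTEM RULES + SPEC (unchanged for the whole task)"
history: list[str] = []            # append-only: past turns are never rewritten in place
for step in ["read failing test", "apply patch", "re-run tests"]:
    prompt = build_prompt(stable, history, dynamic_tail=f"CURRENT STEP: {step}")
    # ... call the model with `prompt`; the shared prefix can keep hitting the KV cache ...
    history.append(f"RESULT OF {step}: <summarized tool output>")
```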
The Technical Evolution of Context Engineering
Looking back, Tianzhu believes there’s no silver bullet in context engineering. Instead, the field has evolved through trial and error—from early Prompt Engineering to today’s more systematic Context Engineering.
Initially, Fine-Tuning seemed a direct solution: injecting domain knowledge into the base model to fill world-model gaps. But practice showed it’s costly, inflexible, and ill-suited for frequent model switching in AI coding. In contrast, RAG-style “plug-in knowledge” proved more cost-effective and aligned with real-world usage.
In early AI coding, interactions were chat-based Q&A, making “first-turn context” critical—driving the adoption of RAG recall mechanisms.
With Coding Agents, collaboration shifted to multi-turn, long-horizon Agent Loops. Now, context doesn’t need to be fully provided upfront; Agents retrieve on-demand during execution. This spurred embedding search and grep capabilities.
This year, Cline and Claude Code moved from traditional RAG to grep, with Cline bluntly stating, “We can’t solve today’s problems with 2023 methods.”
Importantly, embedding search isn’t obsolete—it’s like a database index, boosting recall efficiency under certain conditions. Grep excels in determinism and exact matching. They often serve different retrieval stages and needs.
As task complexity grows, Agentic Search emerges, often paired with Sub-Agent mechanisms. In complex scenarios, much tool-call history lacks long-term value, leading to dedicated Search Agents: they run independent loops for multi-round retrieval, filtering, and validation, treating embedding search, grep, and LSP as composable tools.
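A minimal sketch of such a Search Agent, with `grep_search` and `embedding_search` as hypothetical stubs: it runs its own retrieve-and-filter loop and hands back only a distilled summary, so the noisy tool-call history never enters the main Agent's context.

```python
def grep_search(pattern: str) -> list[str]:
    # Hypothetical stub: exact, deterministic matching over the repository.
    return [f"src/auth.py:42: matches {pattern!r}"]

def embedding_search(query: str, k: int = 5) -> list[str]:
    # Hypothetical stub: fuzzy, recall-oriented semantic retrieval.
    return [f"semantic hit for {query!r} #{i}" for i in range(k)]

def search_agent(question: str, max_rounds: int = 3) -> str:
    """Run an independent retrieve-and-filter loop, then return only a summary.

    The intermediate tool-call history stays inside this sub-agent, so it never
    bloats the main Agent's context."""
    evidence: list[str] = []
    for _ in range(max_rounds):
        evidence += grep_search(question) + embedding_search(question)
        if len(evidence) >= 5:  # hypothetical stopping rule: enough evidence gathered
            break
    return f"Answer to {question!r}, distilled from {len(evidence)} snippets (history discarded)"

summary = search_agent("where is the rate limiter configured?")
```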
At this stage, the industry realizes: the real scarcity isn’t context length—it’s the ability to organize effective context.
Split context into stable and dynamic parts; keep dynamic segments precise, length-controlled, and verifiable; then use caching, trimming, summarization, and retrieval to keep Token marginal costs within engineering tolerance—while avoiding rework from missing critical info. This is where Token engineering ultimately lands.
Final Thoughts
If we view AI programming as a systems engineering challenge, it comprises at least four layers: models for “thinking,” tools for “acting,” IDEs for human-AI interaction, and context for “memory and continuity.”
Models set the ceiling; tools determine if it can actually do things; IDEs enable humans to express intent efficiently and correct errors in time; context binds it all together—carrying historical decisions, engineering constraints, and continuity, forming the foundation of long-term reliability.
Thus, the true watershed in future AI programming may not just be “whose model is stronger,” but who can consistently and accurately transform the implicit constraints, memories, and consensus of the engineering world into context structures that models can understand, execute, and repeatedly verify.
As Tianzhu sees it, AI programming has never been just a contest of model capabilities—it’s the synergy between engineering systems and model power.
This is the “engineering gap” AI coding is now closing.
For deeper conversations and expanded details:
“Talking with CodeBuddy’s Huang Guangmin: A Pile of Smoking Context Is Deciding the Success or Failure of AI Programming” (https://www.infoq.cn/article/j8jMvIOfmZS8M33t72uP)
“Talking with TRAE’s Tianzhu: AI-Era Challenges and the Cognitive Evolution of Programmers” (https://www.infoq.cn/article/EiO80Aacx4sMZHjtXqNR)