In the last two years, performance on SWE-bench Verified jumped from 4.4% to over 70%; frontier models routinely solve repository-scale tasks that were unthinkable in 2023. In September 2025, Dario Amodei stated that "70, 80, 90% of the code written in Anthropic is written by Claude."
Industry surveys such as the 2025 DORA Report find that ~90% of software professionals now integrate AI into their workflows, with a majority describing their reliance as "heavy." At the same time, marketing copy has jumped straight to "weekend apps" and "coding is over." But when you look closely at production teams, the story is messier.
Rigorous studies find that many teams experience slowdowns, not speedups, once they try to push LLMs into big codebases: context loss, tooling friction, evaluation overhead, and review bottlenecks erode the promised gains and increase task completion time by 19%. Repository-scale benchmarks show that prompting techniques which boost performance by ~19% on simple tasks translate into only ~1–2% improvements at real-world repo scale.
In parallel, long-context research is very clear about the "context window paradox": as context length increases, performance often drops, sometimes by 13–85%, even when the model has perfect access to all relevant information. In another large study, models fell from ~95% to ~60–70% accuracy when exposed to long inputs mixing relevant tokens with realistic distractors. Larger apertures without structure amplify noise, not signal.
So, that raises a few concrete questions:
After spending weeks researching what's true and what's false, the conclusion was that what makes the big difference is not more model capability (at least not for now) or prompting tricks, but systematic context engineering and system design.
At this point, we had an intuition of what to do, but the question was how. We had to synthesize all those long nights of reading, so we came up with four optimization areas for building a production platform from the ground up:
Rather than arguing from theory alone, we spent 15 weeks on a part-time basis (nights and weekends) researching and developing this architecture into an internal, production-ready artifact: ~220k lines of clean code and 78 features. That became our lab and playground. Below is everything we learned, organized into four areas.
The first step is deliberately unglamorous: stop burning cycles on infrastructure no model can help you with. The logic is straightforward:
So the stack was curated with three criteria:
The tech stack, of course, shifts based on what you are building, but concretely, for this experiment, it meant:
Many modern AI-native platforms and vibe coding platforms converge on similar stacks (Expo is used by Bolt.new and Replit, Tailwind CSS and shadcn/ui are used by Lovable and v0, Supabase is used by Lovable). The point is not the specific logos; it's the principle: pick tools that shift labor up the stack, where models can actually help, and remove as much bespoke infra as possible.
Step 2 translates prompt engineering research into repeatable operational practices in our AI-native IDE (Cursor).
The key move is to treat prompts as structured artifacts, not ad-hoc chat messages. In practice, that translated into:
Cursor’s rules, agent configuration, and MCP integration serve as the execution surface for these workflows.
We came up with a static layer and a dynamic one.
The first layer is a static rulebook that encodes non-negotiable project invariants in a machine-readable way. These rules cover:
In Cursor, rules can be:
Conceptually, this is the "constitution layer": what must always hold, regardless of the local task. Instead of hoping the model "remembers" the conventions, you formalize them once as a shared constraint surface.
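To make "machine-readable" concrete, here is a minimal sketch of a CI check over such a rulebook, assuming the rules live as .mdc files with YAML frontmatter under Cursor's .cursor/rules/ directory; the script name, the gray-matter dependency, and the specific checks are illustrative assumptions, not part of the original setup.

```typescript
// scripts/check-rules.ts — illustrative CI guard (not from the original project):
// verify every rule file in the "constitution layer" stays machine-readable,
// so invariants never silently degrade into free-form prose.
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";
import matter from "gray-matter"; // assumed dependency for parsing YAML frontmatter

const RULES_DIR = ".cursor/rules"; // Cursor's project-rules directory

async function main(): Promise<void> {
  const files = (await readdir(RULES_DIR)).filter((f) => f.endsWith(".mdc"));
  const problems: string[] = [];

  for (const file of files) {
    const raw = await readFile(path.join(RULES_DIR, file), "utf8");
    const { data } = matter(raw); // frontmatter -> plain object

    if (!data.description) problems.push(`${file}: missing "description"`);
    if (!data.globs && data.alwaysApply !== true)
      problems.push(`${file}: neither "globs" nor "alwaysApply" — relies on manual/agent attachment`);
  }

  if (problems.length > 0) {
    console.error(problems.join("\n"));
    process.exit(1);
  }
  console.log(`${files.length} rule files OK`);
}

main();
```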
The second layer is a Native Repo MCP server that exposes live project structure as tools, resources, and prompts (so you don’t write the same thing over and over).
Of course, you can personalize this based on your repo/project. For our experiment, the MCP modules included:
Instead of front-loading giant blobs of text into every prompt, the IDE + MCP gives the agent an API over the codebase and metadata. The model decides on its own what it needs and requests it via tools and resources.
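As a rough illustration, a repo MCP server of this kind can be built with the official TypeScript SDK; in the sketch below, the tool, resource, and prompt names and the file paths are placeholders, not the actual modules from this experiment.

```typescript
// Minimal repo MCP server sketch (names and paths are illustrative).
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { readFile } from "node:fs/promises";
import { z } from "zod";

const server = new McpServer({ name: "repo-context", version: "0.1.0" });

// Tool: the agent pulls one table's schema on demand instead of receiving
// the entire database schema in every prompt.
server.tool(
  "get_db_schema",
  { table: z.string().describe("Table name, e.g. 'profiles'") },
  async ({ table }) => ({
    content: [{ type: "text", text: await readFile(`docs/schema/${table}.sql`, "utf8") }],
  })
);

// Resource: stable project conventions, fetched only when relevant.
server.resource("conventions", "repo://conventions", async (uri) => ({
  contents: [{ uri: uri.href, text: await readFile("docs/CONVENTIONS.md", "utf8") }],
}));

// Prompt template: a reusable "implement feature" scaffold, so the same
// instructions are not retyped for every task.
server.prompt("implement-feature", { spec: z.string() }, ({ spec }) => ({
  messages: [
    {
      role: "user",
      content: {
        type: "text",
        text: `Implement the following spec. Fetch schemas and conventions via tools/resources before writing code.\n\n${spec}`,
      },
    },
  ],
}));

await server.connect(new StdioServerTransport());
```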
This addresses several failure modes identified in long-context research:
Overall, it gives the agent inside the IDE more room for autonomy within session and repo boundaries.
A static/dynamic context substrate is only useful if you activate it consistently. The final part of Step 3 is a set of specification templates within the repo MCP that:
MCP’s three primitives (tools, resources, prompts) are used to standardize this activation: the agent can list capabilities, fetch structured metadata, and then apply prompt templates.
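Seen from the agent's side, and assuming the illustrative server sketched earlier, the activation flow looks roughly like this; the client calls come from the official TypeScript SDK, while the names and arguments are placeholders.

```typescript
// Sketch of the activation flow: discover capabilities, fetch structured
// metadata, then apply a standardized prompt template.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "spec-activator", version: "0.1.0" });
await client.connect(
  new StdioClientTransport({ command: "node", args: ["dist/repo-context.js"] })
);

// 1. List capabilities instead of assuming them.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name)); // e.g. ["get_db_schema", ...]

// 2. Fetch structured metadata on demand.
const schema = await client.callTool({
  name: "get_db_schema",
  arguments: { table: "profiles" },
});
console.log(schema);

// 3. Expand a spec template with task-specific input.
const prompt = await client.getPrompt({
  name: "implement-feature",
  arguments: { spec: "Add avatar upload to the profile screen" },
});
console.log(prompt.messages.length);
```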
(Note that frameworks like spec-kit (with its constitution.md and /speckit.specify flow) and Archon (as a vector-backed context hub) offer alternatives to this static/dynamic pattern.)
The fourth step moves from "single AI assistant" to orchestrated agents & resources, and it reuses the same context architecture built in Area 3 and the dynamics built in Areas 1–2. The agents use the static–dynamic context as shared infrastructure at the system level.
Hierarchical, Role-Based Agents with Validation Gates
Agent orchestration follows role specialization and hierarchical delegation:
This is paired with a strict two-phase flow:
Phase 1 – Agentic planning & specification. Planning agents, guided by AGENTS.md ("README for AI agents"), produce a central, machine-readable spec. Similar to Product Requirement Prompts (PRPs), this artifact bundles product requirements with curated contextual intelligence. It encodes not just what to build, but how it will be validated. Cursor's Planning mode turned out to be a strong baseline assistant, and the following tools proved consistently useful alongside it for inspiration and execution:
Phase 2 – Context-engineered implementation. Implementation agents operate under that spec, with the rulebook and MCP providing guardrails and live project knowledge. Execution is governed by a task manager that applies weighted autonomy: high-risk actions (e.g., migrations, auth changes) require human checkpoints; low-risk refactors can run with more freedom.
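To show what weighted autonomy can mean in practice, here is a minimal, hypothetical encoding of the gating logic; the task categories, weights, and threshold are illustrative assumptions, not the exact policy the task manager used.

```typescript
// Hypothetical "weighted autonomy" gate: each task carries a risk score, and
// the task manager decides whether the agent may proceed autonomously or must
// stop at a human checkpoint. Categories, weights, and threshold are examples.
type TaskCategory = "refactor" | "ui" | "api" | "db-migration" | "auth";

interface Task {
  id: string;
  category: TaskCategory;
  touchesProductionData: boolean;
}

const CATEGORY_RISK: Record<TaskCategory, number> = {
  refactor: 1,
  ui: 1,
  api: 2,
  "db-migration": 4,
  auth: 5,
};

const HUMAN_CHECKPOINT_THRESHOLD = 4;

function requiredGate(task: Task): "autonomous" | "human-checkpoint" {
  const risk = CATEGORY_RISK[task.category] + (task.touchesProductionData ? 2 : 0);
  return risk >= HUMAN_CHECKPOINT_THRESHOLD ? "human-checkpoint" : "autonomous";
}

// A low-risk refactor runs freely; an auth change waits for human review.
console.log(requiredGate({ id: "T-12", category: "refactor", touchesProductionData: false })); // "autonomous"
console.log(requiredGate({ id: "T-13", category: "auth", touchesProductionData: true }));      // "human-checkpoint"
```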
Crucially, validation is embedded:
Specifications include explicit validation loops: which linters, tests, and commands the agent must run before declaring a task complete (see the sketch after this list).
Risk analysis upfront (drawing on tools like TaskMaster’s complexity assessment) informs task splitting and prioritization.
Consistency checks detect mismatches between specs, plans, and code before they harden into defects.
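As a concrete sketch of the first point above, a spec's validation loop can be encoded as data plus a small runner the agent must pass before marking a task complete; the commands are generic examples, not the project's actual scripts.

```typescript
// Illustrative validation loop embedded in a spec: the commands an agent must
// run (and pass) before it may declare the task done. Commands are examples.
import { execSync } from "node:child_process";

interface ValidationLoop {
  lint: string[];
  typecheck: string[];
  test: string[];
}

const validation: ValidationLoop = {
  lint: ["npx eslint . --max-warnings 0"],
  typecheck: ["npx tsc --noEmit"],
  test: ["npx jest --ci"],
};

function runValidation(loop: ValidationLoop): boolean {
  for (const cmd of [...loop.lint, ...loop.typecheck, ...loop.test]) {
    try {
      execSync(cmd, { stdio: "inherit" }); // throws on a non-zero exit code
    } catch {
      console.error(`Validation failed: ${cmd}`); // agent must fix and re-run
      return false;
    }
  }
  return true;
}

if (!runValidation(validation)) process.exit(1);
```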
That split is not a universal constant, but to us it points to a concrete pattern: agents absorb routine implementation (coding), while senior software engineers concentrate on design and risk.
This is a single experiment with limitations:
The broader best practices (Area 2) and the following design principles, however, are portable:
Separate stable constraints from the dynamic state. Encode invariants in a rulebook; expose current reality via a programmatic context layer.
Expose context via queries, not monolithic dumps. Give agents MCP-style APIs into schemas, components, and scripts rather than feeding them full repositories.
Standardize context at the system level. Treat static + dynamic context as shared infrastructure for all agents, not as a per-prompt add-on.
Front-load context engineering where the payoff amortizes. For multi-sprint or multi-month initiatives, the fixed cost of building rulebooks and MCP servers pays back over thousands of interactions. For a weekend project, it likely does not.
In this experiment, the main constraint on AI-assisted development was not model capability but how context was structured and exposed. The biggest gains came from stacking four choices: an AI-friendly, low-friction stack; prompts treated as operational artifacts; a two-layer context substrate (rules + repo MCP); and agent orchestration with explicit validation and governance.
This is not a universal recipe. But the pattern that emerged is clear: if you treat context as a first-class architectural concern rather than a "side effect" of a single chat/session, agents can credibly handle most routine implementation while humans focus on system design and risk.
Disclosure: this space moves fast enough that some specifics may age poorly, but the underlying principles should remain valid.


