What Is Agentic Engineering? Complete History: From Turing to Karpathy, AutoGPT to Autoresearch & Beyond (2026)
The complete history of agentic engineering from Turing's first spark to Karpathy's 2026 declaration. How AI agents evolved from academic papers to a $4.7B industry, why vibe coding became passé, and what the shift to orchestrating autonomous agents means for every builder. Updated March 2026.
Agentic engineering is the discipline that will define how software gets built for the next decade. But it did not appear overnight. It is the product of seven decades of research, three waves of AI hype, a handful of viral open-source projects, one Stanford PhD who keeps coining the right term at the right time, and an industry that finally has models smart enough to act on their own.
This is the complete history — from Alan Turing's first spark to Andrej Karpathy's February 2026 declaration that vibe coding is passé, and from AutoGPT's 100,000-star explosion to the Agentic AI Foundation that now governs the standards. Every milestone, every inflection point, every thread that connects the dots.
TL;DR: Agentic engineering — coined by Karpathy in Feb 2026 — is orchestrating AI agents with human oversight. It evolved through 70+ years: Turing (1950) → deep learning (2012) → Transformers (2017) → AutoGPT (2023) → MCP (2024) → vibe coding (2025) → agentic engineering (2026). The $4.7B market is projected to hit $12.3B by 2027. Gartner predicts 40% of enterprise apps will have AI agents by end of 2026. Taskade Genesis embodies this evolution — 130,000+ apps built with AI agents, automations, and workspace-level orchestration.
What Is Agentic Engineering?
Agentic engineering is a software development approach where humans orchestrate AI agents who do the actual coding, testing, and deployment, while the human provides architectural oversight, quality standards, and strategic direction. The term was coined by Andrej Karpathy on February 8, 2026, as the professional successor to vibe coding.
Karpathy's exact words:
"Agentic, because the new default is that you are not writing the code directly 99% of the time. You are orchestrating agents who do and acting as oversight. Engineering, to emphasize that there is an art and science and expertise to it."
This is not casual prompting. It is not "accept all and hope for the best." It is a discipline — with principles, tools, patterns, and a 70-year intellectual lineage that makes it the logical conclusion of everything computer science has been building toward.
To understand why agentic engineering matters, you need to understand where it came from.

The Prehistory: Foundations of Machine Intelligence (1950–2011)
Alan Turing and the First Spark (1950)
Every history of AI begins with Alan Turing. His 1950 paper "Computing Machinery and Intelligence" asked the question that launched the field: Can machines think?
Turing proposed what became known as the Turing Test — if a machine can converse with a human and the human cannot reliably distinguish it from another human, the machine can be said to "think." This was not a technical specification. It was a philosophical provocation. And it worked — it gave the field a North Star.

A rebuilt "Bombe" machine designed by Alan Turing. The device allowed the British to decipher encrypted German communication during World War II. Image credit: Antoine Taveneaux
The Birth of AI as a Field (1956)
In 1956, John McCarthy coined the term "artificial intelligence" at the Dartmouth Conference — a summer workshop where a small group of researchers declared that "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."
The optimism was extraordinary. Herbert Simon predicted in 1957 that within ten years, a computer would be chess champion and discover an important mathematical theorem. He was wrong by about three decades on the chess part and arguably still waiting on the math.
The First AI Winter (1974–1980)
Early AI research hit a wall. The models were too simple, the computers too slow, and the problems too hard. Funding dried up. DARPA cut grants. The field entered its first "AI winter" — a period of reduced funding and pessimism that would repeat.
Expert Systems and the Second Winter (1980–1993)
The 1980s brought expert systems — rule-based programs that encoded human knowledge into if-then rules. Companies like Digital Equipment Corporation deployed XCON, which saved $40 million annually configuring computer orders. But expert systems were brittle, expensive to maintain, and could not learn or adapt. The second AI winter followed.
The Neural Network Renaissance (1986–2011)
Geoffrey Hinton's backpropagation work in 1986 laid the groundwork for neural networks that could actually learn. But the real breakthrough came in 1997 when IBM's Deep Blue defeated world chess champion Garry Kasparov — the moment AI entered public consciousness.

Garry Kasparov competing against IBM's Deep Blue chess computer in 1997. Image credit: kasparov.com
The 2000s brought big data, better algorithms, and increasing compute. By 2011, IBM Watson won Jeopardy!, and the stage was set for the deep learning revolution that would change everything.
| Year | Milestone | Significance |
|---|---|---|
| 1950 | Turing's "Computing Machinery and Intelligence" | Proposed the Turing Test, launched the field |
| 1956 | Dartmouth Conference | McCarthy coins "artificial intelligence" |
| 1957 | Perceptron (Frank Rosenblatt) | First trainable neural network model |
| 1974 | First AI Winter begins | Funding cuts, pessimism |
| 1986 | Backpropagation (Hinton et al.) | Neural networks can learn from errors |
| 1997 | Deep Blue defeats Kasparov | AI enters public consciousness |
| 2011 | IBM Watson wins Jeopardy! | NLP reaches mainstream awareness |
The Deep Learning Revolution (2012–2016)
ImageNet and the AlexNet Moment (2012)
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge. It won by a staggering margin — reducing the top-5 error rate from 26.2% to 15.3%. This was not an incremental improvement. It was a paradigm shift.
The key insight: deep convolutional neural networks, trained on GPUs, could learn visual features that hand-engineered systems could not. The entire computer vision field pivoted to deep learning within months.
This matters for the agentic engineering story because one of AlexNet's co-authors — Ilya Sutskever — would go on to co-found OpenAI. And one of the students in the Stanford lab that developed the ImageNet dataset was Andrej Karpathy, who would later coin both "vibe coding" and "agentic engineering."
Andrej Karpathy: The Thread Through the Story
To understand agentic engineering, you need to understand the man who named it.
Andrej Karpathy was born in Bratislava, Czechoslovakia, in 1986. His family moved to Toronto when he was 15. He completed his undergraduate degree in Computer Science and Physics at the University of Toronto in 2009, a master's at the University of British Columbia in 2011, and a PhD at Stanford in 2015 under Fei-Fei Li — the computer scientist behind ImageNet.
During his PhD, Karpathy interned at Google Brain (2011), Google Research (2013), and DeepMind (2015). He authored and became primary instructor of Stanford's CS 231n: Convolutional Neural Networks for Visual Recognition — one of the largest classes at Stanford, growing from 150 students in 2015 to 750 by 2017.
| Period | Role | Key Contribution |
|---|---|---|
| 2009–2015 | Stanford PhD student | ImageNet research, CS 231n course |
| 2015–2017 | OpenAI founding member | Research scientist, built core AI capabilities |
| 2017–2022 | Tesla Director of AI | Led Autopilot vision, real-world AI deployment |
| Feb 2023 | Returned to OpenAI | Brief second stint |
| Feb 2024 | Left OpenAI | Founded Eureka Labs |
| Feb 2025 | Coined "vibe coding" | Changed how millions think about AI-assisted building |
| Jun 2025 | YC AI Startup School | "Software Is Changing (Again)" — defined Software 3.0 |
| Dec 2025 | 2025 LLM Year in Review | Identified 6 paradigm shifts including "ghosts" and "vibe coding" |
| Feb 2026 | Coined "agentic engineering" | Declared vibe coding passé, named the next era |
| Mar 2026 | Released autoresearch | Open-source proof of agentic engineering in ML research |
Karpathy is not just an observer. He is the thread that connects deep learning research, real-world AI deployment at Tesla, OpenAI's foundational work, and the conceptual frameworks that name each era. When he coins a term, the industry listens.
DeepMind, AlphaGo, and Reinforcement Learning (2014–2016)
While Karpathy was at Stanford, Google acquired DeepMind in January 2014 for approximately $500 million. In March 2016, DeepMind's AlphaGo defeated world Go champion Lee Sedol 4-1 — a feat that many AI researchers had predicted was decades away.
AlphaGo's significance for the agentic engineering story: it demonstrated that AI could make decisions in complex, ambiguous environments with long-term consequences. Go has more possible board positions than atoms in the universe. AlphaGo learned to evaluate positions and plan sequences of moves — a precursor to the planning capabilities that modern AI agents would need.
The Transformer Paradigm (2017–2022)
"Attention Is All You Need" (2017)
In June 2017, eight Google researchers published a paper that would reshape the entire field: "Attention Is All You Need." The Transformer architecture they introduced replaced sequential processing with parallel attention mechanisms, enabling models to process entire sequences simultaneously.
The Transformer made everything that follows in this history possible — GPT, BERT, Claude, Gemini, and every AI agent that orchestrates them.
The same month the Transformer paper was published, Karpathy left OpenAI to become Tesla's Director of AI, where he would spend five years applying deep learning to real-world autonomous systems.
The GPT Series (2018–2022)
OpenAI used the Transformer to build the GPT (Generative Pre-trained Transformer) series:
| Model | Year | Parameters | Key Innovation |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Proved unsupervised pre-training works |
| GPT-2 | 2019 | 1.5B | "Too dangerous to release" (initially withheld) |
| GPT-3 | 2020 | 175B | Few-shot learning, first signs of emergent behavior |
| InstructGPT | 2022 | — | RLHF alignment, followed instructions better |
| ChatGPT | Nov 2022 | — | 100M users in 2 months, fastest-growing consumer app ever |
ChatGPT's launch in November 2022 was the moment AI went mainstream. It reached 100 million users in two months — faster than TikTok (9 months) and Instagram (2.5 years). For the first time, anyone could have a conversation with an AI that felt genuinely intelligent.
But ChatGPT was a chatbot, not an agent. It could answer questions, not take actions. The gap between "impressive conversational AI" and "autonomous AI agent" would take another year to begin closing.
The Academic Foundations of Agentic AI (2022)
Two academic papers published in 2022 laid the theoretical groundwork for everything that would follow:
Chain of Thought Prompting (Wei et al., 2022) — Researchers at Google demonstrated that prompting language models to "think step by step" dramatically improved performance on complex reasoning tasks. This was the first proof that LLMs could decompose problems into sequential steps — a prerequisite for any agent that needs to plan.
ReAct: Reasoning + Acting (Yao et al., 2022) — This paper introduced the agent loop that would power every subsequent AI agent framework: think → act → observe → repeat. ReAct showed that LLMs could synergize reasoning traces with tool use, overcoming hallucination by grounding responses in real-world interactions.
These papers were not consumer products. They were not viral tweets. But without Chain of Thought and ReAct, there is no AutoGPT, no LangChain, no Claude Code, and no agentic engineering.
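The ReAct loop is simple enough to sketch in a few lines of Python. This is an illustrative sketch only: `llm_reason` and the `tools` mapping are hypothetical stand-ins for a real model call and real tool integrations.

```python
def react_loop(goal, llm_reason, tools, max_steps=5):
    """Minimal ReAct-style loop: think -> act -> observe -> repeat.

    `llm_reason(history)` is a hypothetical stand-in for an LLM call that
    returns either ("final", answer) or ("act", tool_name, tool_input).
    `tools` maps tool names to plain Python callables.
    """
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = llm_reason(history)                  # think
        if decision[0] == "final":
            return decision[1]
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)      # act
        history.append(f"Observed: {observation}")      # observe
    return None  # step budget exhausted without an answer
```

The grounding effect the paper describes lives in the `history` list: each observation from a real tool call constrains the next round of reasoning.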
The Autonomous Agent Explosion (2023)
Toolformer: Machines Learn to Use Tools (February 2023)
In February 2023, Meta AI published Toolformer — a model that could teach itself which external tools (calculators, search engines, APIs) to call, when to call them, and how to incorporate results. This was the missing piece: language models that could not only reason but interact with the outside world.
AutoGPT: The Viral Proof of Concept (March 2023)
On March 30, 2023, game developer Toran Bruce Richards released AutoGPT — an open-source project that connected GPT-4 to a loop of planning, execution, and self-evaluation. AutoGPT could browse the web, write and execute code, manage files, and pursue multi-step goals with minimal human intervention.
The repository exploded. Within weeks, it had over 100,000 GitHub stars — one of the fastest-growing open-source projects in history.
AutoGPT was deeply flawed. It burned through API credits, got stuck in loops, and hallucinated confidently. But it proved something that academic papers could not: autonomous AI agents were not a research curiosity. They were a product category.
BabyAGI: The Minimalist Vision (April 2023)
Days after AutoGPT went viral, venture capitalist Yohei Nakajima released BabyAGI — a stripped-down Python script that demonstrated the core autonomous agent loop in just 140 lines of code. BabyAGI could create tasks, prioritize them, and execute them using GPT-4 and a vector database for memory.
If AutoGPT was the flashy demo, BabyAGI was the elegant proof that the agent pattern could be simple, composable, and practical.
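The pattern BabyAGI demonstrated — create tasks, prioritize them, execute them — can be sketched without any LLM at all. In the sketch below, `execute_task` and `spawn_tasks` are hypothetical stand-ins for the GPT-4 calls, and a plain deque stands in for the prioritized task list and vector memory.

```python
from collections import deque

def run_task_loop(objective, first_task, execute_task, spawn_tasks, max_tasks=10):
    """BabyAGI-style loop: pop the next task, execute it, let the model
    propose follow-up tasks, and repeat until the queue drains.

    `execute_task(objective, task)` and `spawn_tasks(objective, result)`
    are hypothetical stand-ins for LLM calls.
    """
    queue = deque([first_task])
    results = []
    while queue and len(results) < max_tasks:
        task = queue.popleft()
        result = execute_task(objective, task)         # execute
        results.append((task, result))
        queue.extend(spawn_tasks(objective, result))   # create new tasks
    return results
```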
LangChain: The Infrastructure Layer (2023)
Harrison Chase's LangChain emerged as the connective tissue of the agent ecosystem. What began as a library for chaining LLM calls evolved into a full orchestration framework with:
- Agent abstractions for tool use and planning
- Memory systems for maintaining conversation context
- Retrieval-augmented generation (RAG) for grounding responses in documents
- Integration with dozens of LLM providers and tools
LangChain's download numbers tell the story: 47+ million PyPI downloads and the largest community ecosystem in the agent space.
The Lilian Weng Blog Post (June 2023)
In June 2023, OpenAI researcher Lilian Weng published "LLM Powered Autonomous Agents" — a comprehensive blog post that became the definitive reference for how agent systems work. She formalized the architecture into four components:
- Planning — Task decomposition and self-reflection
- Memory — Short-term (context window) and long-term (vector databases)
- Tool use — APIs, code execution, web browsing
- Action — Executing plans in the real world
This framework became the blueprint that every subsequent agent platform would follow — including Taskade's AI Agents.
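Weng's four components map naturally onto a tiny data model. The class below is an illustrative sketch, not her implementation: every field and method name is invented for this example.

```python
class Agent:
    """Weng's four-component architecture as a minimal data model.
    Every field and method here is an illustrative placeholder."""

    def __init__(self, tools):
        self.plan = []          # Planning: decomposed sub-goals
        self.short_term = []    # Memory: in-context conversation history
        self.long_term = {}     # Memory: stand-in for a vector database
        self.tools = tools      # Tool use: name -> callable

    def act(self, tool_name, arg):
        """Action: execute one planned step via a named tool and
        record the observation in short-term memory."""
        observation = self.tools[tool_name](arg)
        self.short_term.append((tool_name, observation))
        return observation
```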
| Project | Launched | GitHub Stars | Key Innovation |
|---|---|---|---|
| AutoGPT | Mar 2023 | 100K+ | First viral autonomous agent |
| BabyAGI | Apr 2023 | 20K+ | Minimalist agent loop (140 lines) |
| LangChain | 2023 | 94K+ | Agent orchestration framework |
| MetaGPT | Mid 2023 | 48K+ | Multi-agent software company simulation |
| GPT-Engineer | Mid 2023 | 52K+ | Full codebase generation from prompts |

The Infrastructure Year (2024)
If 2023 was the year of viral demos, 2024 was the year the industry built real infrastructure.
GPT-4 and the Reasoning Revolution (2024)
OpenAI's GPT-4o launched in May 2024 — the first truly multimodal model handling text, audio, and vision in real-time. But the real paradigm shift came in September with o1-preview, OpenAI's first reasoning model that "thinks step by step" before answering.
This mattered enormously for agents: reasoning models could plan multi-step workflows, evaluate their own output, and course-correct — the exact capabilities that separate a useful agent from a hallucinating loop.
Devin: The First AI Software Engineer (March 2024)
On March 12, 2024, Cognition Labs announced Devin — marketed as "the world's first AI software engineer." Devin could plan and execute complex engineering tasks end-to-end, using a shell, code editor, and browser within a sandboxed environment.
Devin resolved 13.86% of real-world GitHub issues on the SWE-bench benchmark — far exceeding the previous state-of-the-art of 1.96%.
The reaction was polarizing. Some called it the beginning of the end for software engineering. Others pointed out that 13.86% was still failing 86% of the time. But Devin proved that autonomous coding agents were a real product category, not just an open-source experiment.
Anthropic's Model Context Protocol — MCP (November 2024)
In November 2024, Anthropic released the Model Context Protocol (MCP) — an open standard for connecting AI models to external tools and data sources. MCP defined how agents could securely interact with databases, APIs, file systems, and external services.
MCP was the USB-C of AI agents — a universal connector that made tools portable across platforms and reduced vendor lock-in. Its importance cannot be overstated: before MCP, every agent framework had its own proprietary tool integration. After MCP, tools became interoperable.
As of March 2026, MCP has been adopted by OpenAI, Google DeepMind, Microsoft, and dozens of other companies. It was donated to the Linux Foundation's Agentic AI Foundation in December 2025.
Karpathy's LLM OS Vision (2024)
Throughout 2024, Karpathy developed his vision of the LLM Operating System — the idea that LLMs are not chatbots but the kernel process of a new computing paradigm. He described the system:
"LLMs not as a chatbot, but the kernel process of a new Operating System. It orchestrates input and output across modalities (text, audio, vision), code interpreter ability to write and run programs, browser/internet access, and embeddings database for files and internal memory storage and retrieval."
This framing was prophetic. Every major agent platform in 2025-2026 — Taskade Genesis, Cursor, Claude Code, Devin — implements some version of the LLM OS architecture.
The Competitive Landscape Crystallizes
| Framework | Category | Launch | Key Innovation |
|---|---|---|---|
| LangGraph | Enterprise orchestration | 2024 | Graph-based stateful agent workflows |
| CrewAI | Business automation | 2024 | Role-based multi-agent systems |
| AutoGen (Microsoft) | Research | 2023-2024 | Asynchronous multi-agent conversations |
| OpenAI Function Calling | API | 2023-2024 | Native tool use in GPT models |
| Anthropic MCP | Standard | Nov 2024 | Universal agent-tool protocol |
| Devin (Cognition) | Autonomous coder | Mar 2024 | End-to-end software engineering |
The Vibe Coding Phenomenon (2025)
February 2, 2025: The Tweet That Changed Everything
On February 2, 2025, Andrej Karpathy posted a tweet that would become the most influential statement about software development since "move fast and break things":
"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
He elaborated: "I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like 'decrease the padding on the sidebar by half' because I'm too lazy to find it. I 'Accept All' always, I don't read the diffs anymore."
The term went supernova. Within months:
- Collins Dictionary named "vibe coding" its 2025 Word of the Year
- The vibe coding market grew to $4.7 billion (projected $12.3B by 2027, 38% CAGR)
- 63% of vibe coding users were non-developers
- r/vibecoding grew to 153,000+ members
- 25% of Y Combinator startups built 95% of their codebases using AI
Vibe coding gave permission. It told millions of people — many of them non-developers — that they could build software by describing what they wanted. The AI handles the code. You handle the vision.
Karpathy's Software 3.0 Framework (June 2025)
At Y Combinator's AI Startup School on June 17, 2025, Karpathy delivered a keynote titled "Software Is Changing (Again)" that formalized his thinking into the Software 3.0 framework:
| Era | Paradigm | Programming Interface | Who Programs |
|---|---|---|---|
| Software 1.0 | Code | Explicit instructions (C, Python, Java) | Trained developers |
| Software 2.0 | Weights | Data + optimization (neural networks) | ML engineers |
| Software 3.0 | Prompts | Natural language (English) | Everyone |
The key insight: LLMs are a new kind of programmable entity, and the programming language is natural language itself. This was not an incremental change — it was "the most profound shift in software development since the 1940s."
Karpathy's prescription: build "Iron Man suits" that augment expert capabilities, with a highly efficient "AI Generation → Human Verification" loop.
The Explosion of Vibe Coding Platforms (2025)
The vibe coding concept spawned an entire category of AI-powered development platforms:
| Platform | Category | Key Metric | Approach |
|---|---|---|---|
| Cursor | AI code editor | $2B ARR in 24 months | Background Agents in VS Code |
| Replit | Cloud IDE | 30M+ users | Browser-based, instant deployment |
| Lovable | App builder | $100M ARR | No-code, prompt-to-app |
| Bolt.new | Web builder | Rapid growth | Instant web app generation |
| Taskade Genesis | AI workspace | 130K+ apps built | Agents + automations + workspace |
| Windsurf | Code editor | Acquired by OpenAI ($3B) | AI-first development |
| v0 | UI builder | Vercel ecosystem | React component generation |
The Problems Surface (2025)
As vibe coding scaled, its limitations became impossible to ignore:
- Quality degradation — AI-generated code that "worked" on first test broke in edge cases, under load, or after updates
- Maintenance nightmare — Code nobody understands is code nobody can maintain
- Tech debt acceleration — Zoho CEO Sridhar Vembu's critique landed: "Vibe coding just piles up tech debt faster"
- Security vulnerabilities — Code generated without review contained injection vulnerabilities, leaked credentials, and insecure defaults
- The 80% problem — AI agents reliably handle 80% of a task but struggle with the remaining 20% that determines production readiness
Google's Addy Osmani crystallized the 80% problem: agents produce impressive first drafts that fail at the edges. The gap between "demo-quality" and "production-quality" became the central challenge.
Karpathy's 2025 LLM Year in Review (December 2025)
On December 19, 2025, Karpathy published his annual review identifying six paradigm shifts:
- RLVR (Reinforcement Learning from Verifiable Rewards) — The new dominant training methodology replacing RLHF
- Ghosts vs. Animals — LLMs are "summoned ghosts, not evolved animals" — optimized under entirely different constraints than biological intelligence
- Cursor / New LLM App Layer — Revealed a distinct bundling and orchestration layer for LLM applications
- Claude Code / AI on Your Computer — First convincing demonstration of extended agentic problem-solving: "a little spirit/ghost that lives on your computer"
- Vibe Coding — Code became "free, ephemeral, malleable, discardable after single use"
- Nano Banana / LLM GUI — First hints of graphical interfaces for LLMs
His conclusion about coding agents: they had "crossed a qualitative threshold since December — from brittle demos to sustained, long-horizon task completion with coherence and tenacity."
He described delegating an entire local deployment — SSH keys, vLLM, model download, benchmarking, server endpoint, UI, systemd service, and report — with minimal intervention. The future was not typing code. It was orchestrating agents.

The Agentic Engineering Era (2026)
February 8, 2026: Karpathy Declares Vibe Coding Passé
Exactly one year after coining vibe coding, Karpathy declared his own term obsolete:
"LLMs have gotten much smarter. Vibe coding is passe."
His replacement — agentic engineering — was deliberately chosen:
"Agentic, because the new default is that you are not writing the code directly 99% of the time. You are orchestrating agents who do and acting as oversight. Engineering, to emphasize that there is an art and science and expertise to it."
The key phrase: "orchestrating agents who do and acting as oversight." The human role shifted from code writer to system architect, agent director, and quality gatekeeper.
Why the Name Change Matters
This was not semantic wordplay. The shift from "vibe coding" to "agentic engineering" represented three critical changes:
| Dimension | Vibe Coding (2025) | Agentic Engineering (2026) |
|---|---|---|
| Philosophy | "Forget the code exists" | "Own the architecture, delegate the implementation" |
| Human role | Prompter | Architect + reviewer + orchestrator |
| Quality bar | "Does it seem to work?" | "Does it pass the test suite?" |
| AI role | Code generator | Autonomous agent with tools |
| Maintenance | "I'll prompt it again later" | Persistent memory + continuous testing |
| Professional legitimacy | Awkward in job descriptions | "Agentic Engineer" on your resume |
| Accountability | Unclear | Human owns the system |
Addy Osmani's Principles (February 2026)
Google engineering lead Addy Osmani published the most comprehensive framework for agentic engineering practice, which quickly became industry consensus:
1. Plan Before Prompting — Write a specification before touching an AI agent. Design docs, structured prompts, or task breakdowns — the spec is the highest-leverage artifact.
2. Direct with Precision — Give agents well-scoped tasks. The skill is decomposition: breaking a project into agent-sized work packages with clear inputs, outputs, and success criteria.
3. Review Rigorously — Evaluate AI output with the same rigor you would apply to a human engineer's PR. Do not assume the agent got it right because it looks right.
4. Test Relentlessly — "The single biggest differentiator between agentic engineering and vibe coding is testing." Test suites are deterministic validation for non-deterministic generation.
5. Own the System — Maintain documentation, use version control and CI, monitor production. The AI accelerates the work; you are responsible for the system.
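Principle 4 is the most concrete, and worth illustrating: deterministic tests gate non-deterministic generation. In the sketch below, `generated_slugify` is a hypothetical stand-in for a function an agent wrote; the suite does not care who authored it, only whether every regenerated version still passes.

```python
import re

# Hypothetical AI-generated function under review; an agent may
# regenerate or rewrite it freely between runs.
def generated_slugify(title: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# Deterministic validation for non-deterministic generation:
# every version the agent produces must pass the same fixed suite
# before it is accepted.
def test_slugify():
    assert generated_slugify("Hello, World!") == "hello-world"
    assert generated_slugify("  Agentic  Engineering ") == "agentic-engineering"
    assert generated_slugify("2026") == "2026"
```

The human writes and owns the test suite; the agent is free to change the implementation as long as the suite stays green.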
The Factory Model: From Coder to Conductor
Osmani also published "The Factory Model," describing the generational evolution of AI coding tools:
| Generation | Model | Human Role | Example |
|---|---|---|---|
| 1st Gen | Accelerated autocomplete | Writer with suggestions | GitHub Copilot (early) |
| 2nd Gen | Synchronous agents | Director with real-time review | Cursor, Claude Code |
| 3rd Gen | Autonomous agents | Architect with checkpoint review | Background Agents, Devin 2.0 |
The critical insight: "You are no longer just writing code. You are building the factory that builds your software."
And the data backed it up:
- New website creation: +40% year-over-year
- New iOS apps: +49% increase
- GitHub code pushes in US: +35% jump
These metrics had been flat for years. Agentic engineering was not just changing how software was built — it was changing how much software existed.
The Standards War (Late 2025 – 2026)
The Agentic AI Foundation — AAIF (December 2025)
On December 9, 2025, the Linux Foundation announced the formation of the Agentic AI Foundation (AAIF) — the first neutral governance body for AI agent standards.
Founding contributions:
- Anthropic → Model Context Protocol (MCP)
- Block → goose (open-source local-first agent framework)
- OpenAI → AGENTS.md (project-specific guidance standard)
Platinum members: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.
This was unprecedented. The companies building the most advanced AI systems — companies that compete fiercely on model quality — agreed to collaborate on the standards that connect those models to the real world.
Google's Agent2Agent Protocol — A2A (2025)
Google launched the Agent2Agent (A2A) protocol in April 2025 with support from over 50 partners including Salesforce, SAP, and ServiceNow. While MCP standardizes how agents connect to tools, A2A standardizes how agents communicate with each other.
The emerging stack:
| Layer | Standard | Purpose | Governed By |
|---|---|---|---|
| Agent-to-Tool | MCP | Connect agents to external tools and data | AAIF (Linux Foundation) |
| Agent-to-Agent | A2A | Inter-agent communication and coordination | Linux Foundation |
| Agent-to-Project | AGENTS.md | Project-specific agent configuration | AAIF |
The Enterprise Adoption Wave
Gartner and McKinsey data paint a clear picture of where the industry is heading:
| Metric | Value | Source |
|---|---|---|
| Enterprise apps with AI agents by end of 2026 | 40% (up from <5% in 2025) | Gartner |
| Enterprise software with agentic AI by 2028 | 33% | Gartner |
| Agentic AI annual value potential | $2.6T–$4.4T | McKinsey |
| Median ROI for mature implementations | 540% | McKinsey |
| Organizations investing in agentic AI | 61% (19% significant, 42% conservative) | Gartner |
| Agentic AI projects canceled by end of 2027 | >40% | Gartner |
| Day-to-day decisions made by agentic AI by 2028 | 15% (up from 0% in 2024) | Gartner |
The cancellation figure is sobering: Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027. Agentic engineering is not magic. Without the discipline Karpathy and Osmani describe, agent projects fail.
Karpathy's Autoresearch: Agentic Engineering in Action (March 2026)
On March 7, 2026, Karpathy open-sourced autoresearch — a 630-line Python tool that lets AI agents run autonomous ML experiments on a single GPU. It was not just a tool release. It was a live demonstration of every agentic engineering principle.
How It Works
Autoresearch gives an AI agent a small but real LLM training setup and lets it experiment overnight:
- Agent reads human-provided instructions (the spec)
- Agent modifies training code — architecture, optimizers, hyperparameters
- Training runs for exactly 5 minutes per experiment
- Agent evaluates results against an unambiguous metric: validation bits-per-byte (lower is better)
- Agent keeps or discards the change
- Repeat — approximately 12 experiments per hour, ~100 experiments overnight
AUTORESEARCH: AGENTIC ENGINEERING IN PRACTICE
══════════════════════════════════════════════
HUMAN (Agentic Engineer)          AI AGENT
┌──────────────────────┐     ┌──────────────────────┐
│ 1. Write spec        │────►│ 1. Read instructions │
│ 2. Set metric        │     │ 2. Modify code       │
│ 3. Review results    │◄────│ 3. Train (5 min)     │
│ 4. Adjust direction  │     │ 4. Evaluate metric   │
│                      │     │ 5. Keep or discard   │
│                      │     │ 6. Repeat ×100       │
└──────────────────────┘     └──────────────────────┘
Principles demonstrated:
✓ Plan before prompting (human writes spec)
✓ Direct with precision (5-min time budget, single metric)
✓ Test relentlessly (every experiment evaluated)
✓ Own the system (human reviews final results)
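The keep-or-discard loop amounts to a hill climb on a single metric, and can be sketched as follows. `propose_change` and `train_and_eval` are hypothetical stand-ins for the agent's code edit and the fixed five-minute training run; this illustrates the pattern, not Karpathy's actual 630-line tool.

```python
def research_loop(baseline_config, propose_change, train_and_eval, budget=100):
    """Autoresearch-style hill climb on one unambiguous metric.

    `propose_change(config)` and `train_and_eval(config)` are hypothetical
    stand-ins for the agent's code modification and the fixed-time
    training run. Lower validation bits-per-byte is better, so a change
    is kept only when it beats the best score so far.
    """
    best_config = baseline_config
    best_bpb = train_and_eval(baseline_config)
    for _ in range(budget):
        candidate = propose_change(best_config)
        bpb = train_and_eval(candidate)
        if bpb < best_bpb:                    # keep the change
            best_config, best_bpb = candidate, bpb
        # otherwise: discard and branch again from the current best
    return best_config, best_bpb
```

The unambiguous metric is what makes the loop safe to run unattended overnight: every experiment either measurably helps or is thrown away.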
Real-World Impact
Following the release, Shopify CEO Tobi Lutke adapted the autoresearch framework internally. An agent-optimized smaller model achieved a 19% improvement in validation scores, eventually outperforming a larger model configured through standard manual methods.
This was agentic engineering working exactly as Karpathy described: human sets the goal, agent executes autonomously, results are objectively measurable, and the human reviews and adjusts direction.
The Shopify Precedent: Agentic Engineering Goes Corporate
Shopify's adoption of agentic engineering principles deserves special attention because it shows where every company is heading.
In April 2025, Shopify CEO Tobi Lutke sent an internal memo that became public:
"Reflexive AI usage is now a baseline expectation at Shopify."
The key mandate: before requesting additional headcount, teams must demonstrate why they cannot accomplish the work using AI. The memo asked teams to consider: "What would this area look like if autonomous AI agents were already part of the team?"
This is agentic engineering applied to organizational design — not just code, but every knowledge work function.
How Taskade Genesis Embodies Agentic Engineering
When Karpathy described agentic engineering — "orchestrating agents who do and acting as oversight" — he described the architecture Taskade Genesis has been building since launch.
The Workspace DNA Architecture
Taskade Genesis implements agentic engineering through three pillars that form a self-reinforcing loop:
| Agentic Engineering Principle | Workspace DNA Pillar | Implementation |
|---|---|---|
| Persistent context | Memory (Projects) | Projects store data, history, and context across 8 views (List, Board, Calendar, Table, Mind Map, Gantt, Org Chart, Timeline) |
| Autonomous execution | Intelligence (Agents) | AI Agents v2 with 22+ built-in tools, custom tools via MCP, persistent memory, multi-agent collaboration |
| Reliable workflows | Execution (Automations) | Automations with Temporal durable execution, 100+ integrations, branching/looping/filtering |
Memory feeds Intelligence → Intelligence triggers Execution → Execution creates Memory. This is not a marketing framework. It is the engineering architecture that makes agentic engineering practical at scale.
Why Platform Beats Framework
The tools comparison for agentic engineering reveals a critical insight:
| Approach | Example | Requires | Deploys To | Maintains Via |
|---|---|---|---|---|
| Code generator | Cursor, Devin | Developer skills | Separate hosting | Manual updates |
| Agent framework | CrewAI, LangGraph | Python skills | BYO infrastructure | Custom code |
| AI workspace | Taskade Genesis | Natural language | Instant (built-in) | Agents + automations |
For the 63% of AI-assisted builders who are non-developers, Taskade Genesis is the only platform that implements all five agentic engineering principles without requiring code:
- Plan → Write a detailed prompt (the spec)
- Direct → AI agents build the app using 11+ frontier models from OpenAI, Anthropic, and Google
- Review → Interact with the live app immediately
- Test → Iterate by describing changes
- Own → AI agents and automations maintain the system over time
130,000+ apps built. Custom domains, password protection, Community Gallery publishing, 7-tier RBAC (Owner, Maintainer, Editor, Commenter, Collaborator, Participant, Viewer).

The Complete Timeline: From Turing to Agentic Engineering
| Year | Event | Significance for Agentic Engineering |
|---|---|---|
| 1950 | Turing's "Computing Machinery and Intelligence" | First formal framework for machine intelligence |
| 1956 | Dartmouth Conference — "AI" coined | Field gets a name |
| 1986 | Backpropagation (Hinton) | Neural networks can learn |
| 1997 | Deep Blue defeats Kasparov | AI beats humans at complex strategy |
| 2012 | AlexNet wins ImageNet | Deep learning revolution begins |
| 2015 | OpenAI founded (Karpathy co-founds) | Mission: safe, beneficial AGI |
| 2016 | AlphaGo defeats Lee Sedol | AI handles ambiguous, long-horizon planning |
| 2017 | "Attention Is All You Need" (Transformer) | Architecture that enables everything |
| 2017 | Karpathy joins Tesla as Director of AI | Real-world AI deployment at scale |
| 2018 | GPT-1 | Unsupervised pre-training works |
| 2020 | GPT-3 (175B parameters) | Emergent few-shot learning |
| 2022 | Chain of Thought prompting (Wei et al.) | LLMs can reason step-by-step |
| 2022 | ReAct: Reasoning + Acting (Yao et al.) | Think → Act → Observe loop |
| Nov 2022 | ChatGPT launches | AI goes mainstream (100M users in 2 months) |
| Feb 2023 | Toolformer (Meta) | LLMs learn to use external tools |
| Mar 2023 | AutoGPT released | 100K+ stars, autonomous agents go viral |
| Apr 2023 | BabyAGI released | Minimalist agent loop proves the pattern |
| Jun 2023 | Lilian Weng's agent architecture post | Definitive reference for agent design |
| 2023 | LangChain ecosystem emerges | Agent orchestration infrastructure |
| Feb 2024 | Karpathy leaves OpenAI, founds Eureka Labs | Independent AI education and research |
| Mar 2024 | Devin announced (Cognition) | "First AI software engineer" — 13.86% SWE-bench |
| Sep 2024 | OpenAI o1-preview | First reasoning model, think-before-answer |
| Nov 2024 | Anthropic releases MCP | Universal agent-tool protocol |
| Dec 2024 | OpenAI o3 preview | 87.5% on ARC-AGI benchmark |
| Feb 2025 | Karpathy coins "vibe coding" | "Forget the code exists" — goes viral |
| Apr 2025 | Google launches A2A protocol | Agent-to-agent communication standard |
| Apr 2025 | Shopify memo: "Reflexive AI usage" | Enterprise agentic engineering mandate |
| Jun 2025 | Karpathy YC keynote: Software 3.0 | Natural language as programming interface |
| Aug 2025 | GPT-5 launches | Algorithmic efficiency > brute-force scale |
| Nov 2025 | Collins Dictionary: "vibe coding" Word of Year | Cultural mainstreaming of AI-assisted building |
| Dec 2025 | AAIF formed (Linux Foundation) | Neutral governance for agent standards |
| Dec 2025 | Karpathy: 2025 LLM Year in Review | 6 paradigm shifts, "ghosts on your computer" |
| Feb 2026 | Karpathy coins "agentic engineering" | Declares vibe coding passe |
| Feb 2026 | Osmani publishes agentic engineering principles | 5 principles become industry consensus |
| Mar 2026 | Karpathy releases autoresearch | Live demo of agentic engineering in ML research |
What Comes Next: The Agentic Engineering Roadmap
The trajectory from vibe coding to agentic engineering points to a clear future:
Phase 1: Vibe Coding (2025) — Completed
Humans prompt, AI generates, humans accept or reject. Minimal oversight, minimal quality control. Proved the concept: AI can write functional software.
Phase 2: Agentic Engineering (2026) — Current
Humans architect and oversee, AI agents implement with human review. The middle loop emerges. Quality improves dramatically. The discipline gets a name and principles.
Phase 3: Supervised Autonomy (2027–2028)
AI agents handle entire subsystems with human checkpoint reviews. Agents run test suites, fix their own bugs, and flag only high-risk changes for human review. The middle loop becomes shorter and more focused.
Phase 4: Autonomous Systems (2029+)
AI agents build, maintain, and improve software autonomously. Humans set goals and constraints; agents handle everything else. Karpathy's "tokens tsunami" — tight agentic loops requiring massive token throughput — becomes the dominant compute workload.
Taskade Genesis is built for this trajectory. Workspace DNA — Memory, Intelligence, Execution — provides the foundation where each phase builds on the previous one. Today's agentic engineering becomes tomorrow's supervised autonomy, all within the same workspace.

The Agentic Engineering Stack (2026)
For Non-Developers
| Layer | Tool | Purpose |
|---|---|---|
| Specification | Natural language prompt | Define what to build |
| Building | Taskade Genesis | AI agents build the app |
| Infrastructure | Taskade Workspace | Database, hosting, security, 8 views |
| Intelligence | Taskade AI Agents | 22+ tools, persistent memory, multi-agent |
| Automation | Taskade Automations | 100+ integrations, Temporal durable execution |
| Deployment | Instant (built-in) | Custom domains, password protection |
For Developers
| Layer | Tool Options | Purpose |
|---|---|---|
| Specification | Design docs, structured specs | Define architecture + requirements |
| Building | Cursor, Claude Code, Devin, Genesis | AI agents write code |
| Orchestration | LangGraph, CrewAI, AutoGen | Multi-agent coordination |
| Testing | TDD frameworks, CI pipelines | Deterministic validation |
| Standards | MCP, A2A, AGENTS.md | Interoperability |
| Deployment | CI/CD, or Taskade for instant deploy | Ship to production |
The Convergence
The agentic engineering landscape is moving toward what industry analysts call the Agentic Mesh — a modular ecosystem where different tools specialize in different layers:
| Layer | Best Tool | Function |
|---|---|---|
| End-user apps | Taskade Genesis | Non-developers build living software |
| Business automation | CrewAI | Role-based multi-agent workflows |
| Enterprise orchestration | LangGraph | Production agent systems |
| Code development | Cursor, Devin, Claude Code | AI-assisted engineering |
| Standards | MCP + A2A (AAIF) | Universal interoperability |
| Model infrastructure | OpenAI, Anthropic, Google | Foundation models |
The winning strategy is not choosing one tool. It is choosing the right tool for each layer. For most teams, that means Taskade Genesis for end-user applications and team tools, combined with developer-focused agents for custom engineering work.
Start practicing agentic engineering →
Related Reading
- From Vibe Coding to Agentic Engineering: What Karpathy's New Term Means — Deep dive on the paradigm shift
- Agentic Engineering Tools and Platforms — 10+ platforms compared
- What Is Vibe Coding? — The foundational concept Karpathy evolved from
- Best Vibe Coding Tools — 15 tools for the full spectrum
- What Is OpenAI? Complete History — The company behind GPT and the agent revolution
- What Is Anthropic? History of Claude AI — MCP, Claude Code, and the safety-first approach
- What Are AI Agents? — Foundational guide to AI agents
- How Workspace DNA Works Inside Taskade Genesis — The architecture behind it
- Vibe Coding vs No-Code vs Low-Code — How AI app building compares
- What Are AI Micro Apps? — The output of agentic engineering at scale
- Vibe Coding for Teams — Team-level agentic engineering in practice
FAQ
What exactly is agentic engineering?
Agentic engineering is orchestrating AI agents who write, test, and deploy code while you provide architectural oversight, quality standards, and strategic direction. Coined by Andrej Karpathy in February 2026, it emphasizes that directing AI agents effectively is an art and science — not just casual prompting. The five core principles: plan, direct, review, test, own.
How is agentic engineering different from vibe coding?
Vibe coding means accepting whatever AI generates without rigorous review. Agentic engineering adds five disciplines: plan before prompting, direct with precision, review rigorously, test systematically, and own the architecture. Both use AI to build software, but agentic engineering produces production-quality results.
Who coined the term and when?
Andrej Karpathy coined agentic engineering on February 8, 2026. He had previously coined vibe coding on February 2, 2025. Exactly one year later, he declared vibe coding passe because LLMs had gotten smart enough that casual prompting was no longer sufficient — orchestration with oversight was the new professional standard.
What are the five principles of agentic engineering?
Google's Addy Osmani codified them: 1) Plan before prompting — write specs and break work into agent-sized tasks, 2) Direct with precision — give agents well-scoped tasks, 3) Review rigorously — evaluate output like a human PR, 4) Test relentlessly — the single biggest differentiator from vibe coding, 5) Own the system — maintain docs, version control, CI, and production monitoring.
Do I need to be a developer to practice agentic engineering?
No. The principles apply to anyone orchestrating AI agents. On Taskade Genesis, non-developers practice agentic engineering by writing detailed prompts (planning), reviewing generated apps (oversight), iterating on designs (testing), and deploying AI agents for ongoing improvement. 63% of AI-assisted builders are non-developers.
What is the Model Context Protocol (MCP)?
MCP is an open standard created by Anthropic in November 2024 for connecting AI models to external tools and data sources. Think of it as USB-C for AI agents — a universal connector. It was donated to the Linux Foundation's Agentic AI Foundation in December 2025 and adopted by OpenAI, Google, Microsoft, and dozens of others.
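For a concrete sense of the "universal connector," here is what a single MCP tool invocation looks like on the wire — an ordinary JSON-RPC 2.0 request. The `tools/call` method with `name` and `arguments` params follows the MCP specification; the `get_weather` tool and its `city` argument are hypothetical examples, not a real server's API:

```python
import json

# JSON-RPC 2.0 envelope for an MCP tool call. The tool name and
# arguments ("get_weather", "city") are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"city": "Tokyo"},
    },
}

wire = json.dumps(request)
print(wire)
```

Because every MCP client and server speaks this same envelope, an agent built against one server can call tools on any other — which is the whole point of the standard.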
What are the best agentic engineering tools?
By category: Taskade Genesis for non-developers (free tier, Pro $16/mo for 10 users). CrewAI for role-based business automation (open-source). LangGraph for enterprise orchestration. Cursor ($20/mo) and Devin 2.0 ($20/mo) for professional coding. Claude Code for terminal-based workflows. See our full agentic engineering tools comparison.
What did Gartner predict about agentic AI?
Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from less than 5% in 2025. By 2028, 33% of enterprise software will include agentic AI. However, they also predict over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.
What is Karpathy's autoresearch project?
Autoresearch is a 630-line Python tool released by Karpathy on March 7, 2026. It gives an AI agent an LLM training setup and lets it experiment autonomously — approximately 12 experiments per hour, 100 overnight. It demonstrates agentic engineering: human sets the goal and metric, agent executes autonomously, results are objectively measurable.
How does Taskade Genesis implement agentic engineering?
Taskade Genesis implements agentic engineering through Workspace DNA — Memory (projects as databases), Intelligence (AI agents with 22+ tools and persistent memory), and Execution (automations with 100+ integrations). Users orchestrate these components to build, deploy, and maintain living software — exactly the pattern Karpathy describes.
What is the middle loop in agentic engineering?
The middle loop is supervisory work between writing code (inner loop) and delivery operations (outer loop). It involves directing AI agents, evaluating their output, calibrating trust, and maintaining architectural coherence. Senior engineering leaders identified it as the most important emerging skill category for the AI era.
Is agentic engineering a fad or a lasting shift?
Agentic engineering represents a permanent shift. The $4.7B vibe coding market growing at 38% CAGR, Gartner's 40% enterprise adoption forecast, the Linux Foundation's AAIF, and MCP becoming the universal standard all point to structural change. The discipline of orchestrating agents becomes more valuable as AI becomes more capable, not less.
What is cognitive debt?
Cognitive debt is the gap between system complexity and human understanding — when AI-generated systems work but no human fully comprehends why. It is the agentic engineering equivalent of technical debt. Taskade Genesis reduces cognitive debt by keeping architecture visible (workspace structure), agents transparent (inspectable instructions), and history preserved.
How does agentic engineering connect to the Garry Tan SaaS debate?
Y Combinator CEO Garry Tan predicted non-technical teams would vibe-code custom solutions instead of buying SaaS, naming Taskade among the disruptors. Agentic engineering elevates this: teams will orchestrate AI agents to build, deploy, and maintain living software that replaces over-bundled SaaS. See: Vibe Coding vs No-Code vs Low-Code
