On judgment, creation, and the part we're about to skip
I'm staring at three versions of a coordination system — software to manage a dozen AI coding agents running in parallel, routing work, handling failures. The first draft was elegant but the overhead would slow everything down. The second was faster but brittle at the edges. The third threaded the needle. In retrospect, the obvious choice.
I didn't write any of it. But I spent an hour on those three options — reading, marking up weaknesses, tracing failure modes, requesting specific changes. The muscle memory came back: it's the same evaluative work I did reviewing essays for friends in college. Except I'm not critiquing prose anymore. I'm evaluating software architectures, all produced by models with full context about my goals and constraints.
The shift was quiet. I spend most of my working hours as a judge now. The agents produce; I evaluate. They draft; I accept, reject, revise.
What remains is judgment.
Two things usually travel together under "creative work." Separate them:
Ideation: the Vision. What the thing should be. The internal model of excellence for whatever you're making. I know what good looks like; I'm steering toward it.
Implementation: the mechanical process of instantiating that vision. Typing the code, drafting the sentences, placing the pixels.
For some kinds of work, the two domains genuinely come apart: a coordination system either routes work correctly or it doesn't, and the architecture I steered toward existed in my head as a target before any line was typed. For other kinds of work, the separation is a lie I am telling on the page. The novelist who has spent four years on a book is not implementing a pre-existing vision; she is finding out what the book is by writing the sentence that surprises her. The vision is constituted in the prose, and there is no spec that, fed to a competent executor, would have produced it. There are plausible abstractions: voice, tone, style, high level outline, but that itself then becomes the minimal implementation. The painter is doing the same thing with paint. The mathematician, sometimes, is doing it with proofs.
So the strongest version of the objection to what follows isn't "craft is sacred." It's: you have defined creation by the artifact, and for some practices the artifact is not the goal — it is the trace of an inquiry that exists nowhere else but in the medium of its making. That objection is partially correct, but it misses the point that the content of the artifacts can be expressed and captured in mediums accessible to thinking entities in a far more condensed form than the artifacts themselves, independent of the artifact.
Judgment is now functionally sufficient for creation — where the vision really can precede the artifact, and where the friction of implementation was, in fact, mostly the tax you paid to express it.
That's still most of what I do — and most of what most knowledge workers do. And in that domain, something has actually changed.
LLMs haven't eliminated creation. They've made implementation optional.
You describe what you want, receive a version, share more ideas. I do this for code (agent orchestrators, distributed systems, API designs), prose (blog posts, technical docs, grant proposals, research papers), and arguments (architectural decisions, security tradeoffs, design critiques). The pattern is the same. The evaluative capacity — holding a standard and steering toward it — is now enough to get the thing made.
Being a competent judge is now functionally sufficient to produce the artifact, for a large and growing class of work. The gap between recognizing quality and producing quality has narrowed to where, in those domains, the distinction is losing operational meaning.
The obvious objection: this has always been true for those with resources. A wealthy patron in 1850 could commission a symphony without knowing how to compose one. Nothing new here.
But that framing mistakes what changed. The 1850 patron still needed a composer between them and the artifact — someone whose judgment filled in everything their taste couldn't specify. "Something like Beethoven's 7th but more melancholy, shorter, resolving in D major" is still a description, not a symphony. The gap between what you can appreciate and what you can specify was real, and it required another person to close it.
What has changed is not that this gap has gradually narrowed. Natural language descriptions of what you want and the artifacts themselves — code, prose, arguments — now travel through a single pipeline that can move you between them directly. That is not an incremental improvement in tools. It is a phase change for the class of work where the pipeline holds: two previously separate domains — the intended and the made — have become continuous enough that the passage between them no longer requires a human intermediary.
One way to phrase what we've built: a map between natural language and the underlying substrate, for an awful lot of things. Natural language covers almost everything we care about (the hard problem of consciousness is a notable exception, to some extent). Where the map is good, what you can say is what you can make.
The practical consequence: for a large and growing class of digital work, you can now move from "I know what I want when I see it" to "here is the thing." Not perfectly. Not every time. But regularly, consistently, with a reliability that makes the remaining gap a matter of degree rather than kind. The patron of 1850 could not do this. The difference is not one of scale or resources. It is categorical.
Most activities are autohyponyms; when we refer to the skill of "creative writing" or "computer science" or "philosophy," we mean something more specific than the activity of doing it. The skill you develop in a writing workshop isn't typing sentences — it's recognizing when they work. You learn to write by learning to judge writing. You learn to code by recognizing good code.
This sounds wrong. Practice, practice, practice, as they say. But consider the mechanics of practice. Most of what you produce early on is either blatantly broken or competent pastiche. The interesting work — the work worth dwelling on — is your flawed draft, complex enough to be worth fixing. You revise what you wrote yesterday. You refactor the function you shipped last week. The craft develops through evaluation as much as through initial production.1
I spent years in philosophy seminars. The professors weren't teaching me to type academic prose — they were training my taste for arguments. What's a strong objection? When does an analogy illuminate versus obscure? Where's the undefended assumption? The syntax and citation formats were just the price of entry. They were obstacles you had to implement yourself because there was no other way to get the thing made.
Now that price has dropped to near zero. The skill that remains — the critical eye, the taste, the judgment — turns out to have been the point, for the kinds of work this argument is about.
Yesterday I spent four hours working with models on a technical design document. The workflow: I describe the architecture I'm imagining. It generates a draft. I mark it up — this section's too vague, this example doesn't land, you're not addressing the obvious objection. It revises. I push harder. It surfaces an alternative I hadn't considered. We go back and forth.
It is adversarial and sculptural at once. I'm fighting to align the output with my Vision. But it's also raw material that talks back, offers grain and resistance, reveals surfaces I didn't know were there.
Karl Popper argued knowledge doesn't progress through acts of pure creation — it progresses through criticism. Imre Lakatos went even further in math. You conjecture, test, refute the parts that fail, iterate. The refutation is the step forward.
Popper described communities of researchers, because conjecture and refutation were expensive enough to require a division of labor. No single mind could generate fast enough to keep the critical faculty fully occupied. When the cost of conjecture approaches zero, the structure collapses into a single person. I conjecture through the (suite of) LLMs, refute through my own judgment, and the artifact converges on something better than either of us would produce alone.
The throughput shift is real. I can produce in hours what used to take months. The bottleneck isn't generation anymore — it's my capacity as critic. And the shift is not only speed. When implementation gets cheap, what is worth attempting changes too: work I would have passed on — not because it wasn't valuable, but because it wasn't valuable enough to justify what it would have cost — becomes reachable. I work on things that would not have made the cut at my previous bandwidth.
Popper was describing argument, which has a target outside itself: the thing being argued about. The dialogue maps cleanly onto argument and onto a lot of design work. It maps less cleanly onto the kind of writing where the draft is the question. I am not claiming the sparring partner replaces every relationship a person can have with a medium. I am claiming it does real work in the kinds of work it does — and the kinds of work it does keep widening.
What I've cut out, in those domains, isn't creation. It's the mechanical middle — the substrate. The nuances of rhetoric for ideas. The subtleties of syntax for function.
Richard Sennett's The Craftsman argues that thinking happens through making — that the struggle with materials, constraints, failure is what develops judgment in the first place. He traces how Renaissance goldsmiths learned through material resistance. The gold doesn't behave the way you expect; it teaches you something about grain and stress and form that no amount of observation could convey. The hand teaches the mind.2
Michael Polanyi's stronger claim is structural, not anecdotal: explicit knowledge is logically dependent on a tacit substrate that cannot be fully made explicit even in principle. The bicycle example is the textbook gloss, but the underlying argument is that articulated knowing always rests on knowing-how that the knower cannot fully say. You don't just learn balance by riding; the riding is the only place that knowledge exists.3
Matthew Crawford extends this into a moral argument in Shop Class as Soulcraft. Manual work has cognitive depth that gets lost when we optimize it away. The motorcycle mechanic develops a feel for how systems fail — an intuition built through hundreds of encounters with friction and failure — and that intuition is constitutive of the work being worth doing, not just useful for getting the job done.4
The cartoon version of the objection — "some skills require practice" — is easy to dispatch and isn't actually what these three argue.
Their argument: judgment is not a stand-alone faculty that, once trained, can be pointed at outputs from any source. It is a relationship to a medium, built by being changed by that medium's resistance. The apprentice who has cut a thousand dovetails doesn't have shallow taste compared to the master; he has taste pointed at the right variables. He evaluates joint geometry because the wood taught him joints fail in year fifteen. He evaluates proportion and finish too, but the proportion-and-finish layer is the part a generative model can fake; the joint-geometry layer is the part it can't, and the layer you only see if you've been there. Judgment without making isn't a thinner version of judgment with making. The craft tradition is right about that, in physical domains.
In digital work, the picture divides. There are two axes. On the rigor axis — does the code run? do the tests pass? is the proof valid? — the visible surface is the test, and judgment does exactly that: verification is external to the making, and a competent evaluator can supply it without having written the code. On the significance axis — is it well-written? elegant? beautiful beyond merely working? — the picture is murkier. A proof can be valid and inelegant; code can pass every test and still be unmaintainable. Whether taste for significance develops through evaluation the same way it develops through making is the craft tradition's live claim, and it is where I am betting rather than knowing.
The bet against this objection is inductive. The history of "only humans can do this, because the relevant intuition is embodied and resists formalization" is a sequence of confident claims followed by revisions — chess, then go, then perception, then natural language. In each case, the pattern that defeated the holdouts wasn't a new theory of intuition. It was scale and exposure inside an environment where the system could get enough signal to learn what counted as good.
The pattern has a name. Rich Sutton, one of the founders of reinforcement learning, called it the bitter lesson: general methods leveraging computation eventually beat methods that bake in human-structured knowledge. The lesson is bitter because it keeps proving the same thing — the structure we thought was irreducible wasn't.
But the bitter lesson is not a structural proof, just like the craft counterargument. They are inductive observations. The conditions the bitter lesson observed are specific: a defined objective, abundant compute, and the ability to generate or collect data at the scale the system needs. Chess had all three. So did ImageNet. So, eventually, did language. Open-world physical craft does not have them yet, not cleanly, and many researchers will tell you that the bitter lesson hasn't yet transferred to environments where reward is sparse and the world won't sit still long enough to be sampled. We keep observing the conditions get met, in domain after domain. I've made a personal rule: never declare a class of intelligent work permanently beyond the reach of artificial systems.
This doesn't refute Crawford's objection. It's a delay, and a delay whose length I do not know. In domains where the conditions are met, the gap between recognizing quality and producing it has narrowed enough that judgment now suffices to create. In domains where the conditions aren't met — where the medium resists, where reward is sparse, where the work is constituted by the inquiry that produces it — the craft objection still cuts, and may keep cutting for years or decades. But every day the boundary is being redrawn rather than dissolved — outward, fast. And the people who keep their hands on materials they care about are not making a mistake. They are, depending on the domain, either preserving the only path to the judgment that domain rewards, or hedging against a timeline that is much less certain than anyone knows.
It could very well prove that the people who only ever judged outputs from systems will find that their taste was pointed at the wrong layer — the visible one, the one the systems were best at faking. That's not an empirical experiment I can wait five years to read out, because by the time the data lands the cohort that learned only through evaluation has already learned what it's going to learn, while my own taste corrodes and rusts, and the muscle memory I have from years of writing without help isn't transferable to them. They will get whatever depth their workflows actually build. I don't think it will be zero. And it has advantages, too; the throughput is unbelievable — it might compensate. In any case, it will differ from what comes from the productive struggle, in ways the productive-struggle tradition has been pointing at for a long time.
I'm not going to pretend this is fine. It is the cost of the move I am making, and the move I am still making.
This all sounds plausible for code and prose. What about a chair?
The furniture maker works in materials. Wood has grain and resistance; joinery requires a feel for tolerance; finish responds to humidity in ways that resist specification. This is Crawford and Sennett's home ground — embodied knowledge that appears immune to the kind of automation I've been describing.
CNC mills, robotic assembly, and automated finishing have eaten significant portions of furniture production — the side of the trade that turns dimensioned stock into IKEA-grade product. They have not eaten furniture-making, the practice where a person commissions a piece, the maker confronts the particular board, and the design moves in response to what the wood actually does. CNC is not AI; for now it runs G-code a person wrote, on stock a person dimensioned. The judgment calls that arise mid-process — this board cupped, that knot is closer to the joint than the drawing showed, the humidity tonight is wrong for glue-up — are not edge cases. They are the substance of the work. Two-handed dexterity in unstructured material is still a research problem in 2026, not a product.
The trajectory is real, but it is slower in physical craft than the digital experience makes it feel, and the part being automated is not the part the craft tradition was talking about. The part the tradition was talking about — the dialogue with the particular piece of material, the judgment that lives in the maker's hands — is exactly the part that has resisted longest. If you are a young person deciding whether to learn to cut joints (metaphorically or literally), I am not telling you that decision is obviously wrong. I am telling you that the equivalent decision in natural and machine language has been overtaken, and that the trajectory points the same way more slowly in yours. But that doesn't mean you benefit from developing the taste.
Anthropic CEO Dario Amodei said publicly: "Claude is playing this very active role in designing the next Claude... the vast majority of future Claude code is being written by the large language model itself." What's being claimed is that LLMs write a large share of the code that builds the next model. The research taste — what to train on, what loss to use, when to kill a run — is still overwhelmingly human-driven (though shrinking). The pattern I'm describing is exactly that asymmetry: production is increasingly LLM-driven, judgment about what to produce is not, and the leverage of a single tasteful researcher has gone up commensurately.
And the cadence is the tell. GPT-5.5 shipped April 23; Opus 4.8 and Fable 5 followed that spring — Fable suspended for foreign nationals on June 13 by U.S. export-control order, effectively banned for being too capable to freely ship. More frontier shifts than anyone bothered to count.
Andrej Karpathy calls the limit of this trajectory autoresearch: an LLM in a feedback loop, improving its own performance on some validator, fully autonomously. That framing undersells the endpoint. Autoresearch is a step toward recursive self-improvement — the regime where AI systems build more capable AI systems without human taste in the loop at all. When that closes, the argument of this essay becomes a footnote, the way arguments for the indispensability of the telegraph operator became footnotes after 1920. We are not there. The models don't need to surpass human researchers one-to-one to get there — they move faster and scale wider, which compensates across every dimension that matters at the margin. We are in the regime before it: humans as judges, models as producers, with the producer share rising and the judgment share concentrating into fewer, more leveraged hands. That regime is enough for the argument of this essay.
This is what frontier AI development looks like. The creator/judge asymmetry hasn't just appeared in individual workflows. It has appeared at the most technically demanding creative work humans currently do.
P vs NP is one of the most important open questions, full stop. Not of computer science — of any field. It asks: if I can easily check a solution to a puzzle, is that puzzle easy to solve?
In 2006, Scott Aaronson argued that P probably doesn't equal NP on philosophical grounds. If it did, he wrote, "the world would be a profoundly different place than we usually assume it to be. There would be no special value in 'creative leaps,' no fundamental gap between solving a problem and recognizing the solution once it's found."
He gave examples: "Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett."
This was meant as a reductio ad absurdum. The conclusion is absurd, therefore P≠NP. Aaronson's reasons for believing P≠NP are not just philosophical but structural, grounded in decades of complexity theory. Nothing I've described here is evidence against that conjecture.
Here's what I think has actually happened. We haven't solved P versus NP. We have, instead, supplemented human intelligence with systems that compress an enormous amount of human knowledge into something a single person can move through — and the practical distribution of creative problems humans care about turns out to be far more compressible than the worst case. That is a fact about culture as much as about computation. LLMs don't efficiently search worst-case solution spaces. They make the typical distance between recognizing quality and producing it, in certain domains, something a single mind can cross.
It's lossy. It's domain-dependent. And it breaks exactly where you'd expect — in the places where the search space resists compression, where the solution can't be found because nothing like it has been produced before.
None of this touches the conjecture. But the reductio lands differently than it did in 2006. Aaronson described a world where appreciation was sufficient for creation and called it absurd. We are moving toward that world unevenly, not by proving anything about complexity classes but by supplementing our intelligence until the practical distance shrinks. The absurd world is becoming, in patches, the actual one.
Everyone who can appreciate a symphony isn't Mozart. But the distance between the two has, in the domains where this works, become a distance a person can walk. And the question worth sitting with is whether implementation, in those domains, was ever the point — or whether it was the tax we paid to exercise judgment, mistaken for the thing itself because the bill came due so often.
For the domains where the answer is no — where the writing is the thinking, where the hand teaches the mind — the tax was the work, and a generation that doesn't pay it will be doing something else than what the previous generation called creation. I am betting that's a smaller set than the tradition believes. I am not betting it's empty.
For a deeper dive into how feedback shapes learning, see Anders Ericsson and Robert Pool's Peak (2016) on deliberate practice. The neuroscience of generative replay during memory consolidation — see van de Ven, Siegelmann, and Tolias (2020) on brain-inspired replay in continual learning, and the broader Wilson/McNaughton hippocampal-replay literature — suggests revision-heavy learning paths may be neurologically nontrivial, though "neurologically equivalent to production-heavy paths" is a much stronger claim than the current evidence supports.
↩Sennett, The Craftsman (2008), particularly the sections on the medieval goldsmith and the modern glassblower. The argument extends beyond manual crafts — he applies it to programmers, doctors, and musicians.
↩Polanyi, The Tacit Dimension (1966). The bicycle example is the famous one, but the structural claim — "we know more than we can tell," and explicit knowledge logically depends on a tacit substrate that cannot be fully articulated — is the load-bearing version. This is the hardest version of the craft objection to the essay's central claim: if the Vision itself is constituted partly by tacit practice, then "judgment" as I've described it may not be the separable faculty the argument requires. Also worth reading: Andy Clark's Supersizing the Mind (2008) on the extended mind thesis. If minds extend to tools, and tools now produce, where is the boundary between thinking and making?
↩Crawford, Shop Class as Soulcraft (2009). Also see Hubert Dreyfus's five-stage skill acquisition model in What Computers Still Can't Do (1992) — the model argues that expert-level performance emerges through embodied practice that cannot be shortcut: novices work from rules, experts operate from holistic situational recognition built through thousands of material encounters. The question this essay does not fully answer is whether the evaluation-heavy workflows I'm describing can build the same kind of holistic recognition, or whether they produce a different, shallower form of expertise pointed at the visible surface of outputs.
↩