Essays on Man-You

プログラマ道 (puroguramā-dō)

Wed, 06 May 2026 22:30:00 +0200

プログラマ道¹.

Companion piece to the 100x post. That one was the framing argument: AI tooling is +0.99 with a wielding bonus, not a multiplier. This is the ground-level version. How I actually drive the thing day to day, what I will not delegate, and why none of this is new.

I drive Claude every day. Across Woosmap services in Python, a music app in Go and SwiftUI on the weekends (Tunes), a SNES toolchain in Python (a816, xdds), and a fair amount of 65c816 ROM hacking. The tool is real. It does not flip the table on the discipline an engineer needs. It raises the floor a notch and rewards the wielder. What the wielder is actually doing, when it works, is the part nobody writes down.

What blindness looks like

Drop Claude into a fresh codebase and watch it grep. It reads filenames, opens files, follows imports by string match, guesses at module boundaries. On a small repo this is fine. On anything real it degrades into expensive guessing. Python projects make it worse: the source of truth for a symbol’s type is often the installed package, not the repo, and the model cannot see installed packages. So it produces plausible code against an imagined API.

The fix is not a better prompt. The fix is to give the model the same thing a human gets: a language server. Once basedpyright is wired up and Claude can ask “what is this symbol, where is it defined, what type does it return”, the questions get sharper and the answers stop being invented. The model does not need to be smarter. It needs to stop being blind.

Our internal stack runs in containers, no local virtualenv, dependencies live inside the image. A model on the host sees nothing. The fix we use internally is a sidecar container that exposes a language server with access to the real installed packages, attached to the application container. Once that is in place, Claude stops hallucinating signatures. Same model, same prompts, dramatically less drift. The intelligence was never the bottleneck. The view was.
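
To make "ask the language server" concrete: the question travels as a JSON-RPC message framed with a Content-Length header, per the Language Server Protocol base protocol. A minimal sketch of framing one hover request follows. The file URI and cursor position are made-up placeholders, and nothing here shows how basedpyright or the sidecar is actually launched or wired in.

```python
import json


def frame_lsp_message(payload: dict) -> bytes:
    """Encode one JSON-RPC message with the LSP Content-Length header."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def hover_request(request_id: int, uri: str, line: int, character: int) -> dict:
    """Build a textDocument/hover request (line and character are zero-based)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/hover",
        "params": {
            "textDocument": {"uri": uri},
            "position": {"line": line, "character": character},
        },
    }


# Hypothetical file and cursor position, for illustration only.
message = frame_lsp_message(
    hover_request(1, "file:///app/service/handlers.py", line=41, character=17)
)
```

A real session starts with an initialize handshake before the server answers anything. The point is only that the question is structured, and the answer comes from the installed packages rather than from a guess.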

The testbed problem, named on a SNES

ROM hacking is notoriously trial-and-error². The CPU does not care about your intent. The PPU cares even less. A bug in HDMA timing is invisible until you stare at the right cycle on the right scanline, and the only way to know you fixed it is to watch the framebuffer change. There is no type system saving you. There is no test framework that ships in the box.

So I built one. kintsuki (yes, typo of kintsugi, name stuck) embeds ares (a SNES emulator) as a C library, exposes Python bindings on top, and gives me programmatic control of the emulator: step execution, trace, read memory, write memory, hook the per-frame interrupt, dump CPU and video memory, diff framebuffers. From there I write pytest cases that drive the ROM to a known state, assert on the bytes that should have changed, and fail loudly when they did not. Regression testing for ROM hacks. With kintsuki in the loop, Claude can iterate. It edits the asm, pytest runs the ROM, the snapshot comes back, the test says green or red.
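
The shape of such a test, sketched with a toy stand-in. ToyEmulator and its methods are hypothetical names invented for this example. They are not kintsuki's API, and the "ROM" is a single counter byte. What carries over is the pattern: drive to a known state, run, assert on the bytes.

```python
class ToyEmulator:
    """Hypothetical stand-in for an emulator binding; not kintsuki's API."""

    def __init__(self, memory_size: int = 0x100) -> None:
        self.memory = bytearray(memory_size)
        self.frame = 0

    def write_memory(self, address: int, data: bytes) -> None:
        self.memory[address:address + len(data)] = data

    def read_memory(self, address: int, length: int) -> bytes:
        return bytes(self.memory[address:address + length])

    def run_frames(self, count: int) -> None:
        # Toy "ROM": each frame increments a counter byte at 0x10.
        for _ in range(count):
            self.memory[0x10] = (self.memory[0x10] + 1) & 0xFF
            self.frame += 1


def test_counter_advances_once_per_frame() -> None:
    emu = ToyEmulator()
    emu.write_memory(0x10, b"\x05")              # drive to a known state
    emu.run_frames(3)                            # let the "ROM" run
    assert emu.read_memory(0x10, 1) == b"\x08"   # the byte that should change
    assert emu.read_memory(0x11, 1) == b"\x00"   # its neighbour stays untouched
```

Run under pytest, the function name is enough for discovery. Green or red is decided by memory contents, not by how the patch reads.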

This is the pattern in general. When the model gets stuck in a loop, almost always it is not a reasoning failure. It is a feedback failure. Build the testbed before you blame the prompt. A pytest case that reproduces the bug. A shell one-liner that exercises the endpoint. A snapshot test that goes red on the broken behavior. Whatever shape it takes in your domain. Once it exists, the model converges. Same iteration loop a competent human uses. Without it, you get hallucinated success.
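
A snapshot test can be as small as a byte diff that names the offsets that moved. The sketch below is a generic helper, not the kintsuki implementation. The function names and the report format are assumptions for illustration.

```python
from itertools import zip_longest


def diff_bytes(expected: bytes, actual: bytes) -> list:
    """Return (offset, expected_byte, actual_byte) for every mismatch.

    A byte missing on one side (length mismatch) is reported as None.
    """
    return [
        (offset, want, got)
        for offset, (want, got) in enumerate(zip_longest(expected, actual))
        if want != got
    ]


def assert_snapshot(expected: bytes, actual: bytes, limit: int = 8) -> None:
    """Fail loudly, naming the first few offsets that differ."""
    mismatches = diff_bytes(expected, actual)
    if mismatches:
        shown = ", ".join(
            f"0x{offset:04X}: want {want!r} got {got!r}"
            for offset, want, got in mismatches[:limit]
        )
        raise AssertionError(f"{len(mismatches)} byte(s) differ ({shown})")
```

The failure message is the feedback. An offset and two values give the model something to converge on. "It looks wrong" does not.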

For the SNES work I started with Mesen Lua scripts, which is the standard answer in the romhacking community. Useful for one-off probing, painful for regression testing. Lua is interpreted inside the emulator, the test harness lived outside the emulator, and the seam between them was where bugs hid. kintsuki replaced that whole arrangement with the emulator itself as a library called from pytest. One process, one language, one trace of execution. The Mesen scripts taught me what to want. Building kintsuki was admitting that the standard tool had hit its ceiling.

TDD plays well with the model for the same reason. The red test is the bar. “Done” is defined externally instead of by whatever sounds done.

Time to serve

The number I find useful for measuring whether the tool is moving anything is time to serve³. From the moment a problem is articulated to the moment the fix is in production, end to end. Not lines of code, not commits per day, not tokens consumed.

It captures the whole pipeline in one denominator. The parts the model speeds up (typing, boilerplate, first cuts) and the parts it does not (deciding what to build, the data model, the testbed, the diff review, CI, production). If the tool is genuinely net positive, the number drops. If the model is producing more code while the rest of the pipeline absorbs the cost, the number stays flat. The metric does not flatter the tool. It measures the wielder using the tool.
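
As arithmetic, the metric is two timestamps and a subtraction. A minimal sketch, with hypothetical records and the median as one possible summary (the choice of median over mean is my assumption here, not part of the definition):

```python
from datetime import datetime
from statistics import median


def time_to_serve_hours(articulated: datetime, in_production: datetime) -> float:
    """Hours from 'problem articulated' to 'fix in production'."""
    if in_production < articulated:
        raise ValueError("fix cannot reach production before the problem is stated")
    return (in_production - articulated).total_seconds() / 3600


# Hypothetical records: (problem articulated, fix in production).
records = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 15, 0)),
    (datetime(2026, 4, 2, 10, 0), datetime(2026, 4, 4, 10, 0)),
    (datetime(2026, 4, 6, 14, 0), datetime(2026, 4, 7, 2, 0)),
]
durations = [time_to_serve_hours(start, end) for start, end in records]
median_hours = median(durations)
```

Everything between the two timestamps counts, including the review and the CI queue. That is why more generated code does not move the number on its own.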

Worth pairing with the token economics piece from the public post. Time to serve is the team metric today. Cost per shipped feature is the same shape once tokens stop being subsidized.

What I do not delegate

This is the 道 part⁴. The places where I refuse to hand the wheel over, regardless of how confident the model sounds.

Architecture choices the model makes silently. I have shipped diffs where the model swapped a lifecycle (background task vs request-bound, lazy vs eager init, sync vs async at a boundary) without flagging the consequence. The change was not wrong-looking. It was just a different decision than the one I asked for, embedded in code that read as ordinary. That kind of thing slips past self-review. Human PR review by someone outside the loop with the model is still load-bearing.

Platform debugging. The general-purpose memory showed the rolling inventory animating correctly while the video memory stayed wrong: the data was right, the upload to the screen was not. The hypothesis (the per-scanline DMA was pointing at the wrong background layer during the only window each frame when the picture chip lets you write to it) was not in any prompt. The model coded the patch once I gave it the algorithm. It could not have formed the algorithm from a screenshot.

Data model and naming. The model defaults to plausible names that collide in app code, or to over-typing every parameter, or to stringly-typing everything. The boundary calls are mine. So is deciding when a function should not exist at all.

The model writes code. Choosing what should exist, what it is called, what invariants it preserves, that is still the job.

Getting outsmarted

The way the model wins against me, when it does, is not by being cleverer. It is by me losing track of what got built. Five tool calls deep, a refactor I did not quite ask for slipped in alongside the fix I did ask for, tests pass, diff looks reasonable. Sign off and the surprise lands a week later.

Countermove is the same rule I use on myself. Faut faire qui marche avant que c’est beau: make it work before you make it pretty. Works for the model too. Let it go from A to B with the ugly first cut. Don’t pre-optimize the route, don’t stop it for a paint job halfway through. Then read the diff with eyes on. If on the road I see something starting to jeopardize the destination, that is the signal: what I asked for was not clear enough. The drift is feedback on my prompt, not on the model.

The dog metaphor is the cleanest. Go fetch. The dog will fetch. If you said go fetch and pointed vaguely, the dog comes back with a stick when you wanted the ball. Not the dog’s fault. Throw better, or accept whatever comes back.

Discipline is two things. Keep the destination visible to yourself the whole time, so you notice when the path bends. And reread the diff. The model is fast enough that the only person who can lose track of what got built is you.

Ask Claude to review Claude

I regularly ask the model to look at its own work with fresh eyes. It catches dead branches, redundant guards, mocks that should be fixtures, signatures that drifted from the call sites.

No costume preamble, no “you are now a senior reviewer” framing. I am not a live-action role-play guy. Played RPGs as video games, plenty, but I do not run my prompts like sessions at a tabletop. Why would I. In the same week I am a programmer, a CTO, a romhacker, a JS/TS dev, a reviewer, an ops guy. The hat changes with the task, no costume needed. Claude is the same. Ask it to look again, it does.

The act of generating and the act of judging are not the same operation, and forcing the second pass cheaply pulls out the obvious mistakes before they reach a human reviewer. Not a substitute for that reviewer. A filter that respects their time.

Same workflow as with a coworker

Working with the model is not that different from working with a teammate. I look for friction, usually starting with my own pain, and I build something to smooth it. My job has been building tools for developers for a long time, and it turns out Claude hits the same walls a human dev hits: bad imports, missing types, no testbed, no probe into the running system. Solve those for the human and the model gets the fix for free.

The LSP sidecar from earlier is the cleanest example. Built it because I was losing time to the model not seeing installed packages, but the same sidecar makes any human on the team more productive in the same repos. kintsuki is the same shape: deterministic snapshots so I could reason about HDMA, and the model uses them too.

You cannot build high on crappy foundations. That has been true with humans for decades and the model does not change it. If anything it makes the foundations more visible, because the model is brutal at exposing the soft spots in your dev loop.

Hard position to defend in rooms where AI is supposed to change the world. I see it as a tool. If typing on a keyboard were the programmer’s job we would already have been replaced by typists, who produce far more words per minute than most of us. We were not. The keyboard was never the bottleneck. It is not the bottleneck now either.

The 道 has been written down for thirty years

The instincts in this post are not new. Two books in particular keep mapping onto what I do with the model.

Growing Object-Oriented Software, Guided by Tests⁵ is the testbed argument with the receipts. Freeman and Pryce wrote it for human teams trying to keep design honest as a system grows. Same instincts hold when the author is a model: start from a failing test, let the test shape the interface, refactor under green. Tests as the design surface, not just the safety net.

The Pragmatic Programmer⁶ is the other. Hunt and Thomas have a chapter called Programming by Coincidence that names the failure mode I push back on hardest with the model: code that works without the author understanding why. Their tracer bullets idea is the same instinct as faut faire qui marche avant que c’est beau, fire something end to end first, see where it lands, adjust aim.

Both books predate the model by a decade or more and apply to it without modification. The 道 was written down a long time ago. The model is the newest student in the dojo, not the founder of a new school. If a tool reframes the discipline so completely that the canon stops applying, the canon was wrong. So far, the canon is fine.

What I tell people who ask

When the model and I disagree, I am right more often than not, and the times I am wrong it is because I delegated something I should have owned. That is the calibration. Not trust the model, not distrust the model. Trust your own ability to recognize when the output is wrong, and treat the model’s confidence as decoration.

プログラマ道¹ is the same job it has always been. The tools changed. The discipline did not.

¹ Programmer plus 道, the way or path. Same suffix as 柔道 (jūdō, the gentle way), 剣道 (kendō, the way of the sword), 茶道 (sadō, the way of tea). Practice you keep refining, not a credential you finish.
² SNES vocabulary used in this section, glossed once: CPU is the 65c816 main processor. PPU is the picture processing unit, a fixed-function chip that composes the picture from sprites and tile layers (not a GPU in the modern sense, no shaders, no general compute). WRAM is general-purpose memory the CPU writes to. VRAM is the PPU’s separate memory, only writable through narrow windows each frame. HDMA is a per-scanline DMA mechanism that lets you change PPU registers during display. NMI is the non-maskable interrupt that fires once per frame, the standard window for VRAM updates. BG3 is one of the four background layers the PPU composes. The point of the article does not depend on the details, but the jargon is real and so are the bugs it produces.
³ Not sure who coined “time to serve”. Closest canonical sibling is DORA’s lead time for changes (Forsgren, Humble, Kim, Accelerate). I have been using time to serve internally without a clean attribution. If you know the source, tell me and I will update this footnote.
⁴ 道 on its own. The way. What stays yours after the tooling moves under your feet.
⁵ Steve Freeman and Nat Pryce, Growing Object-Oriented Software, Guided by Tests, Addison-Wesley 2009. The “guided by tests” half is the part that ages best. Tests as the design surface, not just the safety net.
⁶ Andy Hunt and Dave Thomas, The Pragmatic Programmer, Addison-Wesley 1999, 20th anniversary edition 2019. Programming by Coincidence and Tracer Bullets are the two chapters that map cleanest onto working with an LLM.

The 100x engineer, and the unit nobody defines

Mon, 04 May 2026 01:03:42 +0200

The 100x engineer is back in the discourse. This time the pitch is that an engineer with the right AI tooling can do the work of a hundred. Sometimes the number is 10x, sometimes it is 100x, occasionally someone goes wild and writes 1000x. The number is always round and the unit is always missing.

What are we counting?

If a 100x engineer using Claude can do 365 days of work in 3.65 days, whose 365 days? A staff engineer shipping platform infrastructure? A junior wiring up CRUD endpoints? My grandmother on a good day? The denominator is doing all the work in that sentence and nobody ever writes it down.

The RPG item problem

There is a cleaner way to frame this. The “100x engineer” pitch sells AI tooling as a multiplier, the way RPG loot sells itself as *100 to all stats. The math of a multiplier compounds the floor: if your stats are 0.01, a *100 item brings you to 1. If your stats are 8, the same item flings you to 800. The marketing wants you to imagine the second case. The buyer is usually the first.

What AI tooling actually behaves like is a flat-add item. Roughly +0.99 to all stats. The kid at 0.01 puts it on and lands exactly at 1.00. The senior at 8 puts it on and lands at 8.99. Same buff, applied to whoever wears it.
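
The two items side by side, as arithmetic. A small sketch using Decimal so the sums come out exact. The function names are mine, and the stat values are the ones from the paragraphs above.

```python
from decimal import Decimal


def multiplier_item(stat: Decimal, factor: int = 100) -> Decimal:
    """The marketing version: *100 to all stats."""
    return stat * factor


def flat_add_item(stat: Decimal, bonus: Decimal = Decimal("0.99")) -> Decimal:
    """The observed version: roughly +0.99 to all stats."""
    return stat + bonus


for base in (Decimal("0.01"), Decimal("8")):
    print(base, "multiplier:", multiplier_item(base), "flat add:", flat_add_item(base))
```

At a base of 0.01 the two items are indistinguishable, which is exactly why the 100x claim is easy to make from the floor. At a base of 8 they differ by a factor of almost ninety.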

Except a flat-add understates one thing: the senior extracts more from the same item. They know what to ask for, recognise when the output is wrong, slot the working bits into a system the kid does not have yet. The base buff is the same; what the wearer does with it scales with their level. Closer to +0.99 base, with a wielding bonus. Still not a multiplier on stats. The buff is on judgment, not on raw capability.

When someone tells me a terminal and Claude Code made them 100x, the question is not snark. What was the floor? If the AI item moved you from “could not ship a working CRUD app in a week” to “can ship one in an afternoon,” that is the +0.99 doing exactly what +0.99 does. Not a 100x engineer. A +0.99 wearer who started at 0.01.

The 100x story flatters more than “I was very junior and my tools got better.” Same story. Tells you nothing about the ceiling, only about the floor the speaker was sitting on.

A multiplier without a base is a vibe, not a measurement.

A concrete counter-example

Tunes is the music app and backend I have been building on weekends, a Go service plus SwiftUI clients on iOS, macOS, and tvOS, replacing iTunes Match for my own library. AI has been in the loop on almost every commit. I am close to a year of weekend development and the thing is not shipped.

If 100x were real, Tunes should have been done sometime around last February. It was not. The reason is not that I was idle. Building a real product is mostly the parts AI does not compress: deciding the data model, debugging a SwiftUI animation that only misbehaves on tvOS, understanding why the Go service drops connections on a specific HLS edge case, designing the play queue semantics, redesigning them after I lived with the first version for a month.

The model writes the function. I still have to know which function to ask for. Otherwise I get a bubble sort hidden inside a request handler that takes 45 seconds for no good reason, and the model will defend it with a straight face until I push back.
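
For scale, a sketch of why that matters. This is not the code from the handler in question, only a textbook bubble sort with a comparison counter next to the built-in sort. The function names are made up for the example.

```python
def bubble_sort(items: list) -> tuple:
    """Quadratic sort; returns (sorted copy, number of comparisons made)."""
    result = list(items)
    comparisons = 0
    for end in range(len(result) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
    return result, comparisons


def builtin_sort(items: list) -> list:
    """What to ask for instead: the built-in O(n log n) sort."""
    return sorted(items)
```

This variant always makes n(n-1)/2 comparisons, so a hundred thousand rows cost about five billion of them. Both functions return the same list, which is why the diff reads as ordinary.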

AI maggots will read this and tell me it is a SKILL.md issue, that the model would have known better with the right rules file pinned to the context. Sure. It is a skill issue. Just not the kind they mean. The wielding bonus shows up here.

The token economics nobody is pricing in

The unit cost of all this is currently subsidised. Frontier model access is sold below what it costs to serve, on top of training spend that has not been amortised. Investors are paying so we get to play with cheap intelligence. That ends. It always ends.

When the price of a token starts to reflect what a token actually costs, the productivity story gets re-litigated against a less generous backdrop. The engineer who burned three thousand tokens flailing at the bubble sort and the engineer who spotted it in twenty seconds are doing very different work. Today they pay roughly the same. Tomorrow they will not. The boring metrics will show up: cost per shipped feature, cost per resolved bug, cost per regression caught before production. Management likes impressive marketing metrics. Field reality is usually a bit different.
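
What the boring metric looks like once it is written down. A sketch only: the token counts, the price per million tokens, and the feature counts below are invented for illustration and do not come from any real invoice.

```python
from decimal import Decimal


def cost_per_shipped_feature(
    tokens_used: int, price_per_million: Decimal, features_shipped: int
) -> Decimal:
    """Total token spend divided by the number of features that reached production."""
    if features_shipped <= 0:
        raise ValueError("no shipped features: cost per feature is undefined")
    total = Decimal(tokens_used) / Decimal(1_000_000) * price_per_million
    return total / features_shipped


# Hypothetical month for two engineers shipping the same four features.
price = Decimal("10")  # assumed unsubsidised price per million tokens
flailing = cost_per_shipped_feature(40_000_000, price, 4)
focused = cost_per_shipped_feature(8_000_000, price, 4)
```

Same output, different bill. The denominator is shipped features, so tokens spent on work that never reaches production still count against it.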

The engineer in that scenario is in the same position as the buyer of a smart speaker whose vendor pivots, or the company shipping on top of an open source project that changes direction. Different floor, same shape: you built on something you do not own. The buyer floor and the vendor floor live in other posts. The engineer floor is here.

Real gain, yes. Not 100x, not 10x averaged across a real week. A +0.99 with a wielding bonus, applied to whoever wears the item. Sentence does not get retweeted, which is why it does not get shipped.

Anyone using the 100x number is selling something or measuring something silly. If they are selling, ask what the unit is and watch them squirm. If they are measuring, ask them to print the denominator on the next slide.

The complete machine, fading

Sun, 03 May 2026 23:14:38 +0200

My iPod 5.5g is almost old enough to drink alcohol in most countries that have a drinking age. I bought it used, swapped the battery and the disk for an SD card mod, and it still does the one thing it was sold to do: play music. No firmware update has ever taken a feature away. No company has decided the device is too old to deserve syncing. Apple, to their credit, still ships iTunes and Music with iPod sync on macOS. But even the day they stop, the iPod will keep playing whatever is already on the disk. The original purpose has not eroded.

That is becoming rare.

Devices die from the network now

The hardware on my desk has not gotten worse. CPUs are faster, displays are sharper, batteries hold up longer than they used to. And yet the average lifespan of a “smart” thing I buy in 2026 feels shorter than what I was buying ten years ago. The thing breaks first from the network, not from the silicon.

The pattern is familiar by now:

The companion app is removed from the store, or rewritten in a way that drops support for older models.
The TLS certificate the device was pinned to expires, and the firmware that would let it accept a new one was never updated.
The cloud API the device talks to is versioned out of existence. A v1 endpoint becomes v2, and v1 is shut down because keeping it alive is unbillable maintenance.

A speaker, a thermostat, a camera, a “smart” anything. None of these failure modes have anything to do with whether the device can still do its job. They have to do with whether anyone, somewhere else, is still willing to keep the other end of the wire alive.
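
The certificate case is the easiest one to show. The expiry date is printed inside the certificate on the day the device ships. A sketch using Python's ssl module to read a notAfter string in the format that getpeercert() reports. The date below is a made-up example, not a real device's certificate.

```python
import ssl
from datetime import datetime, timezone


def cert_is_expired(not_after: str, now: datetime) -> bool:
    """True if the certificate's notAfter time is at or before `now`."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return now.timestamp() >= expires_at


# Hypothetical pinned certificate that ran out at the start of 2024.
pinned_not_after = "Jan  1 00:00:00 2024 GMT"
today = datetime(2026, 5, 3, tzinfo=timezone.utc)
print(cert_is_expired(pinned_not_after, today))
```

The check itself is trivial. The failure is that nobody ships the firmware update that would let the device accept the replacement certificate.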

AI accelerated the pattern without being its cause. Every product launch in the last two years has the same slide: “Powered by AI”, which usually means an API call to a cloud LLM the vendor pays for, the user does not control, and the headline feature does not work without. Cut the connection and the feature is gone. Often the basic feature is gone too, because the product team built the whole flow assuming the cloud would always answer.

The same dynamic plays out at every floor. The vendor depends on the open source it ships on top of, and gets squeezed when upstream changes direction (the supply-chain story). The engineer using AI tools depends on a token economy that is currently subsidised, and will get squeezed when investors want their money back (the 100x story). The buyer of the smart speaker depends on the vendor staying alive long enough to keep the cloud running. Three floors, one shape. Dependency on something you do not own is the constant. The label on the box changes.

A worse mainframe, with prettier hardware

The easy version of this argument is “we are going back to the 70s mainframe.” It misses what got worse. The 70s deal was structurally better than what is shipping now in three specific places.

The 70s mainframe sat in your building. You owned it, or your employer did, and the cable to the terminal ran through the walls of a building you had keys to. Today’s mainframe is offshore, owned by a company whose name you did not choose, reachable only over a TLS pipe whose certificates you do not control. The “central operator” is not down the hall. It is in someone else’s jurisdiction.

The 70s terminal was honestly dumb. A keyboard, a screen, a serial line, no pretence. You knew what you had. Today’s terminal is shaped like a real computer and behaves like one until you cut its connection. The dishonesty of selling a thin-client as a personal device is the part that did not exist before.

The 70s mainframe came with a service contract you signed. SLAs, support, a vendor who could be sued. Today’s “service” is a TOS the vendor can change weekly, with a clause letting them brick a device they do not like, and you signed it without reading because the alternative was that the box you paid for did not turn on.

Three ways the deal got worse: location, honesty, accountability. The mainframe analogy understates the regression.

What “complete” used to mean

You thought I was going to channel the Woz. Trying to be a bit more original.

The Casio F-91W has been in continuous production since 1989. It tells the time, runs a stopwatch and an alarm, lights up when you press the button, and lasts about seven years on a battery. That is the entire feature set. There is no companion app. The watch will keep doing exactly what it was sold to do until the LCD fails or the case cracks, on a part nobody is going to deprecate, against a service nobody operates.

A complete machine, in the sense I care about. Not “open” in the GPL sense, not “user-modifiable” in the kit-computer sense. Honest about what it is. The design refused to add things. Nearly forty years later the refusal looks like prescience.

The iPod 5.5g is the same kind of object on a shorter timeline. Music on the disk, codec in the firmware, click wheel and headphone jack with no server in the loop. If iTunes vanished tomorrow, syncing becomes annoying, but the device still plays. Communities are already keeping that path open: Rockbox, libimobiledevice, third-party sync tools. The world where you sync an iPod without Apple exists because the device is honest about what it does.

A streaming-only speaker is the inverse. The codecs may live locally, but the catalogue does not, the auth does not, the discovery protocol does not. None of those parts belong to the buyer. The day any one of them goes away, the speaker stops being a speaker.

The Casio and the iPod 5.5g are the floor the rest of the catalogue should be measured against. Most of what is shipping in 2026 fails the comparison and sits closer to the streaming speaker.

Linux is real, and not enough

“Use Linux” is a real answer for laptops, servers, single-board computers, and a handful of phones. It is not an answer for the smart oven, the robot vacuum, the doorbell, or the infotainment system in your car. Most appliances ship as the vendor decided or they do not ship at all. The escape hatch exists. The long tail of household objects is outside it.

Practical advice for non-technical buyers: pick devices whose core function does not depend on the cloud. Pick the dumb version when the dumb version still exists. “Smart” in 2026 mostly means “rented.”

What I actually do

I work on AI tooling and run cloud services for a living. The tension is real and I am inside it.

On the personal side I lean local-first. Higgins runs a 7B model on my laptop with a local SQLite. If the cloud disappears, the assistant still answers. Costs capability, gains independence. There is also a side benefit nobody prints on the box: running the thing yourself teaches you a thing or two about how it works. The cloud version abstracts away the parts you would otherwise be forced to understand. I value both more than I used to.
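
The shape of local-first, reduced to a sketch. This is not Higgins and there is no model in it. The table layout, the function names, and the optional cloud callable are all assumptions made up for the example. Only the order of operations matters: the local answer exists first, and the cloud is an optional extra that is allowed to fail.

```python
import sqlite3
from typing import Callable, Optional


def open_local_store() -> sqlite3.Connection:
    """Create the local store (in memory here; a real setup would use a file)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE notes (topic TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return db


def answer(
    db: sqlite3.Connection,
    topic: str,
    cloud: Optional[Callable[[str], str]] = None,
) -> str:
    """Answer from local data; use the cloud only if present and reachable."""
    row = db.execute("SELECT body FROM notes WHERE topic = ?", (topic,)).fetchone()
    local = row[0] if row else "no local note"
    if cloud is None:
        return local
    try:
        return cloud(topic)
    except OSError:  # covers ConnectionError and timeouts
        return local
```

Pass no cloud function, or one that raises, and the caller still gets the local answer. That is the whole property.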

The iPod will keep playing. The list of products I could say the same about in 2045 is shrinking every year.