21 Comments
Hitesh Joshi:

I am just wondering: if we simulated the development of SQLite over its 26 years and gave the LLM step-by-step direction on every design decision they made, would it still produce plausible code over correct code?

Hōrōshi バガボンド:

solid follow-up topic!

jane madden:

1000% the best read i’ve seen on the most critical problem facing agentic programming in 2026.

Hōrōshi バガボンド:

thanks! appreciate it. took a lot of time to compile and dig into

Jack Timonen:

Out of interest, how long did it take? :)

Hōrōshi バガボンド:

Been researching this on the side for some weeks now. Then, just recently, I stumbled across some pieces that I think made it "pop".

Been integrating LLMs into my workflows for a few months myself now, and all these little quirks and incidents you run into here and there amounted to me going on a quest to find out what's up.

Jack Timonen:

Makes sense, thanks for sharing.

Jack Timonen:

Would you ever consider taking on an external project? We're scoping benchmark studies.

Danila Medvedev:

This is true. Written (and researched) by someone who clearly understands coding, architecture and LLMs (from both a theoretical and a practical standpoint).

I came here via Futurism > The Register > poor AI-rewrite of a story on Medium (by Zoom In AI).

There was an indicative comment there: "Speaking as retired cto, 50 year sw veteran. I was stunned when I typed "write a c compiler for the nand2tetris hack cpu". And it did. "Write an emulator so you can test it" so it did. "Adapt the test suite from the writing a c compiler book to run tests for this compiler" and it did. If you think AI is dumb and is not very good at writing sw you aren't looking in the right places. Think where it will be in 10 years."

So a guy with 50 years of experience (if he is not lying) can totally fail to understand what is going on here. He is also likely lying about what the LLM did.

And it's clear that the hype cycle, fueled by FOMO and a lack of understanding of higher-level thinking, distorts decisions, strategies, etc. But we don't have a way to give more weight to expert opinions and analysis that are actually based on solid understanding, such as in Horoshi's case. Interesting how it plays out...

Hōrōshi バガボンド:

Thanks for the warm feedback!

Yeah, it's hard not to get sucked into the ongoing 'LLM psychosis'. It's everywhere and the numbers just get bigger and bigger. I'm just trying to find the signal within all the noise rn.

Anthropic's compiler rewrite was the same way. It looked solid, but under real scrutiny it unraveled quickly. IMO the tools as they are rn are most dangerous to the people who don't have the skillset to verify.

Danila Medvedev:

Also a thing to look at is FuturEval benchmarks for Metaculus measuring the ability of LLM systems to predict the future (presumably including the future of AI development, a hot topic). I guess if we only make the benchmark number go up a bit more (with the help of some more compute and billions of dollars) we will be able to solve the halting problem. :)

Hōrōshi バガボンド:

Great points, both of them

The MES example is perfect. That's the type of problem where "generate code that compiles" is maybe 2% of the actual challenge. The remaining 98% is understanding process flows, equipment integration, regulatory constraints, failure modes that only show up at scale, etc. Writing code isn't the hard part here lol.

I keep finding the same pattern in my current trial runs. The model can write a retry wrapper that looks textbook-correct. But it won't add jitter unless you tell it there are 500 clients behind a load balancer. It won't add distributed locking unless you tell it there are 3 replicas. I just published an article (tho just the baseline rn really) about this here https://blog.katanaquant.com/p/the-bug-that-shipped?utm_campaign=post&utm_medium=web
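The jitter point can be made concrete. Below is a minimal sketch in Python of a retry wrapper with capped exponential backoff and full jitter, the kind of detail described above that a model tends to omit unless told about the 500 clients; the function name and parameters are illustrative, not from the article:

```python
import random
import time

def retry_with_jitter(fn, retries=5, base=0.5, cap=30.0):
    """Call fn, retrying on failure with capped exponential backoff plus full jitter.

    Without jitter, 500 clients behind a load balancer that fail at the
    same moment will all retry at the same moment, hammering the service
    in synchronized waves (a thundering herd).
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the backoff ceiling,
            # so concurrent clients spread their retries out instead of syncing up.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

A textbook-correct version without the `random.uniform` call would pass any local test and still cause the synchronized-retry failure mode in production, which is exactly the "plausible vs. correct" gap the article describes.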

Re: the retired CTO and "write a C compiler": I actually don't doubt the model did something. Tho the question is what happens next. Does it handle edge cases in type coercion? Does the emulator faithfully reproduce timing behavior? Does the test suite cover the cases that actually matter? "It produced output" is not the same as "it works" imo. But asking that question requires the expertise to verify them.

Danila Medvedev:

Yes. And I was just on a call discussing why the full set of MES software (microelectronics production support software) can't easily be recreated even if the government gives out billions saying "we need locally produced chips". The systems-thinking lens is not instilled in people enough.

Ahmet Sezen:

This is an amazing article, thank you.

Kathane:

Which models were used?

All of the sources cited have used outdated models, from pre-reasoning ones like Llama 2 to GPT-5 (which is 8 months old already).

Is the gap still massive, or is it closing?

Hōrōshi バガボンド:

Fair question. The rewrite's git history shows heavy use of Opus 4.5+.

Some of the cited studies did test older models but also included current gen. Public evaluations feel like they are lagging a bit here.

On public benchmarks the gap is closing, tho it’s less clear for real-world complex dev work imo.

My current working thesis from a follow-up experiment is that the model matters less than how you frame the work. I got wildly different results with the same model but different approaches on the same codebase.

Imo at these stakes the “alpha” lies in the domain knowledge of the operator.

Noah's Titanium Spine:

So much great analysis to reach the wrong conclusion

> LLMs are useful.

No, they are not.

Hōrōshi バガボンド:

That's a bit absolute for my taste. There's definitely lots of cool and fun stuff you can play around with on the side just by prompting here and there. But the higher up the ladder your skillset, the more diminishing the returns. I agree that there's little to no point at all in even considering prompting them with a fully laid out spec to write a database when you're already a db expert. You'd basically have to shove all your knowledge down its throat first, so it's probably easier to just go ahead and do it yourself.
