Hacker News

losvedir 20 hours ago [ - ]

Wow. This is going to be interesting to follow. There's absolutely no way any of this code was reviewed, but maybe we're in a post-human world now where you can trust the models to write and review the code. This is like Gastown but on a higher profile project. Will be fascinating to see how this project is able to add new features going forward (or even _if_ it will be able to).

Does anyone know how exactly Bun is used by Anthropic? Is it a part of Claude Code? I'm more than slightly worried about using Bun going forward myself, but I'm not sure to what extent that applies to using Claude as well.

rafram 19 hours ago [ - ]

> you can trust the models to write and review the code

You definitely cannot!

operatingthetan 19 hours ago [ - ]

Reminds me of going on linkedin and seeing all these sales and product people who are talking big game about engineering now. Well yeah they are definitely producing something but not sure I'd call it "engineering."

gmueckl 19 hours ago [ - ]

You can trust them to flag some things during review that may or may not be relevant. But just like with human review and unit testing, you cannot guarantee the absence of bugs after an LLM code review. It's just another set of (virtual) eyeballs.

rafram 16 hours ago [ - ]

I trust them somewhat to flag bugs. I don't trust them to produce clean, maintainable code - even code maintainable by the LLM itself. Any sufficiently complex LLM changeset can be assumed to contain duplicated logic, method scope creep, and code changes without accompanying documentation changes that the model often will not catch no matter how many rounds of review you run. If those issues make it into a commit, the next time you ask the LLM to update some of the functionality that it introduced earlier, bugs will creep in.

solidasparagus 14 hours ago [ - ]

I find that documentation creep is wildly better in AI coded environments than human ones. You can deterministic force a documentation sync process on every PR, documentation rot has gotten way better.

bhaak 19 hours ago [ - ]

It passed all the tests.

If you can't trust your test suite to catch an automatic language translation you shouldn't trust it at all. :)

user142 18 hours ago [ - ]

Tests can only prove the presence of bugs, but not their absence. If the AI can access the tests, it can easily make them pass by just adding additional if statements. It doesn't mean the code is actually correct.

andrewflnr 17 hours ago [ - ]

What if we only trusted the test suite a reasonable amount, instead of pretending trust must either be blindly total or nonexistent?

debugnik 18 hours ago [ - ]

It also modified many of the tests to make them pass in mischievous ways. You can't trust a test suite to catch regressions if the new version doesn't use the same test suite.

davidatbu 18 hours ago [ - ]

Do you have some examples?

davidatbu 15 hours ago [ - ]

Ah, I just learnt that you don't. Jarred's comment saying exactly that: https://news.ycombinator.com/item?id=48133806

debugnik 7 hours ago [ - ]

I'll actually concede that, on a slower skim, some changes to the test suite and fixtures that first seemed suspicious to me indeed align with what those tests were doing previously, and I wish I could retract that comment.

I still think it's not such an impressive test suite as it's being claimed; which, if this actually works out, should say more about Claude's skill than the people driving it.

davidatbu 6 hours ago [ - ]

Gotcha. I'm genuinely curious: by "impressive", are you referring to coverage? I'd be grateful if you could say a few words about it could be more impressive (e.g, if you indeed meant to talk about coverage, say what functionality/edge cases aren't covered as of now)

debugnik 4 hours ago [ - ]

Our programming languages are bad at specification and verification, so the next best thing is property-testing for modeling (e.g. Hypothesis for Python) or, for the reference implementations, extensive "expect"/snapshot test cases (e.g. Cram).

Instead, I found the bog standard suite with a single case per regression and very few actual modeling, although I wasn't expecting more. (I don't care much for JS, let alone Bun, so I can't point to features I'd like to see better tested, but I'm sure the issue tracker can do that job already.)

To be fair, our whole industry is really bad at this; most test suites are verification theatre, but now that machines can fill out implementations on their own, we should strive to properly model our requirements and limits so they can one shot what we intended. Otherwise we're left in an awkward middle in which we don't add much value over the AI fumbling around.

davidatbu 3 hours ago [ - ]

Thank you!

torben-friis 17 hours ago [ - ]

I think demonstrating broken behavior in the new build would be interesting if you have a non passing test from the original suite

solid_fuel 9 hours ago [ - ]

The entire underlying system has been replaced. The test suite is written around the current fuzzy edges and past problem areas, not every single behavior of the existing platform.

"If you can't trust your test suite to catch a hardware floating point arithmetic bug, you shouldn't trust it at all."

"If you can't trust your test suite to catch a JVM bug, you shouldn't trust it at all."

"If you can't trust your test suite to catch a recurring memory error, you shouldn't trust it at all."

data-ottawa 16 hours ago [ - ]

A wise teacher once told me a good programmer looks both ways when crossing a one way street.

torben-friis 19 hours ago [ - ]

Does anyone know how exactly Bun is used by Anthropic? Is it a part of Claude Code?

It seems to be used by anthropic as a way to shift the discussion window into it being acceptable that you yolomerge millions of lines.

darknoon 19 hours ago [ - ]

the `claude` binary is essentially a packed copy of bun + the js code, so this will replace the native runtime part of claude code.

SwellJoe 18 hours ago [ - ]

How's the test suite?