Effectively everyone is building the same tools with zero quantitative benchmarks or evidence behind the ideas … this entire space is a nightmare to navigate because of it. Who cares, without proper science? I look through this website and it reads like a preview for a course I'm supposed to buy … when someone builds something with these sorts of claims attached, I expect some real graphs ("here is how many times the model deviated from the spec before we added error correction …").

What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.

I may be alone in this, but it drives me nuts.

Okay, so with that in mind, it amounts to hearsay ("these guys are doing something cool"). Why not put up or shut up: either (a) evaluate the ideas in a rigorous, quantitative way, or (b) apply the ideas to produce a "hard" artifact (analogous, e.g., to the Anthropic C compiler or the Cursor browser) with a reproducible pathway to its generation?

The answer seems to be that (b) is impossible (as long as we're on the teat of the frontier labs, which disallow the kind of access that would make (b) possible), and the answer for (a) is "we can't wait, we have to get our names out there first".

I’m disappointed to see these types of posts on HN. Where is the science?

Honestly I've not found a huge amount of value from the "science".

There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.

Have you seen any papers that really elevated your understanding of LLM productivity with real-world engineering teams?

> There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.

This is a general problem with papers measuring productivity in any sense. It's hard to define what "productivity" even means, and hard to figure out how to measure it. But it's also that any study with worthwhile results will:

1. Probably take some time (perhaps months or longer) to design, get funded, and get through an IRB.

2. Take months to conduct. You generally need to get enough people to say anything, and you may want to survey them over a few weeks or months.

3. Take months to analyze, write up, and get through peer review. That's kind of a best case; peer review can take years.

So I would view the studies as necessarily time-boxed snapshots due to the practical constraints of doing the work. And if LLM tools change every year, like they have, good studies will always lag and may always feel out of date.

It's totally valid to not find a lot of value in them. On the other hand, people all-in on AI have been touting dramatic productivity gains since ChatGPT first arrived. So it's reasonable to have some historical measurements to go with the historical hype.

At the very least, it gives our future agentic overlords something to talk about on their future AI-only social media.

No, I agree! But I don’t think that observation gives us license to avoid the problem.

Further, I'm not sure this elevates my understanding: I've read many posts in this space that could be viewed as analogous to this one (this one is more tempered, of course). Each one has the same flaw: someone is telling me I need to make an "organization" out of agents and that positive things will follow.

Without a serious evaluation, how am I supposed to validate the author’s ontology?

Do you disagree with my assessment? Do you view the claims in this content as solid and reproducible?

My own view is that these are "soft ideas" (GasTown and Ralph fall into a similar category) without rigorous justification.

What this amounts to is "synthetic biology" with billion-dollar probability distributions, where the incentives are set up so that companies are rewarded for conveying that they have the "secret sauce" … for massive amounts of money.

To that end, it’s difficult to trust a word out of anyone’s mouth — even if my empirical experiences match (along some projection).

The multi-agent "swarm" thing (that seems to be the term bubbling to the top at the moment) is so new and frothy that it's difficult to determine how useful it actually is.

StrongDM's implementation is the most impressive I've seen myself, but it's also incredibly expensive. Is it worth the cost?

Cursor's FastRender experiment was interesting too, but also expensive for what was achieved.

I think my favorite example at the moment is Anthropic's $20,000 C compiler from the other day. But they're an AI vendor; demos from non-vendors carry more weight.

I've seen enough to be convinced that there's something there, but I'm also confident we aren't close to figuring out the optimal way of putting this stuff to work yet.

The writing on this website is giving strong web3 vibes to me / doesn't smell right.

The only reason I'm not dismissing it out of hand is basically because you said this team was worth taking a look at.

I'm not looking for a huge amount of statistical ceremony, but some detail would go a long way here.

What exactly was achieved for what effort and how?

This was my reaction as well: a lot of hand-waving and invented jargon reminiscent of the web3 era. Which is a shame, because I'd really like to understand what they've actually done in more detail.

Yeah, they've not produced as much detail as I'd hoped - but there's still enough good stuff in there that it's a valuable set of information.

But the absence of papers is precisely the problem and why all this LLM stuff has become a new religion in the tech sphere.

Either you have faith and every post like this fills you with fervor and pious excitement for the latest miracles performed by machine gods.

Or you are a nonbeliever and each of these posts is yet another false miracle you can chalk up to baseless enthusiasm.

Without proper empirical method, we simply do not know.

What's even funnier about it is that large-scale empirical testing is actually necessary in the first place to verify that a stochastic process is even doing what you want (at least on average). But the tech community has become such a brainless atmosphere, totally absorbed by anecdata and marketing hype, that no one seems to care anymore. It has quite literally devolved into the religious ceremony of performing the rain dance (use AI) because we said so.

One thing the papers help provide is basic understanding and consistent terminology, even when the models change. You may not find value in them, but I assure you that the actual building of models, and of the product improvements around them, is highly dependent on the continual production of scientific research in machine learning, including experiments around applications of LLMs. The literature covers many prompting techniques well, and in a scientific fashion, and many of these have been adopted directly in products (chain of thought, to name one big example: part of the reason people integrate it is not "fingers crossed guys, worked on my query" but that researchers have produced statistically significant results on benchmarks using the technique).

To be a bit harsh, I find your dismissal of the literature here, in favor of hype-drenched blog posts soaked in ridiculous language and fantastical incantations, to be precisely symptomatic of the brain rot the LLM craze has produced in the technical community.
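For concreteness, here is a minimal sketch of what the chain-of-thought reference means in practice: the same question asked directly versus with an instruction to reason step by step. The call_model helper and the prompt wording are hypothetical stand-ins for whatever LLM API you use, not any particular product's implementation; the point is just the prompting pattern the benchmark papers evaluate.

    # Minimal sketch: direct prompt vs. chain-of-thought prompt.
    # `call_model` is a hypothetical placeholder for an LLM API call.

    def call_model(prompt: str) -> str:
        # Placeholder: in practice this would call an LLM endpoint.
        return f"<model answer for: {prompt[:40]}...>"

    question = (
        "A repo has 12 failing tests. Each agent run fixes 3 tests "
        "but reintroduces 1 failure. How many runs until all tests pass?"
    )

    # Direct prompt: just ask for the answer.
    direct_prompt = f"{question}\nAnswer with a single number."

    # Chain-of-thought prompt: ask for intermediate steps before the answer,
    # which is the technique measured in the benchmark papers.
    cot_prompt = (
        f"{question}\n"
        "Think through the problem step by step, showing each intermediate "
        "calculation, then give the final answer on its own line."
    )

    print(call_model(direct_prompt))
    print(call_model(cot_prompt))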

I do find value in papers. I have a series of posts where I dig into papers I find noteworthy and try to translate them into more easily understood terms. I wish more people would do that: it frustrates me that paper authors themselves only occasionally post accompanying commentary that helps explain the paper outside the confines of academic writing. https://simonwillison.net/tags/paper-review/

One challenge we have here is that there are a lot of people who are desperate for evidence that LLMs are a waste of time, and they will leap on any paper that supports that narrative. This leads to a slightly perverse incentive where publishing papers that are critical of AI is a great way to get a whole lot of attention on that paper.

In that way academic papers and blogging aren't as distinct as you might hope!