> The difference between “jmp $+15” and “jmp $+16” is inscrutable
I don't see why that's the case. LLM trained on binary would totally see it, not?
Also the tool can also be running the test and a debugger.
> I don't see why that's the case. LLM trained on binary would totally see it, not?
It would not. You find the correct version by counting the number of bytes to the destination. LLMs are famously bad at this kind of problem (counting).
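To make the counting concrete, here is a minimal sketch. It assumes NASM-style syntax, where `$` is the address of the start of the jump instruction itself; the function name and the instruction lengths are made up for illustration. The operand is just the sum of the byte lengths of everything between the jump and its destination, so a single instruction growing by one byte is the whole difference between `jmp $+15` and `jmp $+16`.

```python
def jump_operand(instr_lengths, jump_index, target_index):
    """Return N such that `jmp $+N` at jump_index lands on target_index.

    Assumes NASM-style `$` (address of the start of the jump instruction).
    instr_lengths holds the encoded size, in bytes, of each instruction.
    """
    jump_addr = sum(instr_lengths[:jump_index])
    target_addr = sum(instr_lengths[:target_index])
    return target_addr - jump_addr


# Hypothetical instruction sizes in bytes; index 0 is a 2-byte short jmp.
before = [2, 5, 3, 5, 1]
# Same code, but the fourth instruction now needs a 6-byte encoding.
after = [2, 5, 3, 6, 1]

print(jump_operand(before, 0, 4))  # operand for the original layout
print(jump_operand(after, 0, 4))   # operand after one instruction grew by a byte
```

Nothing in the text of the jump itself tells you which number is right; you only know by adding up the sizes of the instructions in between. For a short jump, the byte actually encoded is relative to the end of the 2-byte jump, so it is a further 2 less than N.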
> Also the tool can also be running the test and a debugger.
The test needs to provide a good amount of signal. That’s too hard if you are throwing machine code at the wall.
In order for debuggers to work, you need some kind of model that describes what the code should do and what state the computer should be in after each instruction. That model is high-level code.
I can understand the intuitive appeal of training LLMs with machine code, but all of my experience with LLMs suggests that they are incredibly ill-suited to the task, and we just don’t have the capacity to train them to make useful machine code.
Can "LLMs are bad at counting" be generalized to "LLMs are better at complex stuff but make more mistakes on simple stuff"?
I would phrase it as "LLMs are good at big picture stuff and bad at fine detail", or to put it another way, they're accurate, but imprecise and with low reproducibility.
It is my experience that it's the opposite. LLMs are very very precise but wildly inaccurate. They might give you 17 significant digits but be off by 10 orders of magnitude, to use a metaphor.
Sounds like we're in agreement, then. The 7 digits it got correct are the big picture, and the rest are the details. Are you disagreeing with my statement or with my usage of "accurate" and "precise"?
But where does that leave us when programmers treat themselves as architects with the AI doing the drudge work? As seems to be the fashion.
It then means you have 2 parties focussing on the big picture and no one focussing on the details.
I said "big picture stuff", but I guess I should have said "broad strokes". The truly correct answer is probably similar to what the model will answer, and if your problem is such that it can work with small imperfections in a solution, then the LLM helps. If the solution needs to be exactly right, then it will probably fail.
Yesterday on a whim I tried asking a local model a question about kanji that look different in different fonts despite being the same character (to the point of strokes appearing in completely different directions), and the model hallucinated imgur links to images of the characters. If imgur could work with approximate references to data maybe that would have worked.
It's more that LLMs are better at vague problems with multiple imperfect solutions, and struggle with problems that require precision.
No, I don’t think so. LLMs are good at a lot of simple tasks, but bad at certain simple tasks. Moravec’s paradox in a new iteration.
It applies to humans too. Calculus is “simple” but it takes something like sixteen years to train a human to do it, if all goes well. Meanwhile, most humans think that inverse kinematics is, like, the easiest thing in the world (it’s a super complicated task).
Calculus is definitely the harder task, considering it took a species developing the cognitive capacity for symbolic reasoning for it to show up, whereas any animal can figure out how to position its limbs. Yeah, we figured out how to make CAS programs before inverse kinematics software, but that's because computers were made to solve numerical problems, not to replace the cerebella of chordates.
> Calculus is definitely the harder task,
You’re only evaluating “harder” or “easier” based on the perspective of somebody who has a mammalian brain with millions of years of selective pressure to make it suitable for solving inverse kinematics problems.
The point here is that when we start constructing agents or tools with different architectures to ourselves, it makes sense to reevaluate notions of whether something is ‘hard’ or ‘easy’. LLMs are bad at counting not because counting is hard, but because their architecture makes it hard.
I'm evaluating them using an objective metric, which is how long each took to arise in the universe. It could have never been the case that calculus arose before inverse kinematics, because a thing like that could not interact with the real world.
Also, I suspect you're comparing dissimilar things, because in one case you're looking at a brain doing both inverse kinematics and "calculus" (sense 1), and in the other you're looking at a computer doing both inverse kinematics and "calculus" (sense 2). The kind of calculus a CAS does is not the same kind that a human does. It's less versatile, for one.
> The point here is that when we start constructing agents or tools with different architectures to ourselves, it makes sense to reevaluate notions of whether something is ‘hard’ or ‘easy’.
Well, no, because when someone says that calculus is hard and moving their arms is easy, they're not talking about how hard it was to create each functionality, they're talking about how hard it is to employ each. We would need to ask a computer how hard it thinks the tasks it does are to do.
> I'm evaluating them using an objective metric,
I don’t think the metric is at all reasonable, and the fact that it’s “objective” doesn’t make up for its other shortcomings. I don’t think we have a basis for agreement here—I think you’ve framed the argument in a way that supports a “calculus is hard” conclusion merely by defining “hard” in such a way that supports your conclusion from the start, but I think that approach is only useful as a way to win an argument, and we’ve failed to share ideas once you start using that tactic.
> I think you’ve framed the argument in a way that supports a “calculus is hard” conclusion merely by defining “hard” in such a way that supports your conclusion from the start
It seems to me you're the one who first did that, by equivocating between what is easier to do and what is easier to make a machine do.
> we’ve failed to share ideas once you start using that tactic
Well, I certainly don't agree with that.
Even if it could, it would be ridiculously token-inefficient to update a huge number of addresses whenever some small change is made in the middle of a binary.
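A rough sketch of why, under simplifying assumptions: jumps are modeled as hypothetical (source address, destination address) pairs, and inserting `n` bytes at `insert_at` shifts everything at or after that address. Every jump that crosses the insertion point then needs a new relative offset, even though none of those jumps was the thing being edited.

```python
def shift(addr, insert_at, n):
    """Address of a byte after n bytes are inserted at insert_at."""
    return addr + n if addr >= insert_at else addr


def jumps_needing_patch(jumps, insert_at, n):
    """Return (src, dst, old_rel, new_rel) for every jump whose offset changes."""
    patched = []
    for src, dst in jumps:
        old_rel = dst - src
        new_rel = shift(dst, insert_at, n) - shift(src, insert_at, n)
        if new_rel != old_rel:
            patched.append((src, dst, old_rel, new_rel))
    return patched


# Hypothetical (source, destination) byte addresses of relative jumps.
jumps = [(0, 40), (10, 100), (50, 20), (60, 90), (120, 5)]

# Insert 3 bytes at address 30: only the jump lying entirely past it is unaffected.
for src, dst, old_rel, new_rel in jumps_needing_patch(jumps, insert_at=30, n=3):
    print(f"jump at {src} -> {dst}: offset {old_rel} becomes {new_rel}")
```

With labels in assembly or a higher-level language, the assembler or linker redoes this bookkeeping for free. A model emitting raw bytes would have to regenerate every affected offset itself.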