Are these synthetic or real-world benchmarks?

Answering myself: “Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism”

Not gonna start looking for a job any time soon

Example I chose at random:

> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).

So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
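For a sense of scale, here's a minimal sketch of what an answer to that exercise might look like. The function name `hex_to_decimal` and the choice to raise `ValueError` on bad input are my own assumptions, not taken from the Exercism spec or from Aider's benchmark harness.

```python
# Hypothetical solution sketch; names and error handling are assumptions.
HEX_DIGITS = "0123456789abcdef"


def hex_to_decimal(text: str) -> int:
    """Convert a hexadecimal string to an int without using int(text, 16)."""
    if not text:
        raise ValueError("empty string is not a hexadecimal number")
    value = 0
    for ch in text.lower():
        # The index of the character in HEX_DIGITS is its numeric value.
        digit = HEX_DIGITS.find(ch)
        if digit == -1:
            raise ValueError(f"invalid hexadecimal digit: {ch!r}")
        # Shift the running total one hex place left, then add the new digit.
        value = value * 16 + digit
    return value


print(hex_to_decimal("10af8c"))  # 1093516
```

That's about a dozen lines of logic, which is roughly the size of problem the benchmark is measuring.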

Yeah but programming isn't about solving problems that were solved millions of times already. I mean, web dev kind of is, but that's not the point. If a problem is solved, then it's just a matter of implementing the solution and anyone can do that given the proper instructions (even without understanding how or why they solve the problem).

I've formalized a lot of stuff I didn't understand just by copying the formulas from Wikipedia.

As long as LLMs are not capable of proper reasoning, they will remain a gimmick in the context of programming.

They should really just focus on refactoring benchmarks across many languages. If an AI can refactor my complex code properly without changing the semantics, it's good enough for me. But that unfortunately requires such a high-level understanding of the codebase that with the current tech it's just impossible to get a half-decent result in any real-world scenario.

I use Claude for coding and it's fantastic. I definitely have outsourced a lot of my coding to it.

What's the (current) best way to integrate it? VS Code extension? Other IDE?

I'll throw this out here as well: Is there any decent alternative to GitHub Copilot when using Visual Studio? (Pretty happy with it to be fair, but would be open to trying others.)

Supermaven is really good. I am a paying user of Supermaven.

I use Cursor (cursor.com) and it's fantastic.

Fellow Cursor user here, I'm very new to it. I am getting some very convenient and welcome autocomplete. I am also getting quite a lot of bad autocomplete suggestions, which require cognitive overhead and context switching to evaluate. So I am thus far not fully convinced. Any tips for getting the most out of Cursor?

Huge seconding of Cursor.

Aider, created by the originator of this very comment thread.

Sourcegraph Cody.

Sure, it can do coding, but can it do software engineering?

What exactly is left when we remove coding from software engineering? Could it be handled by a manager? Or perhaps by a single senior SWE who could now perform the work of an entire team using these rapidly advancing AI coders?

for a lot of tasks that aren't as cut & dry, i often find myself having to provide it pseudo code, which it can then one-shot to working code.

don't get me wrong, it's still a massive upgrade from the pre-sonnet era, but i still don't think it can take a high-level requirement and convert it into a working project... yet

> but i still don't think it can take a high-level requirement and convert it into a working project.

It cannot; you need to hand-hold it. To make something larger than an (albeit good-looking) to-do app, you don't need to write code, but you do need to be able to review and debug code and make the architectural decisions. Otherwise it'll simply loop forever.

It’s a good question. I would ask…

(1) Sure, it can tell you how to write new code in response to a prompt about your current local problem, but

(2) can it reason about an entire code base of known and unknown problems, and use that basis to figure out solutions to the unknowns such that you delete code and collapse complexity?

The software equivalent of realising that if you subtract xy from this:

  x^2 + 3xy + y^2

You can turn it into a much neater version:

  (x + y)^2 + xy

…but doing that with 100k tokens of code instead of a handful of algebra tokens.
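Spelling out the step, in case the grouping isn't obvious: setting one xy aside leaves a perfect square, and the leftover xy is tacked back on at the end.

```latex
\begin{align*}
x^2 + 3xy + y^2 &= (x^2 + 2xy + y^2) + xy \\
                &= (x + y)^2 + xy
\end{align*}
```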

I haven't had much luck with architecture stuff. Maybe I'm holding it wrong.

The new version is already in Cursor and it's outstanding.

Can code at mid-level now. Almost.