I randomly clicked and scrolled through the source code of Stavrobot - The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security. [1] and that is not great code. I have not used any AI to write code yet but considered trying it out - is this the kind of code I should expect? Or maybe the other way around, has someone an example of some non-trivial code - in size and complexity - written by an AI - without babysitting - and the code being really good?

[1] https://github.com/skorokithakis/stavrobot

I would suggest not delegating the LLD (class / interface level design) to the LLM. The clankers are super bad at it. They treat everything as a disposable script.

Also document some best practices in AGENT.md or whatever it's called in your app.

E.g.

    * All imports must be added on top of the file, NEVER inside the function.
    * Do not swallow exceptions unless the scenario calls for fault tolerance.
    * All functions need to have type annotations for parameters and return types.
And so on.
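A quick sketch of what code following those three rules looks like in practice - the function names here are made up for illustration, not taken from any real project:

```typescript
// Imports at the top of the file, NEVER inside a function.
import { readFileSync } from "node:fs";

// All parameters and the return type are annotated.
function countLines(text: string): number {
  return text.split("\n").length;
}

// No blanket try/catch: a missing file is not a fault-tolerance
// scenario here, so the exception propagates to the caller,
// which can decide how to handle it.
function countLinesInFile(path: string): number {
  return countLines(readFileSync(path, "utf8"));
}
```

Putting rules like these in the agent instructions file tends to work better than correcting each violation after the fact.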

I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.

What actually stood out to me is how bad the functions are: they have no structure. Everything is just bunched together, one line after the other, whatever it is, with almost no function calls to provide any structure. On top of that, a ton of logging and error handling is mixed in everywhere, completely obscuring the actual functionality.

EDIT: My bad, the code eventually calls into dedicated functions from database.ts, so those 200 lines are mostly just validation and error handling. I really just skimmed the code and the amount of it made me assume that it actually implements the functionality somewhere in there.

Example: Agent.ts, line 93, function createManageKnowledgeTool() [1]. I would have expected something like the following, not almost 200 lines of code implementing everything in place. This also uses two stores of some sort - memory and scratchpad - and they are not abstracted out either; upsert and delete deal with both kinds directly.

  switch (action)
  {
    case "help":
      return handleHelpAction(args);

    case "upsert":
      return handleUpsertAction(args);

    case "delete":
      return handleDeleteAction(args);

    default:
      return handleUnknownAction(args);
  }
[1] https://github.com/skorokithakis/stavrobot/blob/master/src/a...
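On the two-stores point, a minimal sketch of what abstracting them out might look like - all names here (KnowledgeStore, InMemoryStore, handleUpsertAction) are hypothetical, not taken from the actual repository:

```typescript
// Hypothetical abstraction over the two stores the tool touches.
// The real "memory" and "scratchpad" stores would each implement
// this same interface, so the handlers never deal with either
// kind directly.
interface KnowledgeStore {
  upsert(key: string, value: string): void;
  delete(key: string): boolean;
}

// A trivial in-memory implementation for illustration.
class InMemoryStore implements KnowledgeStore {
  private items = new Map<string, string>();

  upsert(key: string, value: string): void {
    this.items.set(key, value);
  }

  delete(key: string): boolean {
    return this.items.delete(key);
  }
}

// The action handler only knows about the interface, not about
// which store it is operating on.
function handleUpsertAction(
  store: KnowledgeStore,
  key: string,
  value: string
): string {
  store.upsert(key, value);
  return `upserted ${key}`;
}
```

With this shape, the 200-line tool body collapses into the switch statement above plus one small handler per action, and swapping out a store implementation no longer touches the dispatch logic.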

From my experience, you kinda get what you ask for. If you don't ask for anything specific, it'll write as it sees fit. The more you involve yourself in the loop, the more you can get it to write according to your expectation. Also helps to give it a style guide of sorts that follows your preferred style.

> is this the kind of code I should expect?

Sadly yes. But it "works", for some definition of working. We all know it's going to be a maintenance nightmare given the gigantic amount of code and projects now being generated ad infinitum. As someone commented in this thread: it can one-shot an app showing restaurant locations on a map and put a green icon if they're open. But don't expect good code, secure code, or performant code, and certainly not "maintainable code".

By definition, unless the AIs can maintain that code, nothing is maintainable anymore: the reason being the sheer volume. Humans who could properly review and maintain code (and that's not many) are already outnumbered.

And as more and more become "prompt engineers" and are convinced that there's no need to learn anything anymore besides becoming a prompt engineer, the amount of generated code is only going to grow exponentially.

So to me it is the kind of code you should expect. It's not perfect. But it more or less works. And thankfully it shouldn't get worse with future models.

What we now need is tools, tools, and more tools to help keep these things on track. If we are ever to get some peace of mind about the correctness of this unreviewable generated code, we'll need to automate things like theorem provers and code coverage (which are still nowhere to be seen).

And just like all these models are running on Linux and QEMU and Docker (dev containers) and heavily use projects like ripgrep (Claude Code insists on having ripgrep installed), I'm pretty sure the tools these models rely on, and shall rely on, to produce acceptable results are going to be mostly written by humans.

I don't know how to put it nicely: an app showing a green icon next to open restaurants on a map ain't exactly software to launch a rocket or pilot an MRI machine.

BTW: yup, I do have and use Claude Code. Color me both impressed and horrified by the amount of "working" yet unmaintainable mess it can spout. Everybody who understands something about software maintenance should be horrified.

In my experience, it's in all language models' nature to maximize token generation. They have been natively incentivized to generate more where possible. So if you don't set your parameters tightly, it will let loose. I usually put in a hard requirement for efficient code (less is more) and it gets close to how I would implement it. But like the previous comments say, it all depends on how deeply you integrate yourself into the loop.

>> They have been natively incentivized to generate more where possible

Do you have any evidence of this?

The cloud providers charge per output token, so aren't they then incentivized to generate as many tokens as possible? The business model is the incentive.

This is only true in some cases though and not others. With a Claude Pro plan, I'm being billed monthly regardless of token usage so maximizing token count just causes frustration when I hit the rather low usage limits. I've also observed quite the opposite problem when using Github's Copilot, which charges per-prompt. In that world, I have to carefully structure prompts to be bounded in scope, or the agent will start taking shortcuts and half-assing work when it decides the prompt has gone on too long. It's not good at stopping and just saying "I need you to prompt me again so I can charge you for the next chunk of work".

So the summary of the anecdata, to me, is that the model itself certainly isn't incentivized to do anything in particular here; it's the tooling that's putting its finger on the scale (and different tooling nudges things in different directions).

> and that is not great code

When you say "is not great code" can you elaborate? Does the code work or not?

I don't know, I would assume it works but I would not expect it to be free of bugs. But that is just the baseline for code: being correct - up to some bugs - is the absolute minimum requirement, and code quality starts from there - is it efficient, is it secure, is it understandable, is it maintainable, ...

So do you expect it not to be free of bugs because you've run a comprehensive test on it, read all of the code yourself, or are you just concluding that because you know it was generated by an LLM?

It has not been formally verified, which is essentially the only way to achieve code without defects with reasonable confidence. Several studies have found roughly between one and twenty bugs per thousand lines of code in any software. This project has several thousand lines of code, so I would expect several bugs if it had been written by humans, and I have no reason to assume that large language models outperform humans in this respect, not least because they are trained on code written by humans and have been trained to generate code as humans write it.
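As a back-of-the-envelope illustration of that estimate - the defect rates are the ones from the studies mentioned above, and the 5,000-line project size is a made-up assumption:

```typescript
// Rough defect estimate: between 1 and 20 bugs per thousand
// lines of code (KLOC), per the studies cited in this thread.
function expectedBugs(linesOfCode: number): [number, number] {
  const kloc = linesOfCode / 1000;
  return [Math.round(kloc * 1), Math.round(kloc * 20)];
}

// E.g. a hypothetical 5,000-line project:
// expectedBugs(5000) → [5, 100]
```

So even at the optimistic end of the range, "several thousand lines" implies a handful of latent bugs, regardless of who or what wrote the code.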

But you said "it's not great code" and then said "i don't know", so your idea of it being "not great code" is purely speculative and totally unfounded.

No, my judgment of not great code is not based on what the code does - and if it does so correctly - but on how the code is written. Those are independent things, you can have horrible code that does what it is supposed to do but you can also have great code that just does the wrong thing [1].

[1] I would however argue the latter is rarer, as it requires competent developers; even then, it does not preclude some misunderstanding of the requirements.

It works really well, multiple people have been using it for a month or so (including me) and it's flawless. I think "not great" means "not very readable by humans", but it wasn't really meant to be readable.

I don't know if there are underlying bugs, but I haven't hit any, and the architecture (which I do know about) is sane.

Pine Town [1], the "whimsical infinite multiplayer canvas of a meadow", also looks like pure slop.

[1] https://pine.town/

What were you hoping to achieve with this comment?

To help readers view the content of the article with scepticism, given the results the advice in it seems to produce.

"Guys this writer must be terrible, I saw one of his books and the cover wasn't good".

It's the kind of code you should expect if you don't run a harness that includes review and refactoring stages.

It's by no means the best LLMs can do.

You can make it better by investing a lot of time playing around with the tooling so that it produces something more akin to what you're looking for.

Good luck convincing your boss that this ungodly amount of time spent messing around with your tooling, for an immeasurable improvement in your delivery, is time well spent, as opposed to spending that same amount of time delivering results by hand.

You literally have it backwards. It's the bosses that are pulling engineers aside and requiring the adoption of tooling when they're not even sure the increase in productivity justifies the cost of setting up the new workflows. At least anecdotally, that's the case.

I don't disagree with that, my claim is that bosses don't know what they're doing. If all of the pre-established quality standards go out the window, that's completely fine with me, I still get paid just the same, but then later on I get to point to that decision and say "I told you so".

Luckily for me, I'm fortunate enough to not have to work in that sort of environment.

I also managed to find a 1000-line .cpp file in one of the projects. The article's content doesn't match his apps' quality. They don't bring any value. His clock looks completely AI generated.

Remember you're grinding your anti-LLM axe against something a real person made, and that person read your comment.

I don't think it's fair to assume any negative comment comes from an anti-LLM axe to grind. I seriously gave you the benefit of the doubt; that was the whole reason I even looked further into your work.

There's no shame in being critical in today's world. Delivering proof is something that holds extra value, and if I were to write an article about the wonderful things I've created, I'd be extra sure to show them.

I looked at your clock project and when I saw that your updated version and improved version of your clock contained AI artifacts, I concluded that there's no proof of your work.

Sorry to have made that conclusion and I'm sorry if that hurt your feelings.

Saying things like "there's no proof of your work" is the anti-LLM axe. Yes, it's all written by LLMs, and yes, it's all my work. Judge it on what it does and how well it works, not on whether the code looks like the code you would have written.

Is it your work? What did you bring to the table? Because if we're going to analyze design, then code is one function of that design.

For example, you talk about how the code is secure. How do you prove that it is secure?

The same way you prove your OSS code is secure.

People here see an LLM-assisted project and suddenly they've never written a bug in their life.

I cannot empirically prove that my OS is secure, because I haven't written it. I trust that the maintainers of my OS have done their due diligence in ensuring it is secure, because they take ownership over their work.

But when I write software, critical software that sits on a customer's device, I take ownership over the areas of code that I've written, because I can know what I've written. They may contain bugs or issues that I may need to fix, but at the time I can know that I tried to apply the best practices I was aware of.

So if I ask you the same thing, do you know if your software is secure? What architecture prevents someone from exfiltrating all of the account data from pine town? What best practices are applied here?

I didn't say OS, I said OSS. Open-source software.

Fair mistake on my end; I'm aware of what OSS means, but my eyes have a tendency to skip a letter or two. The same argument applies: if I write something and release it to the OSS community, there's going to be an expectation that A) I know deeply how it works and B) I know whether it's reasonably secure when it's dealing with personal data. They can verify this by looking at the code, independently.

But if the code is unreadable and I can't make a valid argument for my software, what's left?

Are you saying you know your code has exactly zero bugs because you wrote it? That's obviously absurd, so what you're really saying is "I'm fairly familiar with all the edge cases and I'm sure that my code has very few issues", which is the same thing I say.

Regardless, though, this argument is a bit like tilting at windmills. Software development has changed, it's never going back, and no matter how many looms you smash, the question now is "how do we make LLM-generated code safer/better/faster/more maintainable", not "how do we put the genie back in the bottle?".

Also I will give myself credit for using three analogies in two sentences.

You're missing the point. I don't care who wrote it, I want to know if it works.

Also, you didn't address my remarks about your clock. Can you show me a picture of it working?

But you didn't say anything about wanting to know how it works, your comment was:

> The article's content doesn't match his apps quality. They don't bring any value. His clock looks completely AI generated.

I don't understand your point about proof. After more than 120 open-source projects, you think I'm lying about the fact that my clock works? All the tens of projects I've written up on my site over decades, you latch on to the one I haven't published yet as some sort of proof that I'm lying? I really don't understand what your point is.

Here: https://immich.home.stavros.io/share/_k403I3s3cON-8oL5yP_QXY...
