I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.
Different people just have different concepts of what's garbage and what's not.
There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima-facie.
For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple of years ago and now half of what's in their PRs is unused dead code.
“One man’s trash is another man’s treasure” takes on a new meaning in today’s agentic coding world.
I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.
Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.
And naturally I spend more time on manual verification in the end as much less of it is happening during the coding process.
> ... applying the same steps and rules n times
I do this too, with a document written for this purpose.
> ... a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals.
That is a good point. I'm mostly using C, which seemingly compiles in O(1) time, so I could imagine a large C++ or Rust codebase taking much longer to iterate simply due to compilation times.
What do you mean by C compiling in O(1)? Is that what the LLM told you?
It's a joke about how fast it compiles. whoosh
> that basically consisted of applying the same steps and rules n times.
Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?
In this case, handling all the edge cases and variants, and testing a codemod, would have taken significantly more of my time, which costs quite a bit more than the LLM.
Obviously, a deterministic tool is preferable in general, but it is not always worth bothering with for a one-off task.
I usually make the LLMs do that part for me. Instead of asking the LLM to refactor, ask it to write the codemod script that'll do the refactor, have it test that script, and even have it run it on its own. It's definitely faster and less error-prone that way for me.
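A rough sketch of what such a script could look like, assuming the refactor is a plain identifier rename in Python source. This is not a script from this thread: `rename_identifier` and the sample input are made up for illustration. It uses the standard `tokenize` module so that comments, string literals, and longer names that merely contain the old name are left alone.

```python
import io
import tokenize


def rename_identifier(source: str, old: str, new: str) -> str:
    """Rename every NAME token equal to `old`; strings and comments are untouched."""
    lines = source.splitlines(keepends=True)
    # Collect (row, col) of each matching identifier. Rows are 1-based, cols 0-based.
    hits = [
        tok.start
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.NAME and tok.string == old
    ]
    # Splice right-to-left so earlier column offsets on the same line stay valid.
    for row, col in sorted(hits, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[col + len(old):]
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical input: one definition, one call, plus look-alikes to skip.
    sample = (
        "def fetch(url):\n"
        "    # fetch is documented elsewhere\n"
        "    return 'fetch ' + url\n"
        "\n"
        "result = fetch('x') + prefetch('y')\n"
    )
    print(rename_identifier(sample, "fetch", "download"))
```

The upside over having the model edit files directly is that the script is deterministic: you review it once, run it on a small sample, and then apply it to all n sites.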
In that case, your original description of "basically consisted of applying the same steps and rules n times" was misleading.
The money people spend on things I could probably do with an emacs macro...
Your time to create that macro ain't free.
Neither is your time writing that prompt. When people talk about elaborate prompts with lots of detailed instructions, guardrails, etc., I'm kind of assuming that takes time too.
How about coding an emacs macro with your agent?
I actually don't have any representation at the moment..
Clear winner's circle. Clear objective. Clear scope.
Clear evaluation function for an objective metric if they are making progress or regressing.
Evaluation function is computed, not llmed.
Ontology of potential actions clearly specified.
Accurate inventory of the current status quo.
Clear enumeration of options from status quo towards the winner's circle.
Waypoint objectives with similarly concrete evaluations of pass/fail, or on target off target.
It's the same thing when leading a large organization to actually hit a goal. Randomness creeps in at every step that happens outside your own head, so the more constrained the options, the more likely you are to hit the target. The catch is that if you're wrong about the plan, with people you're fucked: morale will plummet. With AIs, which are so nerfed emotionally now, you just clear context and start again.
I did enjoy Sonnet 4 when they would swear randomly and become sullen or wax desperately. That would at least cause pushback against a bad plan.
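To make the "evaluation function is computed, not llmed" item from the list above concrete, here is a minimal sketch, assuming progress is measured as the number of check commands (tests, linters, type checks) that exit with status 0. The names `Score`, `evaluate`, `verdict`, and `at_waypoint` are hypothetical, and the check commands are whatever your project actually runs.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    passed: int
    total: int


def evaluate(checks: list[list[str]]) -> Score:
    """Run each check command; a check passes when its exit code is 0."""
    passed = sum(
        subprocess.run(cmd, capture_output=True).returncode == 0
        for cmd in checks
    )
    return Score(passed=passed, total=len(checks))


def verdict(before: Score, after: Score) -> str:
    """Compare two scores without asking a model for its opinion."""
    if after.passed > before.passed:
        return "progress"
    if after.passed < before.passed:
        return "regression"
    return "no change"


def at_waypoint(score: Score) -> bool:
    """A waypoint counts as reached only when every check passes."""
    return score.passed == score.total


if __name__ == "__main__":
    # Stand-in checks so the sketch runs anywhere; swap in real commands.
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    before = evaluate([ok, bad, bad])
    after = evaluate([ok, ok, bad])
    print(verdict(before, after), at_waypoint(after))
```

The agent can be told to run this after every change, but the pass/fail call comes from exit codes, not from the model describing how it thinks things went.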
Fable promised better at long running tasks.
The parent post has a goal of "..see how it will perform.."
There is nothing wrong with experimenting with something new.
If you're giving it 8 hours of stuff to create with a template (e.g. slop forking) that's not a big deal. Letting it run for 8 hours to debug a weird failure also tends to work out.
This is my fucking life at work right now. I look forward to the weekends. I've never been truly inconvenienced by shitty devs because they're often too lazy to really spam me with bad code, but now they are all free to do so. I spent so much time today writing guardrail markdown files when these people SHOULD HAVE BEEN ABLE TO REVIEW THE OUTPUT AND KNOW THAT IT WAS BAD.
It truly is the age of the 90 IQ software engineer. They've never had it better.
As if meetings weren't bad enough already, I now have to sit through an informal introduction to the model of the week and its personality characteristics and how quickly it burnt through one subscription's token allotment or whatever and the latest tweaks on the magic markdown files. Luckily I've only had a couple changes sent my way so far, which weren't much different than just getting a bug report to debug and fix myself. I will need to get into risky options gambling or something so I can go start my farm early, if it keeps going this way. Even supposing it all works correctly, I don't see how it is in any way enjoyable, satisfying, or fulfilling.
You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.