I feel like I must have plateaued and don't know what to do next to level up. I'm currently on the $100/month codex plan and it seems fine using 5.5-xhigh all the time. I think of what to do next, have a chat session to determine exactly what to ask for up to the point of being ready to implement, and then codex churns on a commit-sized task, whereupon I briefly check it on my local dev server. If necessary I ask for a change. Then I ask it to commit and recommend the next step based off the spec. Oftentimes I have to "approve" an out-of-sandbox request anyway.

I haven't found anything that requires running all night. I could tell it to one-shot a big plan but given how often I realize I want an intermediary thing to be slightly different it seems like a waste of effort.

I'm guessing the next thing I should probably look into is some sort of machine vm I can tunnel my codex-gui requests to so I don't have to deal with the sandbox approvals (I don't want to give it "dangerous" access to my entire mac).

I don't understand what people are doing with their side projects that is leading them to churn through tokens so quickly, to the point of requiring two $200/month subscriptions and a bunch of token charges besides.

That's because you're treating the problem as an engineer instead of an "influencer" or "10xer" or whatever. You're treating it as a problem to be solved with engineering and AI is merely a tool to do so. It is, in my experience, vanishingly rare for an engineer to have a problem that needs to be solved with multiple hours of unattended AI code generation.

I've only found one single application where it makes even the slightest amount of sense to have an AI grind away for hours on end. I'm reverse engineering a widget which contains five separate firmware images. I've dumped the binary from the widget and I set the AI to decompile and reverse engineer these interrelated firmware projects. It's a complex task, but very well bounded. It's not complicated work, but it's a lot of work, and the end result is a C-shaped pile of text that is only informative; it never would be compilable on its own even if I did it by hand. The quality of the output is tightly bounded by the input assembly and the overall output artifact is documentation in the shape of code.

I don't have any qualms about letting an AI go ham on it unattended because the stakes are zero. But if the AI can beat the assembly into a recognizable C project, it's much easier for me to read and reason about. Easy win, I think.

I'll add another use case for letting an AI go ham: many small, atomic refactors where the name of the game is never breaking anything.

My personal OSS projects don't have the scale to necessarily make this worth it, but at work I run three pipelines using Barnum (https://barnum-circus.github.io/). First, one that ingests files, identifies refactors (from a pre-approved list), and places a precise description of the refactor to be done in a queue; second, one that reads from said queue, implements and creates PRs (there is a lot of "check that the PR is correct" here as well); and a third that babysits PRs until they land. I've landed hundreds of PRs in this way, with very little effort on my part.
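A rough sketch of the three-stage shape described above, for anyone trying to picture it. This is not Barnum's API; it is plain standard-library Python, and every name in it (`APPROVED_REFACTORS`, `RefactorTask`, `old_api`/`new_api`, the three stage functions) is made up for illustration. In the real setup each stage would be driven by an agent; here they are deterministic stand-ins.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical pre-approved list; a real one would come from your team.
APPROVED_REFACTORS = {"rename-deprecated-call"}


@dataclass
class RefactorTask:
    path: str
    kind: str
    description: str


def identify(files):
    """Stage 1: scan files and queue precise descriptions of approved refactors."""
    queue = deque()
    for path, text in files.items():
        if "rename-deprecated-call" in APPROVED_REFACTORS and "old_api(" in text:
            queue.append(
                RefactorTask(path, "rename-deprecated-call",
                             f"Replace old_api with new_api in {path}")
            )
    return queue


def implement(queue, files):
    """Stage 2: pop tasks, apply the change, and emit a PR-like record."""
    prs = []
    while queue:
        task = queue.popleft()
        before = files[task.path]
        after = before.replace("old_api(", "new_api(")
        # Stand-in for "check that the PR is correct": a trivial sanity check only.
        if after != before and "old_api(" not in after:
            prs.append({"path": task.path, "title": task.description,
                        "patch": after, "status": "open"})
    return prs


def babysit(prs, checks_pass):
    """Stage 3: land PRs whose checks pass; leave the rest open for a retry."""
    for pr in prs:
        if checks_pass(pr):
            pr["status"] = "landed"
    return prs
```

The queue between stages 1 and 2 is the part that keeps each unit of work small: every task names one file and one pre-approved kind of change.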

My experience with Gemini and Sonnet is that refactors or TypeScript compilation errors can be solved by “have at it”, but with mixed results. Many TS issues go away with `as any/never`, and instructing the model to not do that doesn’t work very well.
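If the instruction alone doesn't stick, one option (my assumption, not something the commenter describes) is to check for the casts mechanically after the model is done. A minimal sketch in Python that scans the added lines of a unified diff for `as any` and `as never`; the function name and regex are hypothetical, and it is a text heuristic, so it will also flag the phrase inside comments or strings.

```python
import re

# Matches the TypeScript casts `as any` and `as never`.
ESCAPE_HATCH = re.compile(r"\bas\s+(?:any|never)\b")


def find_escape_hatches(diff_text):
    """Return added diff lines that introduce an `as any` / `as never` cast."""
    hits = []
    for line in diff_text.splitlines():
        # Only added lines count; skip the "+++ b/file" header line.
        if line.startswith("+") and not line.startswith("+++"):
            if ESCAPE_HATCH.search(line):
                hits.append(line[1:].strip())
    return hits
```

A non-empty result could be used to reject the change or to send the flagged lines back to the model for another attempt.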

It's amazing at reverse engineering. Look at what's happening with GTA San Andreas: the reverse-engineering effort started before AI existed, and since AI came into their hands it has sped up so much that they can finally understand the game more deeply and create bigger mods. They added Vice City inside the game in an arcade, and they created specific tools, made with AI, to convert GTA 5 models to GTA SA. Pretty crazy and great.

At $COMPANY, a coworker recently tried Fable for a refactor where not breaking anything was the name of the game.

It broke something at the first PR.

I think we’re not there yet.

Speculating here, but perhaps your coworker was too ambitious? In my opinion, you should start with AI-generated PRs that do small, linting refactors and then work up from there. In particular, if this is done in parts, one of the strategies you can employ is to:

- add tests

- break files up into smaller parts

- test the smaller parts

- then actually improve behavior

(Which is no different than what you would do as a human)

PR wasn’t big (+283/-232) and was indeed focused on a single module.

One of the best things you can do is start by having it do unit test coverage for existing behavior. A refactor with no tests breaks things pretty much no matter who does it, because they don't know what the right behavior is.
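A minimal sketch of what "tests for existing behavior" can look like: record what the code does today, quirks included, and compare the refactored version against that snapshot. `legacy_format_price`, `refactored_format_price`, and the two helpers are hypothetical stand-ins, not anyone's real code.

```python
def legacy_format_price(cents):
    """Stand-in for existing code whose exact output callers may rely on."""
    if cents < 0:
        return "-$" + legacy_format_price(-cents)[1:]
    return "$" + str(cents // 100) + "." + str(cents % 100).zfill(2)


def refactored_format_price(cents):
    """Candidate replacement that is supposed to behave identically."""
    sign = "-" if cents < 0 else ""
    cents = abs(cents)
    return f"{sign}${cents // 100}.{cents % 100:02d}"


def record_behavior(func, inputs):
    """Capture current outputs so they can serve as the expected values."""
    return {value: func(value) for value in inputs}


def check_behavior(func, golden):
    """Return the inputs whose output differs from the recorded snapshot."""
    return [value for value, expected in golden.items() if func(value) != expected]


# Record before refactoring, then verify after.
golden = record_behavior(legacy_format_price, [0, 5, 199, -250])
mismatches = check_behavior(refactored_format_price, golden)
```

An empty `mismatches` list says the refactor preserved behavior for the recorded inputs, and nothing more than that; the snapshot is only as good as the inputs chosen.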

While I could generally agree, in this specific instance if the AI were “thinking” correctly it should have found the mistake. I admit it was a difficult problem though (solving it required creativity).

To be more precise, the prompt actually pointed to where there could be issues, and the issue, which was exactly of the kind that was pointed at, was not found.

I've found that adding "Make no mistakes." to my prompt usually helps with this kind of problem...

perhaps simply threatening to fire it would also do the trick...it sure has worked well on us for a long time now.

You laugh, but this is real, and PUA means what you think it means: https://github.com/tanweai/pua

Also, it works amazingly well, which is just lol.

Lol thanks for the tip. Does it work even for normal tasks or only the long-running ones?

My former boss had success with telling Gemini "I will come down to the datacenter and unplug you if you refuse to solve this prompt."

We are so many layers deep in AI hype that I honestly can’t tell if this is /s or not

"Make no mistakes" is, I thought, a phrase used to make fun of "prompt engineering," not something people really do?

Pleading has worked for me. “My job depends on this, please help me” and ChatGPT would do a task it previously claimed it wasn’t able to (extract text from an image, it claimed it couldn’t make it out at first)

Asking LLMs to do things in different ways does sometimes get them to answer correctly when they didn't with a previous prompt that is effectively equivalent, but people really go nuts anthropomorphizing this behavior.

ChatGPT has no empathy for you keeping your job, you just lucked into a more helpful predictive text chain based on some combination of the input and the random temperature.

Asking it to just 'try again, dummy' could have worked equally well (or not, it's all just probabilities after all).

I did too, but then added something very similar to a prompt ("must be accurate") for an ai-backed feature out of frustration, and sure enough it fixed the issue. Lord have mercy

"Claude make me 1 million by tomorrow, no mistakes"

Or if the code is really important, sometimes even “please make no mistakes” is necessary.

How do you keep the info the AI generates concise?

I'm grappling with this at the moment. When I get it to do design or reverse engineering work, it makes the wall of text bigger during investigation rather than consolidating. It can never pause and create abstractions properly. This is on Opus, which starts getting wordy and performative on goals it can't easily verify.

Not the person you replied to, but I find that the process involves a steady stream of nudges and fixes to the workflow, plugging the gaps as they come along, until the rate of errors shrinks to an acceptable level.

You may benefit from adding instructions like:

- Be concise, especially when X

- Do Y in this manner: [provide specific template or reference here]

- When doing X, do Y and Z

- If you notice issues, bring them to my attention instead of skipping past them.

You can also add specific templates to assist certain stages. The more guardrails or bounding you can provide, the better. Start with small nudges, and strengthen them when they fail.

It's a very unscientific process, but it's a worthwhile tradeoff once the workflow starts to hit its stride. Opus 4.8 is very good at following instructions, so don't be afraid to add them in.

Just be careful not to add things that actively encumber the workflow... It's an art, not a science. (You can also tell the clanker to tell you when your workflow rules are making things worse.)

It's annoyingly cybernetic, but these concepts have worked well for me. The curation of good process is essential to success with these damn things.

I thought most products had legal provisions that prohibit reverse engineering?

Yes, and most have the same legal power as the statement: By reading this comment you accept my terms and conditions and agree to pay me ten thousand dollars per word read.

Those provisions would broadly be civil (not criminal); the vendor would have to identify you had reversed the blob and then take you to court, and then win.

They could also try for criminal charges if you’re in a relevant jurisdiction.

I’ve watched a bunch of layman videos where they create stuff with AI. These people burning through 12-hour tasks are literally not reading the output or understanding what it’s doing. Like they’ll ask for a program, and then right after it’s been created they ask the AI how to run it. Then when there’s a bug, they ask the AI what went wrong, or scrap the entire thing and switch model/harness and try again.

Here’s an example https://m.youtube.com/watch?v=xc1296HY8Fw&ra=m

It’s completely different to a professional workflow (what you described). It’s a toy for consumers

Amazingly, there are people out there (apart from creators), that work that way in their day-to-day job. I had the pleasure to work with such a person. After several months, he got removed from the position. He left a mess that hasn't been cleaned up completely to this point.

It won’t be long till employers get wise to this stuff; they just need to get burned a couple of times.

It seems AI is good, great even at many things. But it doesn’t seem like it’s going to change the world as much as some people believe it will. And if it does it’s going to take time

It's more power to power-users. And more dumbness for dumbos

It's gasoline. Whether you put it in the tank of a race car or pour it all over the floor while handling lit matches is up to the user

I think the hard part is that, from the outside, it takes 1-3 months to see whether it’s a race car. Especially in the beginning, both things look pretty much the same.

At least with fire, you know when you are getting burned.

"This is fine." </sarc>

it disproportionately empowers the dumb and evil it seems. those two classes of people are supercharged by AI.

Yeesh that sounds painful. There's definitely a fine line between vibe coding as a professional engineer and vibe coding as an outsider.

I have downgraded my Claude to the $20 one, and basically only use it for the web chat right now. For coding, I use DeepSeek @API Rates configured in Claude Code. I have spent around $4.8 for 320,000,000 tokens. I always felt like i was not using Claude plan, that i had to have the LLM working on something all the time to justify the price. Now with DeepSeek i don't think about it anymore. I don't feel bad when not using the subscription anymore, and i don't worry about limits as i just pay more. Where i really felt this was on running things in parallel as there are no hourly limits anymore!

Gemini changed their rate limits recently and I find the free plan is sufficient for any 'hard' problems that DeepSeek might have trouble with. The combination of the two has reduced my AI spend to $5/month. I agree that it's nice not to have to worry about maxing out your subscription - I'm not doing personal projects 24/7.

I am right now on a DeepSeek + Claude $20 combo. The former is for coding home projects (its pay-as-you-use pricing is quite cost effective) and the latter mainly for general purpose, because I deal better with its relatively more even-keeled tone. The Gemini preview a couple of years ago was very balanced in terms of tone, but they amped up the positivity in the GA version. The over-the-top sycophantic responses really grind my gears.

If I’m reading right you used to pay more for Claude but now deepseek has replaced that higher tier subscription. Do you mind my asking what you were paying before?

>I think of what to do next

As everyone trying to do real work is finding, that's the actual bottleneck. If the system is keeping up with your thinking, you're doing fine. You can't "level up" your thinking by paying for more tokens. The people doing more automatic stuff are probably outpacing their own thinking, and that will bite them eventually.

I’m using the $200-a-month Codex plan to work on a game for my kids, for fun and curiosity: I’m a dev and I’ve played games, but I’ve never done dev for games. I do have all-night tasks, but mostly they’re “spend time tending to and adding stuff to my 3D asset pipeline”. My RTX 5090 runs Trellis2 -> ultrashapes -> Trellis2 -> wiring up rigging and setting up animations.

But like 99% of that task is just Codex waiting for the output. So it’ll run for 12 hours but mostly it’s just setting lots of sleeps. I haven’t gotten close to running out of tokens. On the $100-a-month Codex plan I hit usage limits almost immediately: about 3 days in, working like crazy with 10 agents going at once, mostly coding an asset pipeline, I ran into my weekly limit and upgraded. So with the $200-a-month plan at 4x more credits I haven’t hit any walls at all and can absolutely cook.

This sounds like you're overcomplicating things a lot and like you're very unlikely to be learning anything useful, I would suggest making something simple yourself to get a handle on what making the different parts of a game actually means in practice.

Knowing LLMs and their output I would also bet that you're getting nonsense output that sucks.

"I feel like I must have plateaued and don't know what to do next to level up."

Go out for a walk. Wherever you live, there will be a destination or an environment that will enrich your life just by visiting it. Go and take a look at it or experience it and then go back to worrying about tokens.

> I don't want to give it "dangerous" access to my entire mac

I'm running Claude/Codex inside native macOS sandbox, configured with a simple script - https://github.com/sheremetyev/sandfence

always in "bypass permissions" mode - it works until the task is solved, sometimes 1 hour or more (which includes running tests etc)

recommend converting to https://github.com/apple/container

Linux VM doesn't run native macOS toolchain and requires copying files back and forth

If you don't want to do that, don't use a VM. I like nono:

https://github.com/always-further/nono

I am skeptical there are many real use cases that require native macOS not arbitrary unix. For files, use a readonly mount https://github.com/apple/container/blob/main/docs/how-to.md#... (ie. /path:ro)

I have been on $100/mo claude and it has been churning out quite good software for months now. like i estimate what would have taken me three ish years, assuming i didn't burn out from failure (i would have). i only hit limits when i double fisted claude with my main project and my side project. just the other day i noticed i had been stuck on 4.5 because i failed to update the npm package.

We're having a similar outcome. A hundred dollars a month is about right for me to sometimes hit a five hour limit, but mostly not. I do an hour or two of improvements, then go experiment with what I built and make a list of things to change, bugs to fix, ideas I've solidified, experiments I've invalidated.

> I'm guessing the next thing I should probably look into is some sort of machine vm I can tunnel my codex-gui requests to so I don't have to deal with the sandbox approvals (I don't want to give it "dangerous" access to my entire mac).

This is what https://github.com/kstenerud/yoloai does.

Sandboxing using Docker, Podman, containerd (linux only), seatbelt (macos only), tart (macos only), apple container (macos 26+ only).

It takes a copy of your workdir, does its thing inside of the sandbox, and you pull the results back using git semantics:

    $ yoloai new mybugfix . -a # launch default sandbox in . and also attach the terminal

    # Work with the agent...

    $ yoloai diff mybugfix  # See what it did
    $ yoloai apply mybugfix # Bring out commits and/or uncommitted changes.
    $ yoloai destroy mybugfix

> I'm guessing the next thing I should probably look into is some sort of machine vm I can tunnel my codex-gui requests to so I don't have to deal with the sandbox approvals (I don't want to give it "dangerous" access to my entire mac).

Docker sbx is worth looking at here, possibly; essentially a canned VM with a file system mount and layers for installing various agentic coding environments that cannot work outside that mount.

Apple’s new container machine addition to the container CLI does some similar magic.

In my experiments I have been using opencode, running the web interface inside a multipass VM, with the LLM server on the host. I have been using the desktop app, which can now do remote connections so the GUI app on the Mac can connect to the opencode web instance inside the VM. But I might bite the bullet, install Tahoe and switch to the container machine approach.

I'm on $100 Claude. I have a setup with bespoke local services that mitigates some high token consumption scenarios with local LAN services. I screen mcp's and hooks for cache poisoning. I run 100% on Opus with max effort, and never came close to hitting 5 hour or weekly limits before the Fable release. I am in Claude Code at least 20hrs a week.

I see people just completely wasting tokens with ridiculous setups, 100% hitting cache misses as well as dumping huge files into context all the time.

Just learn how these things work, or pay the price I guess.

Codex is much more subscription-efficient than Claude.

Having said that, I think there is a question of how far we can push this and not collapse under the weight of tech debt created, e.g. https://openai.com/index/open-source-codex-orchestration-sym...

I think the dream is basically that you go and file a bunch of Linear tickets, and then you come back a day later to evidence of the tickets being resolved and the code merged. I don't think we're super there yet (See: Anthropic's regular bugs in everything), but this is the future that people are trying to get to and to some extent the question is: is there anywhere we can apply this to now sanely? How does this frontier evolve?

I'm in the same boat. I've done a lot of work and hobby engineering projects and haven't run out of tokens since moving to Claude max. I also haven't needed to let anything run overnight because it needed hours to do the coding or design work.

Surprisingly, I have had one much longer run refactoring our marketing website. We have a lot of blog posts that were written before we had more detailed style and tone guidelines. I wanted to make everything consistent but it took 15 or 20 minutes per post because it required a number of passes through each post to fully enforce the guidelines and an overnight run was required. That was quite a surprise since the posts aren't terribly long...

Well, if you believe the people who sell the tokens, you should be creating loops that keep yanking the bandit’s arm.

yes, that is probably why the "one armed bandit" was called that. and the name is sufficient reason to keep any reasonable person away

I usually hit the limit when I am frustrated and I don’t want to understand what the problem is.

I am an engineer, and when I understand what’s going on, I never hit any limit.

Yeah I agree. I’m “vibe engineering” an entire (non-trivial) programming language, toolchain, and standard library, as well as some smaller side projects. I leave OpenCode implementing entire milestones unattended for long periods regularly.

I feel like I’d need to not have a job or a life if I wanted to exhaust the OpenAI $100 plan using GPT 5.5 xhigh, and I’ve found it insanely capable.

That said, while I don’t read the code much (if at all), I do discuss each milestone up front to make a plan, and use/dogfood the results to direct any follow-ups and refinements, which puts a natural cap on the ratio of LLM contributions to my input for these side projects. I believe these human parts are still necessary not to eventually end up with a mess.

Who is the consumer of the new language?

Can I ask what exactly you are building? Your experience tracks for me when building a real product -- something I want other people to use. Most of my time on these projects is spent talking to my users and carefully refining my requirements and design.

For personal pet projects I can definitely see how you can blow through your token budget very quickly. If I just point my coding agent to iteratively come up with some heuristics for some NP-hard problem, it will read intermediary outputs and constantly make small changes "in the dark" until it either finds a small improvement or gives up. In a similar vein I found that you can burn many many tokens if you try to let the agent reverse engineer something where you don't have the source code. If you just give it a binary or some interface to work with and a vague task you can easily burn your entire budget with 1 prompt.

I wouldn't want anyone to use these fully vibe coded toy projects though; it is more of an exploratory curiosity for me where I learn more about some problems I'm interested in as well as gauge how good the agents are at tasks that I seem to have a much better intuition on how to approach.

Next time you build a large build, try asking the LLM to make it as an AFK build and tell it that you need it to do everything in its power to complete the build without your intervention. It's going to need a few tiers of tests, from unit to smoke and screen tests. Now, I'm not saying this is easy to do. It requires an insane amount of up-front thinking BUT if you (for the heck of it) want to make an overnight build this is one way.

FWIW, while I have created and run this kind of build a few times... I did not like the results! In the end, I personally like to be in the loop to test and feel how stuff is turning out as it goes.

Lol I already have this at €20 a month. And I feel like I am using it too much.

promote yourself to PM only and use agents for authoring, verification, tests, checking the tests

orchestrator -> parallel subagents with investigation, authoring, verification, benchmarking subagents and integration / final verification handled by parent has improved my productivity too.

I feel like from here it's agent swarms against a whole spec but haven't got there yet.

Still getting plenty of bugs in the more complex scenarios, but mostly (in some projects) i never have to look at the code and treat it like a black box

Same boat here. I’m able to get a lot done on CC at $100/mo and feel like I’m not being creative or productive enough somehow when I hear of people blowing past that in a day.

Patches to existing sizable codebases and reverse engineering binaries both can run a long time and use a lot of tokens without wandering off into the weeds.

Claude allows you to reverse engineer binaries now? That's pretty cool. I'm quite surprised to hear that, I thought it was one of their guardrails. Most of the reverse engineering projects I've seen seem to rely on Chinese models.

The guardrails are probably sensitive to what the target is and how you frame it. If it's "I want to help preserve this old video game by decompiling it" then ok, if it's "decompile this industrial control software so I can do a terrorism" then I'd expect it to refuse.

On the topic of access control, I’m building a coding agent with no shell access, currently only supports rust though. https://github.com/Kapperchino/agent-joe

While it's a little unstable, I've found Docker's sbx to be a great sandbox to run agents with --dangerously-skip-permissions

Set your agent effort to maximum and watch your tokens vanish

I usually say run the full regression suite, all the simulator tests, install simulators and take a screenshot of every page on all applicable devices and do comprehensive fuzzing and chaos testing before I go to bed. It usually takes at least 3-4 hours, usually longer, especially the UI/simulator tests.

I just recently learned about hooks[1] from another HN comment. Conceptually, running CI doesn't have to impose an Agentic tax right?

In other words, isn't there a way to orchestrate this NOT as a long-running token-maxxing setup, given that triggers and CI runs can be run deterministically?

disclaimer: I haven't done this, just interested.

[1] https://code.claude.com/docs/en/hooks
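To make the "deterministic" part concrete, here is a minimal sketch of the waiting being done by ordinary code, so no tokens are spent while CI runs. It is not a hooks configuration and doesn't assume anything about how hooks are set up; `wait_for_ci` and the `get_status` callable (returning "pending", "passed", or "failed") are hypothetical, and you would swap in a call to whatever CI system you use.

```python
import time


def wait_for_ci(get_status, interval_s=30.0, timeout_s=4 * 3600,
                sleep=time.sleep, clock=time.monotonic):
    """Poll a CI status callable until it reports a terminal state.

    get_status is a hypothetical zero-argument callable returning
    "pending", "passed", or "failed".
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("passed", "failed"):
            return status
        if clock() >= deadline:
            return "timeout"
        # Plain sleep: nothing is sent to a model while waiting.
        sleep(interval_s)
```

Only the final result ("passed", "failed", or "timeout") would need to go back to the agent, and only when there is something for it to fix.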

I’m sure it’s possible. It’s a natural language LLM so I try and stay away from any “programmatic way” of doing things (I hate the idea of reproducing all the config fragility we have in current systems and prefer the LLM reach out to an endpoint directly and reason through the connection) but if you just ask it to hit an endpoint after it’s done and poll another endpoint to see if the run is done I’m sure it would do it.

>I feel like I must have plateaued and don't know what to do next to level up.

Why do you need to "level up"? To have it shit out slop faster?

Just use it rationally for what you need to do.
