Something I’ve noticed (I’m using a Claude subscription, so no refunds, but the same applies to usage windows) is that AI sometimes makes mistakes, so if something is important I tell Claude Code to spin up a couple of sub-agents and verify the information. Often there will be a mistake, and it gets rectified.

It feels unfair I have to pay (or lose some usage) for this.

Interested in other people’s thoughts.

I have been thinking about this kind of thing recently. I've got a hobby project that generates some AI content for the user, and I'm trying to figure out the fairest way to deal with the output being just plain bad. I haven't run into this case myself in testing, after many quality passes to make the generation robust, but I have no doubt that at scale there would be junk output at some point. Users would likely be non-technical and pay per generation, not on a subscription model. So I would like to find a way to:

- Define some threshold for bad output

- Detect when a piece of output meets that threshold (vs just maybe not being what the user expected, which in my case is just fine)

- Refund the user credits so they can generate again

Text output is relatively easy to evaluate against some base threshold of quality during the generation process, but my final output is not text... so it becomes harder.

Each failed generation would be very disruptive to the user on its own (in the scope of the app's purpose), so I'm also considering offering them an extra discount on their next purchase (in addition to the credit refund).

Do I get users to report generations they consider bad and then review them somehow? Do I try to auto-detect bad output before it is delivered to the user? Probably a mix of all of the above... while attempting to mitigate the potential for abuse (people making dummy generations and then reporting them 'just to try it out', or trying to game the system to get multiple free generations). Maybe I'd have to have some sort of time window for reporting a junk generation, and a max "use" count that flags whether the user actually benefited from the output before reporting it...

I guess this turned into a bit of a brain dump.

Do you pay your employer when you introduce bugs? I think you're lucky if you get usable output which you don't consider a mistake. Also, you might be mistaken if you think that you pay for a deterministic service.

edit: typo

No, I don't think so. You've paid for a service that will run an AI model given some prompt. There have been zero guarantees made that it will actually solve your problem.

As others have stated too, how do you define what an incorrect output is?

I think it should, because sometimes the mistake happens on its own, not because of anything I did. It wastes a lot of unnecessary tokens that way.

I think it would be expensive to check. For a coding task any reviewer would need to understand programming (these people aren't cheap), the domain context, cultural differences (e.g. American "cookie" vs British "biscuit"), and make a determination.

If the AI companies just paid all of that out of the goodness of their pocketbooks I'd be fine with it, but in reality I think they'd just pass on the costs, the same way basically every business passes on spoilage, theft, return rates, etc. So I think the value would be risk mitigation rather than cost savings (as in, you'd know that if you pay for $10 worth of tokens, you'll get $10 worth of good tokens, but the individual token price would need to account for all the tokens the company doesn't get paid for).

Interesting, you have just identified a potential market distinction. First we need a group (à la Consumer Reports) to evaluate different services. Then different services would be motivated to perform the sub-agent verification automatically as a competitive advantage.

Just parse responses for "sorry about that"!
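Taking the joke half-seriously, the crudest possible detector really is just a regex over the response text. The phrase list below is entirely made up for illustration:

```python
import re

# Mostly tongue-in-cheek apology detector; the phrase list is invented.
APOLOGY_PATTERNS = [
    r"sorry about that",
    r"you('re| are) (absolutely )?right",
    r"i apologi[sz]e",
]
_APOLOGY_RE = re.compile("|".join(APOLOGY_PATTERNS), re.IGNORECASE)

def looks_like_a_walkback(response: str) -> bool:
    """True if the model appears to be apologizing for a prior mistake."""
    return _APOLOGY_RE.search(response) is not None
```

Of course, this only catches mistakes the model itself admits to, which is a small and unrepresentative subset.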

No. You should know that no man or machine writes bug-free or mistake-free code. You are paying for tokens (electricity and cooling), not for what those tokens represent. How would you even define mistakes in non-code tasks?

Would you give money back to your employer when you make a mistake?

The hard part is defining what a mistake is. If you ask Claude to write code and it works, but you don't like the approach, is that a mistake? If it generates a UI with the wrong colors, but everything else is correct, does that count? The subjectivity alone makes it nearly impossible to implement a refund system properly.

I think so. IMO, at this point, AI systems should also be using expert/rule systems to validate their output to avoid bad/obvious mistakes. In ambiguous/complex cases, I don't think so, but in certain circumstances, the output is ridiculous and could have been caught by a relatively simple expert system/rules engine, likely something the AI itself could have helped build.

It's interesting on the grounds of aligning incentives.

It's not interesting, because it suggests humans are still in the loop of some slow-cycle improvements. That'd never get by any board. In fact, the selection of model modes implies it's your responsibility, so that meal was scraped into your flowerpot years ago.

I'd say fat chance.

You've found a way to catch the mistakes before they happen. I would think that keeps usage down later, compared to what it would cost if the mistake had actually been implemented.

Can't you just... tell it not to make mistakes? :-)

"Write me a function you found on Stack Overflow; make sure you check the comments on that function and ensure you're using their corrections. Also, check the 2nd and 3rd highest answers, in case the top one is using the old API and the other answers actually solve the problem I'm asking about. MAKE NO MISTAKES"

I think they are already doing it on a case-by-case basis, but the support experience is worse.

Anthropic definitely has some paying back to do; it chewed through £50 of extra credits like it was nothing.

I mean “mistakes” can be hard to define. IMHO there is an area of responsibility between the LLM, the LLM user, and the code itself.

Did it make a mistake because it didn’t follow instructions properly or hallucinated some content?

Did it make a mistake because the prompt was unclear/open to interpretation or plain wrong?

Did it make a mistake because it lacked some context? Or too much context and it starts getting confused?

Is not handling edge cases automatically when that was not requested a mistake?

I am not just trying to defend LLMs; in many cases they make obvious mistakes and just don’t follow my arguably clear instructions properly. But sometimes it is not so clear cut. Maybe I didn’t link a relevant file (you can argue it could have looked for it), maybe my prompt just wasn’t that clear, etc.

Probably not. But they should be more explicit about the usage, not just "you've used up 5%."

LLMs hallucinate - This is known.

If you choose to use them, you go in knowing they need help to be accurate. You clearly know how to use the tools to reach the accuracy you desire, but asking for that usage to be free seems to be based on a false premise. There has never been an expectation of accuracy in the first place when it comes to LLM output.
