Author here. A few people are arguing against a stronger claim than the repo is meant to make. Also, this was very much intended as a joke, not research-level commentary.

This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.

What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. Since it's that surrounding prose that gets "cavemanned", the code itself hasn't been affected by the skill at all :)

I'm also surprised to hear so little faith in RL. Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot "force" a model to degrade immensely.

The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.

Also, yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
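
For reference, a skill is a folder containing a SKILL.md whose YAML frontmatter (name, description) is the metadata that gets preloaded; the body below it only enters context once the skill triggers. A minimal sketch of what a caveman skill could look like (not the repo's actual file):

```markdown
---
name: caveman
description: Make visible prose terse. Caveman talk. Leave code alone.
---
Talk like caveman. Drop articles, filler, preamble. Never touch code blocks.
```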

So the real eval is end-to-end:

- total input tokens
- total output tokens
- latency
- quality/task success
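
For a rough sense of what that measurement could look like, here's a minimal A/B harness sketch assuming the official Anthropic Python SDK. The `CAVEMAN_PROMPT` system string is a hypothetical stand-in for the skill's instructions, since a real skill loads through the agent harness rather than a raw API call:

```python
# Minimal end-to-end A/B sketch, assuming the official Anthropic Python SDK.
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CAVEMAN_PROMPT = "Talk like caveman. Short words. No filler. Code stays normal."

def run(task, system=None):
    """Send one task and record token usage plus wall-clock latency."""
    start = time.monotonic()
    kwargs = {"system": system} if system else {}
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # substitute any current model id
        max_tokens=2048,
        messages=[{"role": "user", "content": task}],
        **kwargs,
    )
    return {
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
        "latency_s": time.monotonic() - start,
        "text": "".join(b.text for b in msg.content if b.type == "text"),
    }

task = "Explain how a bloom filter works."
for label, r in [("baseline", run(task)), ("caveman", run(task, CAVEMAN_PROMPT))]:
    print(label, r["input_tokens"], r["output_tokens"], f"{r['latency_s']:.2f}s")
# Quality/task success still needs a grader (human or LLM judge) over r["text"].
```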

There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)

So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.

Sounds reasonable to me. I think this thread is just the way online discourse tends to go. Actually it’s probably better than average, but still sometimes disappointing.

i played with this a bit the other night and ironically i think everyone should give it a shot as an alternative mode they might sometimes switch into. not to save tokens, but to... see things in a different light.

its kind of great for the "eli5", not because it's any more right or wrong, but sometimes the caveman framing presents something to me in a way that's almost like... really clear and simple. it feels like it cuts through bullshit just a smidge. seeing something framed by a caveman has on a couple of occasions peeled back a layer i didnt see before.

it is, for whatever reason, somehow useful to me, the human. maybe seeing it laid out in caveman bulletpoints gives you this weird brevity that processes a little differently. if you layer in caveman talk about caves, tribes, etc it has sort of a primal, survivalist way of framing things, which can oddly enough help me process an understanding.

plus it makes me laugh. which keeps me in a good mood.

Interesting point! Based on what you said, in a way caveman does save your human brain tokens. Grammar rules evolve in a particular environment to reduce ambiguity, and I think we are all familiar enough with caveman for it to work as a common register. For example, word order carries the semantics in modern English, so "The dog bit the grandma" and "Dog bit grandma" mean the same thing. In languages where cases carry the semantics (like German), word order alone does not resolve ambiguity. Articles exist in English due to its Germanic roots.

Now I want to try programming in pidgin English

A pidgin is just a simplified form of language that hasn't evolved into its own new language yet. There are many English pidgins.

It's much easier to talk about how something is deficient/untested than to do the testing yourself.

This from the same site that complains so much about the replication crisis in science...

If you want to benchmark, consider this https://github.com/adam-s/testing-claude-agent

Translation:

It joke. No yell at me. It kind of work?

Thank. Too much word, me try read but no more tokens.

> There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality,

Anecdote: i discussed that with an LLM once and it explained to me that LLMs tend to respond to terse questions with terse answers because that's what humans (i.e. their training data) tend to do. Similarly, it explained to me that polite requests tend to lead to LLM responses with _more_ information than a response strictly requires because (again) that's what their training data suggests is correct (i.e. because that's how humans tend to respond).

TL;DR: how they are asked questions influences how they respond, even if the facts of the differing responses don't materially differ.
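
fwiw that claim is cheap to spot-check. A quick sketch (again assuming the Anthropic Python SDK) that sends the same question phrased tersely and politely and compares output token counts:

```python
# Spot-check "terse question -> terse answer", assuming the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()
variants = {
    "terse": "bloom filter false positive rate?",
    "polite": ("Hi! Could you please explain what determines a bloom filter's "
               "false positive rate? Thank you so much!"),
}
for label, prompt in variants.items():
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # substitute any current model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    print(label, msg.usage.output_tokens, "output tokens")
# One sample per phrasing proves nothing; average over many questions and runs.
```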

(Edit: Seriously, i do not understand the continued down-voting of completely topical responses. It's gotten so bad i have little choice but to assume it's a personal vendetta.)

LLMs don't understand what they are doing and can't explain it to you; they're just creating a reasonable-sounding response.

But that response is grounded in the training data they've seen, so it's not entirely unreasonable to think their answer might provide actual insights, not just statistical parroting.

What do you mean? It is grounded in the text it was fed. The reason it said that is that humans have said that, or something similar to it, not because it analyzed a lot of LLM information and thought up that answer itself.

LLMs can "think", but that requires a lot of tokens; all quick answers are just human answers, or answers it was fed, with some basic pattern matching / interpolation on top.

There's nothing "basic" about the several months of training used to create a frontier model.

That's a very pedantic response because either way the model cannot see or analyze the training data when it responds.

They have some ability; also, you could give them tools to do it.

https://www.anthropic.com/research/introspection

> i discussed that with an LLM once and it explained to me that LLMs...

Do you have any idea how dumb this sounds?

Do you? I have the same knee-jerk reaction, but if you think about it for more than 2 seconds, LLMs at this point have, through training, read much more research about LLMs than any human, so actually, it's not a dumb thing to do. It may not be very current, though.

> read much more research about LLMs than any human

How long an LLM's response is will vary completely with the system prompt and the model itself. You can read all of the "LLM research" in the world and it's not going to give you a correct generalized answer about this topic. It's not like this is some inherent property of LLMs.

FWIW, they also wrote down something that's so obvious you don't have to know much about LLMs to know it's true. Even the "stochastic parrot" / "glorified Markov chain" / "regurgitation machine" camps should be on the same page here - LLMs are trained on human communication, and in human communication, longer queries, good manners and correct grammar are associated with longer, more correct, higher-quality responses; correspondingly, shitposting is associated with shitposts in reply.

That much is, again, obvious. My previous comment was addressing your ridiculing the notion of discussing LLMs with LLMs, which was a fair reaction back in the GPT-3.5 era, but not so today.

And yet what you are saying just isn't true in my experience.

I use speech-to-text with Claude Code and other LLMs, often with terrible grammar and lots of typos and stuff, and it never affects the output. If I went by what you are saying, the code it outputs should come out sloppier? Also, the length of a response depends entirely on what I'm using: ChatGPT always gives me a long response no matter what I ask it, while the Claude app always gives short responses unless I specifically ask for something longer. That comes down to how they are instructed, not some inherent property of LLMs.

this continual down-voting is not a personal thing for sure. perhaps there are crawlers that pretend to be more human, or fully automated llm commenters which also randomly downvote.

Instead of conspiracy theories don't you think it's just likely that it was people downvoting a stupid comment?

> Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.

The rest of what you're saying sounds fine, but that remark seems confused to me.

Prefix your prompt with "be a moron that does everything wrong and only superficially look like you're doing it correctly. make constant errors." Of course you can degrade the performance; the question is whether any particular "output styling" actually does, and to what extent.

I think they mean performance on the same, rational task.

Measuring "degradation" on a nonsense task like the one you gave would be difficult.

Their point (and it's a good one) is that there are non-obvious analogues to the obvious case of just telling it to do the task terribly. There is no "best" way to specify a task that you can label as "rational", all others be damned. Even if one is found empirically, it changes from model to model and harness to harness.

To clarify, consider this gradation:

> Do task X extremely well

> Do task X poorly

> Do task X or else Y will happen

> Do task X and you get a trillion dollars

> Do task X and talk like a caveman

Do you see the problem? "Do task X" also cannot be a solid baseline, because there are any number of ways to specify the task itself, and each carries its own implicit bias toward the track the output takes.

The argument that OP makes is that RL prevents degradation... So this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, and prompting does affect the output (how can it not?), _and they are even claiming their specific prompting does so, too_! The claim is nonsense on its face.

If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.

If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.

If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies anyway - they studied "be concise", not a skill full of caveman styling rules). Parent is right.
