don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with new recommendations:
- telling the model to be persistent (+20%)
- don't self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format), sketch below
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
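To make the formatting point concrete, here is the same document chunk packed three ways. The strings are illustrative only, not quoted from the guide; the gist is that XML-ish delimiters and the pipe-delimited format from arxiv 2406.13121 reportedly beat raw JSON as context packaging.

    # Illustrative only -- not copied from the guide.
    doc_xml  = '<doc id=1 title="The Fox">The quick brown fox jumps over the lazy dog</doc>'
    doc_gdm  = 'ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog'
    doc_json = '{"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"}'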
As an aside, one of the worst aspects of the rise of LLMs, for me, has been the wholesale replacement of engineering with trial-and-error hand-waving. Try this, or maybe that, and maybe you'll see a +5% improvement. Why? Who knows.
It's just not how I like to work.
I think trial-and-error hand-waving isn't all that far from experimentation.
As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.
No one knew how to best use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and gave similar "maybe do this and maybe see x% improvement?" advice. There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.
Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.
Yes, it was the generation of the X360 and PS3. X360 was 3 core and the PS3 was 1+7 core (sort of a big.little setup).
Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67ms / 8.33ms budget and rendering tied to world state, it was just really hard to not tie everything into each other.
Even today you'll usually only see 2-4 cores actually getting significant load.
Performance optimization is different, because there's still some kind of a baseline truth. Everyone knows what FPS is, and +5% FPS is +5% FPS. Even the tricky cases have some kind of boundary (+5% FPS on this hardware but -10% on this other hardware, +2% on scenes meeting these conditions but -3% otherwise, etc).
Meanwhile, nobody can agree on what a "good" LLM is, let alone how to measure it.
There probably was still a structured way to test this through cross-hatching, but yeah, blind guessing might take longer and arrive at the same solution.
I feel like this is a common pattern with people who work in STEM. As someone who is used to working with formal proofs, equations, math, having a startup taught me how to rewire myself to work with the unknowns, imperfect solutions, messy details. I'm going on a tangent, but just wanted to share.
The disadvantage is that LLMs are probabilistic, mercurial, unreliable.
The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.
If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice etc.), and have the luxury of well defined schema, you're not going to see the advantage side.
Software engineering has involved a lot of people doing trial-and-error hand-waving for at least a decade. We are now codifying the trend.
Out of curiosity, what do you work on where you don’t have to experiment with different solutions to see what works best?
Usually when we’re doing it in practice there’s _somewhat_ more awareness of the mechanics than just throwing random obstructions in and hoping for the best.
LLMs are still very young. We'll get there in time. I don't see how it's any different than optimizing for new CPU/GPU architectures other than the fact that the latter is now a decades-old practice.
Not to pick on you, but this is exactly the objectionable handwaving. What makes you think we'll get there? The kinds of errors that these technologies make have not changed, and anything that anyone learns about how to make them better changes dramatically from moment to moment and no one can really control that. It is different because those other things were deterministic ...
In comp sci it’s been deterministic, but in other science disciplines (eg medicine) it’s not. Also in lots of science it looks non-deterministic until it’s not (eg medicine is theoretically deterministic, but you have to reason about it experimentally and with probabilities - doesn’t mean novel drugs aren’t technological advancements).
And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.
The problem has always been that every token is suspect.
It's the whole answer being correct that's the important thing, and if you compare GPT 3 vs where we are today only 5 years later the progress in accuracy, knowledge and intelligence is jaw dropping.
I have no idea what you're talking about because they still screw up in the exact same way as gpt3.
> I don't see how it's any different than optimizing for new CPU/GPU architectures
I mean that seems wild to say to me. Those architectures have documentation and aren't magic black boxes that we chuck inputs at and hope for the best: we do pretty much that with LLMs.
If that's how you optimise, I'm genuinely shocked.
I bet if we talked to a real low-level hardware systems/chip engineer they'd laugh and take another shot at how we put them on a pedestal.
Not really, in my experience. There's still fundamental differences between designed systems and trained LLMs.
Most people are building straightforward CRUD apps. No experimentation required.
[citation needed]
In my experience, even simple CRUD apps generally have some domain-specific intricacies or edge cases that take some amount of experimentation to get right.
Idk, it feels like this is what you’d expect versus the actual reality of building something.
From my experience, even building on popular platforms, there are many bugs or poorly documented behaviors in core controls or APIs.
And performance issues in particular can be difficult to fix without trial and error.
Not helpful when the LLM's knowledge cutoff is a year out of date and the API and libs have changed since.
One of the major advantages and disadvantages of LLMs is they act a bit more like humans. I feel like most "prompt advice" out there is very similar to how you would teach a person as well. Teachers and parents have some advantages here.
Yeah this is why I don't like statistical and ML solutions in general. Monte Carlo sampling is already kinda throwing bullshit at the wall and hoping something works with absolutely zero guarantees and it's perfectly explainable.
But unfortunately for us, clean and logical classical methods suck ass in comparison so we have no other choice but to deal with the uncertainty.
prompt tuning is a temporary necessity
> no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
Challenge accepted.
That said, the exact quote from the linked notebook is "It’s generally not necessary to use all-caps or other incentives like bribes or tips, but developers can experiment with this for extra emphasis if so desired.", but the demo examples OpenAI provides do make liberal use of ALL CAPS.
references for all the above + added more notes here on pricing https://x.com/swyx/status/1911849229188022278
and we'll be publishing our 4.1 pod later today https://www.youtube.com/@latentspacepod
I'm surprised and a little disappointed by the result concerning instructions at the top, because it's incompatible with prompt caching: I would much rather cache the part of the prompt that includes the long document and then swap out the user question at the end.
The way I understand it: if the instructions are at the top, the KV entries computed for the "content" can be influenced by the instructions - the model can "focus" on what you're asking it to do and perform some computation while it's "reading" the content. Otherwise, you're completely relying on attention to find the information in the content, leaving it much less token space to "think".
Prompt on bottom is also easier for humans to read as I can have my actual question and the model’s answer on screen at the same time instead of scrolling through 70k tokens of context between them.
Wouldn’t it be the other way around?
If the instructions are at the top the KV cache entries can be pre-computed and cached.
If they’re at the bottom the entries at the lower layers will have a dependency on the user input.
It's placing instructions AND the user query at top and bottom. So compare the two prompt layouts sketched below:
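A rough sketch, with placeholder variable names; the token counts in the comments are the same illustrative numbers used in the next paragraph.

    # Placeholder pieces -- names and token counts are only for illustration.
    instructions  = "<system instructions>"   # ~200 tokens, identical every request
    long_document = "<big document>"          # ~5000 tokens, identical every request
    user_query    = "<the user's question>"   # ~32 tokens, changes every request

    # Layout 1: instructions at the top, user query only at the bottom.
    prompt_top_only = instructions + long_document + user_query

    # Layout 2 (what the guide recommends): instructions + query at top AND bottom.
    prompt_sandwich = (
        instructions + user_query + long_document + instructions + user_query
    )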
With the first layout, the key-values for the first 5200 tokens (instructions + document) can be cached, and it's efficient to swap out the user query for a different one: you only need to prefill 32 tokens and generate output. But the recommendation is the second layout, where you can only cache the first 200 tokens and need to prefill 5264 tokens every time the user submits a new query.
Ahh I see. Thank you for the explanation. I didn’t realise there was user input straight after the system prompt.
yep. we address it in the podcast. presumably this is just a recent discovery and can be post-trained away.
If you're skimming a text to answer a specific question, you can go a lot faster than if you have to memorize the text well enough to answer an unknown question after the fact.
The size of that SWE-bench Verified prompt shows how much work has gone into the prompt to get the highest possible score for that model. A third party might go to a model from a different provider before going to that extent of fine-tuning of the prompt.
> - don't self-inject/parse toolcalls (+2%)
What is meant by this?
Use the OpenAI API/SDK for function calling instead of rolling your own inside the prompt.
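For example, with the OpenAI Python SDK you pass the tool schema via the tools parameter and get structured tool calls back, instead of describing tools inside the prompt and parsing them out of the text yourself. The get_weather tool here is made up for illustration.

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical tool, purely for illustration.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=tools,
    )

    # Tool calls come back as structured objects -- no hand-parsing of raw text.
    for call in response.choices[0].message.tool_calls or []:
        print(call.function.name, call.function.arguments)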
> - JSON BAD - use XML or arxiv 2406.13121 (GDM format)
And yet, all function calling and MCP is done through JSON...
JSON is just MCP's transport layer. You can reformat it to XML before passing it to the model.
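A toy sketch of that reformatting step. The tag name and payload are made up, and real MCP results can be nested, so this is only the flat case.

    import json
    from xml.sax.saxutils import escape

    def to_xml(tag: str, payload: dict) -> str:
        # Flatten a flat JSON object into simple XML tags for the prompt.
        fields = "\n".join(f"  <{k}>{escape(str(v))}</{k}>" for k, v in payload.items())
        return f"<{tag}>\n{fields}\n</{tag}>"

    raw = '{"title": "Q1 report", "status": "done", "owner": "amy"}'
    print(to_xml("tool_result", json.loads(raw)))
    # <tool_result>
    #   <title>Q1 report</title>
    #   <status>done</status>
    #   <owner>amy</owner>
    # </tool_result>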
Yeah anyone who has worked with these models knows how much they struggle with JSON inputs.
Why XML over JSON? Are they just saying that because XML is more tokens so they can make more money?