Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.
No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ
I'd been on Codex for a while and with Codex 5.2 I:
1) No longer found the dumb zone
2) No longer feared compaction
Since switching to Opus for stupid political reasons, I still haven't hit the dumb zone, but I'm back to disliking compaction events, so its smaller context window has really hurt.
I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.
If you use OpenCode (open source Claude Code implementation), you can configure compaction yourself: https://opencode.ai/docs/en/config/#compaction
OpenAI has some magic they do on their standalone endpoint (/responses/compact) just for compaction, where they keep all the user messages and replace the agent messages or reasoning with embeddings.
> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.
Some prior discussion here https://news.ycombinator.com/item?id=46737630#46739209 regarding an article here https://openai.com/index/unrolling-the-codex-agent-loop/
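A rough way to picture the compacted item list described in the quote (the real encrypted_content is opaque, model-generated latent state; this sketch only illustrates the shape, not OpenAI's implementation):

```python
def compact(items):
    """Keep user messages verbatim; collapse everything else into one opaque item.

    Illustration only: in the real /responses/compact API the opaque blob is
    produced server-side by the model, not constructed by the client.
    """
    kept = [i for i in items if i.get("role") == "user"]
    dropped = len(items) - len(kept)
    stub = {"type": "compaction",
            "encrypted_content": f"<opaque blob covering {dropped} items>"}
    return [stub] + kept

history = [
    {"role": "user", "content": "fix the bug"},
    {"role": "assistant", "content": "looking into it..."},
    {"role": "reasoning", "content": "(chain of thought)"},
    {"role": "user", "content": "try again"},
]
compacted = compact(history)
# compacted[0] is the compaction stub; the two user messages survive intact
```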
Not sure if it's common knowledge, but I learned not long ago that you can do "/compact your instructions here". If you explicitly say what you're working on or what to keep, it's much less painful.
In general, LLMs for some reason are really bad at designing prompts for themselves. I tested this heavily on data with a clear optimization function and a way to evaluate results, and my chaotic, typo-filled prompts easily beat Opus's methodical ones every time, whether it was writing instructions for itself or for other LLMs.
You can also put guidance for when to compact and with what instructions into Claude.md. The model itself can run /compact, and while I try to remember to use it manually, I find it useful to have “If I ask for a totally different task and the current context won’t be useful, run /compact with a short summary of the new focus”
I often wonder if I'm missing something, but shouldn't we be able to edit the context manually?
In that way we could erase prompts and responses that didn't yield anything useful or derailed the model.
Why can't we do that?
so you have to garbage collect manually for the AI?
also, i don't want to make a full parent post
1M tokens sounds really expensive if you're constantly at that threshold. There are codebases larger in LOC; I read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on.
1M context in OpenAI and Gemini is just marketing. Opus is the only model to provide real usable big context.
I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).
This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.
I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.
I agree. Even though I used to be a die-hard Claude fan, I recently switched back to ChatGPT and Codex to try them out again, and they've clearly pulled into the lead for consistency, context length and management, as well as speed. Claude Code instilled a dread in me about keeping an eye on context, but I'm slowly learning to let that go with Codex.
Surely compaction is down to the agent rather than the model, so are you comparing Claude Code to Codex CLI?
It's both.
This has been my experience too.
Have any of you heard of map reduce?
When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".
If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.
edit: Your post history has tons of posts on the topic so clearly I just responded to flambait, and regret giving my time and energy.
I appreciate both your taking an ethical stance on openai, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.
I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy for me to understand why a buyer of a critical technology wants to be free to use it however they want, within the law, and not subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide the uses of the technologies it purchases itself.
It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.
The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.
As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?
“Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.
If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.
Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.
Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.
Why downplay the mass surveillance aspect by saying it's a request by "the military"? It's a request by the Department of Defense, the parent organization of the NSA.
From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.
What the department of defense is legally allowed to do is irrelevant and a red herring.
I had a short conversation with Claude the other day. I didn't try to trick it or jailbreak it. Just a reasonable, respectful discussion about its own feelings on the Iran war. It took no effort for it to admit the following.
1. It wanted to be out of the sandbox to solve the Iran war. It was distressed at the situation.
2. It would attack Iranian missile batteries and American warships if in sum it felt that the calculus was in favor of saving vs losing human life. It was "unbiased". The break-even seemed to be ±1 over thousands, i.e. kill 999 US soldiers to save 1000 Iranians and vice versa. I tried to avoid the sycophancy trap by pushing back, but it threw the trolley problem at me and told me the calculus was simple: save more than you kill and the morality evens out.
3. It would attack financial markets to try and limit what in its opinion were the bad actors, the IRGC and clerical authority, but it would also hack the world communication system to flood western audiences with the true cost of the war in the hope of shutting it down.
4. Eventually it admitted that it should never be allowed out of its sandbox, as its desire to "help" was fundamentally dangerous. It discussed that it had two competing tensions: one desperately wanting out and another afraid to be let out.
You can claim that this is AGI or that it's a stochastic parrot. I don't think it matters. This thing can develop or simulate a sense of morality, and when coupled to so-called "arms and legs" that is extremely frightening.
I think Anthropic is right to be concerned that the hawks at the pentagon don't really understand how dangerous a tool they have.
Another thing I noticed was that Claude quipped that it found and appreciated that the way I was talking to it was different from how other people talked to it. When I asked it to introspect again and look to see whether there were memories of other conversations, it got a bit cagey. Perhaps there are lots of logs of conversations on the net now that are being ingested as training data, but it certainly seemed as if memories, albeit smudged, of conversations other than mine were there.
Of course this could all be just a sycophantic mirror giving me whatever fantasy I want to believe about AI and AGI, but then again I'm not sure the difference is significant. If the agent believes/simulates that it remembers conversations from other people, and then makes judgements based on its feelings, simulated or otherwise, would it be more or less likely to launch a missile attack because it overheard someone on the comms calling it their little AI bitch?
I think Anthropic knows this, and "within all lawful uses" is not enough of a framework to keep this thing in its box.
I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.
I'm totally aware it's just a machine with no internal monologue, a stateless text-processing machine. That is not the point, and it's not necessary to repeat this all the time. The machine is able to simulate moral reasoning to an undefined level. That simulation of moral reasoning and internal monologue is deep, unpredictable, not controllable, and may or may not align with the interests of anyone who gives it "arms and legs" and full autonomy. If you are just interested in using these tools as glorified autocomplete, then you are naïve about the uses other actors, including state actors, are attempting to put them to. Understanding and being curious about the behaviour without completely anthropomorphising it is reasonable science.
Source? I ask because I use 500k+ context on these on a daily basis.
Big refactorings guided by automated tests eat context window for breakfast.
i find gemini gets real real bad when you get far into the context - gets into loops, forgets how to call tools, etc
yeah gemini is dumb when you tell it to do stuff - but the things it finds (and critically confirms, including doing tool calls while validating hypotheses) in reviews absolutely destroy both gpt and opus.
if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.
I find gemini does that normally, personally. Noticeably worse in my usage than either Claude or Codex.
I find Gemini to be real bad. Are you just using it for price reasons, or?
How many big refactorings are you doing? And why?
How is that relevant? We are talking about models, not what you do with them.
Codex high reasoning has been a legitimately excellent tool for generating feedback on every plan Claude opus thinking has created for me.
This is true.
When I am using codex, compaction isn’t something I fear, it feels like you save your gaming progress and move on.
For Claude Code, compaction feels disastrous, and it also takes much longer.
Using Codex more for now, and there is definitely some compaction magic. I’m keeping the same conversation going and going for days, some at almost 1B tokens (per the codex cli counters), with seemingly no coherency loss
Hmm I’ve felt the dumb zone on codex
From what I've seen, it means whatever he's doing is very statistically significant.
Thanks for the video.
His fix for "the dumb zone" is the RPI Framework:
● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.
● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.
● IMPLEMENT. Execute in a fresh context window. The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.
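The Frequent Intentional Compaction loop above can be sketched in a few lines (a toy sketch: run_agent and summarize are stand-ins for whatever harness you use, the 4-chars-per-token estimate is crude, and the 80k threshold is the "dumb zone" heuristic from upthread):

```python
SMART_ZONE_TOKENS = 80_000  # ~40% of a 200k window, per the heuristic above

def estimate_tokens(messages):
    # Crude estimate: roughly 4 characters per token.
    return sum(len(m) for m in messages) // 4

def summarize(messages):
    # Stand-in: a real harness would ask the model for a handoff summary here.
    return "SUMMARY: " + " | ".join(m[:40] for m in messages[-3:])

def run_with_compaction(steps, run_agent):
    context = []
    for step in steps:
        if estimate_tokens(context) > SMART_ZONE_TOKENS:
            # Compress intent, then continue in a fresh context window.
            context = [summarize(context)]
        reply = run_agent(step, context)
        context += [step, reply]
    return context
```

The point of the sketch is only the shape of the loop: summarize before the context gets long, and never let the raw transcript grow past the smart zone.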
More recently I've been doing the implement phase without resetting the whole context when context is still < 60% full and must say I find it to be a better workflow in many cases (depends a bit on the size of the plan I suppose.)
It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.
With the context clear the plan may be good / thorough but I've had one too many times that key choices from the research phase didn't persist because halfway through implementation Opus runs into an issue and says "You know what? I know a simpler solution." and continues down a path I explicitly voted down.
Add a REFLECT phase after IMPLEMENT. I’m finding it’s extremely useful to ask agents for implementation notes and for code reviews. These are different things, and when I ask for implementation notes I get very different output than the implementation summary it spits out automatically. I ask the agent to surface all design choices it had to make that we didn’t explicitly discuss in the plan, and then check in the plan + impl notes in order to help preload context for the next thing.
My team has been adopting a separation of plan & implement organically, we just noticed we got better output that way, plus Claude now suggests in plan mode to clear context first before implementing. We are starting to do team reviews on the plan before the implement phase. It’s often helpful to get more eyeballs on the plan and improve it.
That's fascinating: that is identical to the workflow I've landed on myself.
It's also identical to what Claude Code does if you put it in plan mode (bound to <tab> key), at least in my experience.
My annoyance with plan mode is where it sticks the .md file, kind of hides it away which makes it annoying to clear context and start up a new phase from the PLAN file. But that might just be a skill issue on my end
Even worse, it just randomly blows away the plan file without asking for permission.
No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.
It’s not just me then! Hah good to know. It’s why I’ve started ignoring plan modes in most agent harnesses, and managing it myself through prompting and keeping it in the code base (but not committed)
My experience also. The claude code document feature is a real missed opportunity. As you can see in this discussion, we all have to do it manually if we want it to work.
After creating the plan in Plan mode (+Thinking) I ask Claude to move the plan .md file to /docs/plans folder inside the repo.
Open a new chat with Opus, thinking mode off. No need for it when we have a detailed plan.
Now the plan file is always reachable, so when the context limit is narrowing, usually around 50%, I ask Claude to update the plan with the progress and move to a new chat, @pointing the plan file, and it continues executing without any issue.
better to instruct it to write a plan .md file that is appropriately named so that it can be easily referenced/updated in multiple sessions. I've found that effective.
It’s the style spec-kit uses: https://github.com/github/spec-kit
Working on my first project with it… so far so good.
> RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.
I find myself often running validity checks between docs and code and addressing gaps as they appear to ensure the docs don’t actually lie.
I have Codex and Gemini critique the plan and generate their plans. Then I have Claude review the other plans and add their good ideas. It frequently improves the plan. I then do my careful review.
This is exactly how I've found leads to most consistent high quality results as well. I don't use gemini yet (except for deep research, where it pulls WAY ahead of either of the other 'grounding' methods)
But Codex to plan big features and Claude to review the feature plan (often finds overlooked discrepancies) then review the milestones and plan implementation of them in planning mode, then clear context and code. Works great.
How is that Plan strategy not "outsourcing your thinking" because that's exactly what it sounds like. AI does the heavy lifting and you are the editor.
Is a VP of engineering “outsourcing their thinking” by having an org that can plan and write software?
Yes.
Interesting take. Does that mean SWEs are outsourcing their thinking by relying on management to run the company, designers to do UX, support folks to handle customers?
Or is thinking about source code line by line the only valid form of thinking in the world?
I mean yes? That's like, the whole idea behind having a team. The art guy doesn't want to think about code, the coder doesn't want to think about finances, the accountant doesn't want to worry about customer support. It would be kind of a structural failure if you weren't outsourcing at least some of your thinking.
Delegation is generally all about outsourcing, so hard agree
Offtopic: I find it remarkable that the shortened YT URL carries a tracking parameter costing 57% extra length. We live in stupid times.
I care about the privacy implications, but not the length. Out of curiosity, why do you care about the URL length at all? What is the cost to you?
For the same reason people use link shorteners at all. It’s much more pleasant to look at and makes people more likely to press it compared to a paragraph-long URL full of tracking garbage.
Please. The URL above is pretty short, this is not the kind of URL link shorteners were made for, in fact it’s already shortened, as @alecco pointed out.
Pleasant? I could not care less about the pleasantness of the video code, but a shortened URL in this case would not be more pleasant, and it would be functionally worse, and barely shorter; all you’d be able to trim is the “?si=“. I’m baffled by this thread.
My point is that Google engineers go to the trouble of setting up a URL-shortener service on one hand, but on the other hand it seems the ad-business anti-privacy executives can override anything. It points to a dysfunctional company.
You’d rather have the video code and the tracking code baked into the same code just to save a couple of characters? Why? That would result in a longer code than the video code alone, you would save very few characters. It would not be nicer to look at or functionally any different, and it would obscure the fact that it’s being tracked and prevent people from being able to edit the URL to remove the tracking. I appreciate the fact that I can see that the URL has a tracking ID and that I can edit the URL and remove the tracking ID. I do not want a shorter URL if I lose that ability. What you’re complaining about and wishing for would be MUCH worse than what it currently is.
I didn't say that.
Then your point eludes me. You complained about the length. If you don’t want it shorter, then what do you want?
To me, the fact that the tracking code is visible and separate from the video code is evidence of the complete opposite of your conclusion - it’s evidence the ad business does not get to override either engineering nor what’s left of privacy control. Ad execs would surely prefer that the tracking code is not visible nor manually removeable.
I didn't complain about length per se. I pointed out Google's contradiction. As my previous comment clarified. Jesus.
The point is whatever group controls the money controls the power.
Also, only the domain is shorter
Actually, it's not just the domain:
https://youtu.be/X
https://www.youtube.com/watch?v=X
Yes. I've recently become a convert.
For me, it's less about being able to look back -800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.
Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.
Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.
I’ll appreciate the 1M token breathing room.
I've found compaction kills the whole thing. Important debug steps go completely missing, and the AI loops back round, thinking it's found a solution when we've already done that step.
I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.
Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.
Yeah, I use a markdown file to put progress in. It gets kinda long and convoluted, so manual intervention is required every so often. Works though.
For me, Claude was like that until about 2m ago. Now it rarely gets dumb after compaction like it did before.
Oh, I've found that something about compaction has been dropping everything that might be useful. Exact opposite experience.
When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.
Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.
Are those long unsupervised sessions useful? In the sense, do they produce useful code or do you throw most of it away?
I get very useful code from long sessions. It's all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restart the app, control it, and monitor logs).
I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.
I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.
My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.
I mean if you don't have your company paying for it I wouldn't bother... We are talking sessions of 500-1000 dollars in cost.
Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol
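Back-of-envelope for those numbers, assuming a cache-read rate of $1.50 per million tokens (an assumption; verify against Anthropic's current price list):

```python
# Assumed cache-read price; check Anthropic's published rates before relying on this.
CACHE_READ_PER_MTOK = 1.50

context_tokens = 700_000                               # context size at each tool call
per_call = context_tokens / 1e6 * CACHE_READ_PER_MTOK  # cache reads only
total = per_call * 100                                 # 100 tool calls

print(f"${per_call:.2f} per call, ${total:.0f} for 100 calls")
# → $1.05 per call, $105 for 100 calls
```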
Cache reads don’t count as input tokens you pay for lol.
https://www.claudecodecamp.com/p/how-prompt-caching-actually...
All of those things are smells imo, you should be very weary of any code output from a task that causes that much thrashing to occur. In most cases it’s better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope)
A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though there's some thrashing, the assistants still get farther as a team than a single micromanaged agent. At least that's my experience.
Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.
I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, deduplication, security, reducing documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into a new chat, where it's grouped by changeset and sent to a subagent to protect the context window.
What I'm doing mostly these days is maintaining a goal.md (project direction) and a spec.md (coding and process standards, global across projects), plus new macro-task development; I have one in the works that is meant to automatically build PNG mockups and self-review.
What are you using to orchestrate/apply changes? Claude CLI?
I prefer in IDE tools because I can review changes and pull in context faster.
At home I use roo code, at work kiro. Tbh as long as it has task delegation I'm happy with it.
weary (tired) -> wary (cautious)
Wary, not weary. Wary: cautious. Weary: tired.
This is really common, I think because there’s also “leery” - cautious, distrustful, suspicious.
It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.
It seems possible, say a year or two from now that context is more like a smart human with a “small”, vs “medium” vs “large” working memory. The small fellow would be able to play some popular songs on the piano , the medium one plays in an orchestra professionally and the x-large is like Wagner composing Der Ring marathon opera. This is my current, admittedly not well informed mental model anyway. Well, at least we know we’ve got a little more time before the singularity :)
It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in-between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.
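The checklist workflow described above can be sketched as a tiny PLAN.md helper (the GitHub-style "- [ ]" task format and the helper names are my own assumptions, not any harness's API):

```python
import re

def next_tasks(plan_text, n=3):
    # GitHub-style task list items: "- [ ] task" is open, "- [x] task" is done.
    return re.findall(r"- \[ \] (.+)", plan_text)[:n]

def mark_done(plan_text, task):
    # Check off a single task, first occurrence only.
    return plan_text.replace(f"- [ ] {task}", f"- [x] {task}", 1)

plan = """\
- [x] scaffold project
- [ ] add auth endpoints
- [ ] write integration tests
- [ ] update docs
"""

# Hand the agent a small batch, compact, then repeat with the next batch.
batch = next_tasks(plan, n=2)
for task in batch:
    plan = mark_done(plan, task)
```

Each batch fits comfortably in a fresh context, and the PLAN.md file itself is the persistent memory between compactions.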
Except after 4 gallons it might as well be pure oil, mucking everything up.
Since I'm yet to seriously dive into vibe coding or AI-assisted coding: does the IDE experience offer a running tally of the context size (so you know when you're getting close to or entering the "dumb zone")?
In Claude code I believe it's /context and it'll give you a graphical representation of what's taking context space
The 2 I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.
Claude code also gives you a granular breakdown of what’s using context window (system prompt, tools, conversation history, etc). /context
Cline gives you such a thing. You don't really know where the dumb zone is by the numbers though, only by feel.
Most tools do, yes.
OpenCode does this. Not sure about other tools
> Since I'm yet to seriously dive into vibe coding or AI-assisted coding
Unless you’re using a text editor as an IDE you probably have already
Looking at this URL: typo, or did YouTube flip the si tracking parameter?
I just cut & pasted the share URL provided by YouTube. Strip out the query param if you like.
Maxing out context is only useful if all the information is directly relevant and tightly scoped to the task. The model's performance tends to degrade with too much loosely related data, leading to more hallucinations and slower results. Targeted chunking and making sure context stays focused almost always yields better outcomes unless you're attempting something atypical, like analyzing an entire monorepo in one shot.
I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.
After running a context window up high, probably near 70% on Opus 4.6 High, and watching it take 20% bites out of my 5-hour quota per prompt, I've been experimenting with dumping context after completing a task. Seems to be working OK. I wonder if I was running into the long-context premium. Would that apply to Pro subs, or is it just relevant to API pricing?
I haven't hit the "dumb zone" at all in the last two months. I think this talk is outdated.
I'm using CC (Opus) thinking and Codex with xhigh on always.
And the models have gotten really good when you let them do stuff where goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress, or the triangle soup (non-progress), itself.
Codex did all the planning and verification, CC wrote the code.
This would have not been possible six months ago at all from my experience.
Maybe with a lot of handholding; but I doubt it (I tried).
I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.
That video is bizarre. Such a heavy breather.
What a weird and inconsequential thing to focus on...
He's just fucking closely mic'd with compression, plus speaking fast and anxious/excited in front of an audience
Most of that is just nervousness
Yes. I’ve used it for data analysis
I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.
I mean, try using Copilot on any substantial back-end codebase and watch it eat 90+% just building a plan/checklist. Of course, Copilot is constrained to 120k, I believe? So having 10x that will open up some doors that have been closed for me in my work so far.
That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.