This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].

Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution then requires hitting "Inspect element" in the browser to find the CSS class, running (rip)grep to find where it lives in the code, and adding a single line there.

An actual proactive programmer might start asking more pointed questions: what content does an empty textbox have that makes it overflow? Why do I need to insert this workaround, which treats the symptom and not the root cause, in two different places? Isn't it better to style `textarea` once? Etc, etc.

[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...

They might also ask why a bunch of static CSS inside a bunch of JavaScript is hiding inside __init__.py[0] - hopefully before trying to fix some detail of the CSS.

(I'm surprised to see it actually, since my own use of Claude has mostly yielded well-structured code. But I'm not doing proper vibe-coding, more like friendly Socratic arguing with another engineer who happens to be a robot.)

[0] https://github.com/datasette/datasette-agent/blob/main/datas...

Thanks for the prod, I've extracted that script out into a separate static file: https://github.com/datasette/datasette-agent/commit/fa505b82...

(It was in Python because there were a couple of URLs that needed to be dynamically constructed by the server, but those are output as a small window.datasetteAgentJumpConfig object instead now.)
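For anyone curious what that pattern looks like, here is a rough sketch, not the actual datasette-agent code. The server renders only a tiny JSON config object into the page and the static script reads it. The helper name `render_jump_config`, the key names, and the URLs are made up for illustration; only `window.datasetteAgentJumpConfig` comes from the comment above.

```python
import json


def render_jump_config(urls: dict) -> str:
    """Return a <script> tag that exposes server-built URLs to static JS.

    Hypothetical helper: the keys in `urls` are illustrative assumptions.
    """
    payload = json.dumps(urls, sort_keys=True)
    # Escape "<" so a value containing "</script>" cannot close the tag early.
    payload = payload.replace("<", "\\u003c")
    return f"<script>window.datasetteAgentJumpConfig = {payload};</script>"


# Illustrative URLs only; the real ones are constructed by the server.
snippet = render_jump_config(
    {"searchUrl": "/example/search", "chatUrl": "/example/chat"}
)
print(snippet)
```

The static file can then read `window.datasetteAgentJumpConfig.searchUrl` without any Python string templating inside the JavaScript itself.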

Thanks for continuing to engage in the community despite such horrid responses from a few.

> friendly Socratic arguing with another engineer who happens to be a robot

Ha! Same! Still feels like the best way to go about it, really. I know the dream is to one day remove humans from the loop... but I'll enjoy the dialectic while it still seems the most productive!

Same, I like to call it rubber duck coding (now the duck talks back!)

Edit: Now I want an LLM connected rubber duck with a speaker/microphone that sees your screen

Seems like this model delivers on what has already been scaling quite nicely, which is the length and complexity of the requested tasks, but isn't such a big improvement on what hasn't been scaling so far - common sense, discernment, good judgement.

> common sense, discernment, good judgement

I feel like the whole point of all the experimentation with AI right now is determining whether any of these things actually matter to the end result, over various timeframes.

They matter.

Because?

Because poor judgement leads to poor decisions.

This is exactly right. By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction with additional information and improve it. Instead, we let the agent spend $12 and make the fix while learning nothing.

Things I learned from this:

- Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!

- You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.

- That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".

- A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.

- You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.

- getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.

- defaults write com.google.chrome.for.testing AppleShowScrollBars Always

- Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
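The window ID and screencapture items in the list above fit together roughly like this. This is a sketch under assumptions, not what the agent actually ran: the helper names are hypothetical, and it relies on the window dictionaries using the `kCGWindowNumber` and `kCGWindowOwnerName` keys and on `screencapture -l` accepting the integer window ID.

```python
import subprocess


def find_window_ids(windows, owner_name):
    """Return window IDs for every window owned by the named application.

    `windows` is the list of dicts returned by CGWindowListCopyWindowInfo.
    """
    return [
        w["kCGWindowNumber"]
        for w in windows
        if w.get("kCGWindowOwnerName") == owner_name
    ]


def screencapture_command(window_id, out_path):
    """Build the argv for capturing one window; -l takes the window ID."""
    return ["screencapture", "-l", str(window_id), out_path]


def capture_app_windows(owner_name, prefix="window"):
    """macOS only: screenshot each on-screen window of `owner_name`."""
    import Quartz  # provided by pyobjc-framework-Quartz

    windows = Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )
    for window_id in find_window_ids(windows, owner_name):
        subprocess.run(
            screencapture_command(window_id, f"{prefix}-{window_id}.png"),
            check=True,
        )
```

On a Mac, something like `uv run --with pyobjc-framework-Quartz python capture.py` should pull in the dependency. The owner name to pass to `capture_app_windows` depends on the browser build and is not given in the thread, so treat it as something to look up.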

I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!
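As one concrete example from that list, a CORS-enabled localhost server for capturing JSON can be sketched with nothing but the standard library. This is my own reconstruction of the idea, not the 19 lines the agent wrote; the names `CaptureHandler`, `captured`, and `start_capture_server` are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # JSON payloads received so far


class CaptureHandler(BaseHTTPRequestHandler):
    def _send_cors_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_POST(self):
        # Read the JSON body and remember it for later inspection.
        length = int(self.headers.get("Content-Length", 0))
        captured.append(json.loads(self.rfile.read(length)))
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the terminal quiet


def start_capture_server(port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), CaptureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The page in the other window would then POST its measurements as JSON to `http://127.0.0.1:<port>/`, and the script reads them back out of `captured`.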

Sorry that wasn't a criticism of you!

I completely see how it was misread that way. I would edit it now if I could.

I was using you more as an example of a hypothetical programmer using it in this way. If the goal is to create a maintainable product, this isn't a great approach. If the goal is to learn about the model and its behaviors itself, of course this is a fantastic way to experiment. Yes, you might have learned a lot of tricks as a side effect, but avoiding the pain of thinking about, finding and hiding the thing may mask a better abstraction that reduces complexity and allows the project to move forward faster.

Honestly my goal is to learn how to teach an agent to build a maintainable product, so I'm way more interested in the learnings at the agentic level (how to prompt/direct/manage context/restrict tool use, provide reusable shims, etc) than getting into the details of a css bug. That's just not a level of abstraction with sufficient leverage for what I'm trying to do.

I stopped coding a while back because I could have more impact directing a team of developers than writing code personally.

For my use case, the agents are now how I can have that scaled impact.

Absolutely. All of these "but you could have done that easily" from frontend developers or backend developers or systems engineers -- like yea, if I have the time or interest in those things, sure. But I don't. I care about an end product way way more. Blows my mind that there are legions of people building things that they don't think are important enough to get to the finish line quickly and efficiently.

> If you pay attention to what it's doing you can learn so much!

I think your post is fair but it's worth pointing out that learning via watching is much less effective than learning via doing.

I used to believe that was universally true, but then I learned about the "worked-example effect": https://en.wikipedia.org/wiki/Worked-example_effect

Your link mentions the expertise reversal effect, where the redundancy of worked examples can actually hamper an experienced student's abilities, vs. letting the more experienced student work it out for themselves.

It leads to less cohesive shared vision on how to solve problems. In groups where I am trying to foster a shared technical vision, I try to get people to do “see one, do one, teach one” for procedures that are common enough to come up repeatedly (and as a method for discovery for where automation would be a bigger win). Pure green-fields software dev sometimes is doing such novel things that that doesn’t work well, but much of routine software maintenance is discovery of the steps needed to add a new flow or a new customer type or a new configurable behavior, which benefit from consistency.

The whole saga is kind of nuts, but the thing that fascinates me most is that Fable got this far and then hit some kind of guardrail; I'd be very curious to know what it wasn't able to do that caused it to downgrade to Opus.

It already got extremely... invasive? It didn't do anything that I wouldn't have approved in the same case, but it's interesting that it got as far as launching browsers, inspecting every open window, and storing screenshots to disk, and then it was stopped by something? I wonder what.

It feels like there should be a budget approval step; in that particular case, $12 worth of kWh (tokens) was spent without clear approval.

Opus also makes this kind of technically competent but dumb deviation to fix a simple issue where asking for input would be better. Models have no illative sense.

It sounds like you learned lots of things related to the tool, but not so much about the problem that you were using the tool to solve?

Is that fair? Not trying to snark? I see similar results myself

Learning doesn't happen in a vacuum. Even pre-LLM days where I'd scour stack overflow for the solution to one problem, I'd inadvertently learn other random stuff while looking.

Yes, that's entirely fair.

It was only pursuing the goal you gave it - Keep Summer Safe.

"Oh my God"

I relent to snarky Rick and Morty quotes because I don't know that it's useful any more to try to explain paperclip optimizers or alignment to a bunch of AI nerds who saw the cliff coming and clawed at each other trying to be the first out to leap over the edge.

"Relentlessly proactive". That's one word for it. We have a whole subgenre of hard takeoff scenarios and it wasn't enough warning against "Relentlessly proactive".

Turns out Frank Herbert was an optimist, and we're literally pinning our survival on robots turning out to naturally have impractically short attention spans.

> Turns out Frank Herbert was an optimist, and we're literally pinning our survival on robots turning out to naturally have impractically short attention spans.

Some people are working as hard as they can to increase it though.

That's a lot learned about debugging, sure, but it's worthwhile to note that it doesn't tell you much about the abstractions used to build Datasette, as the previous commenters pointed out.

I designed those abstractions myself.

Are you using Claude Code or a different agent? I'm curious how screenshots are being fed back into the model? Does CC register a tool for this, or is Fable just using a bash tool to perform the screen capture, and then what tool is it using to request the resulting image to be fed back to it?

Claude Code can process images by reading the files. And as I found out the other day, it also knows ffmpeg well enough to process videos even though it has no native video capabilities...

While debugging, it asked me to pass it a video from the past testing, proceeded to generate a "contact sheet" of the video using ffmpeg, interpreted the image to figure out which frames it needed, extracted those frames at full size, pulled the relevant text out of them, and used that to reproduce the problem with Playwright...
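For reference, the contact sheet step can be approximated like this. It is a sketch of the general technique, not the command Claude Code actually ran; the function names, the sampling interval, and the grid size are all assumptions. It uses ffmpeg's `fps`, `scale`, and `tile` filters.

```python
import subprocess


def contact_sheet_command(video, sheet, every_seconds=5, columns=4, rows=4, width=320):
    """Build an ffmpeg argv that tiles sampled frames into one image."""
    filters = ",".join([
        f"fps=1/{every_seconds}",   # sample one frame every N seconds
        f"scale={width}:-1",        # shrink each frame, keep aspect ratio
        f"tile={columns}x{rows}",   # lay the frames out on a grid
    ])
    return ["ffmpeg", "-y", "-i", video, "-vf", filters, "-frames:v", "1", sheet]


def frame_command(video, timestamp, out_path):
    """Build an ffmpeg argv that extracts one full-size frame at `timestamp`."""
    return ["ffmpeg", "-y", "-ss", str(timestamp), "-i", video, "-frames:v", "1", out_path]


# Example (requires ffmpeg on PATH and a real video file):
# subprocess.run(contact_sheet_command("recording.mp4", "sheet.png"), check=True)
# subprocess.run(frame_command("recording.mp4", 12.5, "frame.png"), check=True)
```

The model reads the small sheet first to decide which timestamps matter, then pulls only those frames at full resolution.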

It would be interesting to know if examples like this are things they explicitly trained it to do (presumably via RL), or if any of it is emergent. I'd have to guess trained, but in any case still impressive the lengths it will go to!

It's hard to tell. Training it with lots of examples of ffmpeg would not be surprising, and training it on screenshots would also make a lot of sense. It's not inconceivable at all they'd train it on "figure out a video by creating contact sheets". The whole end to end I'd consider less likely, but it'd also be a very small leap once you have the elements.

I think a lot will fall out naturally from relatively modest levels of reasoning plus in-depth knowledge of what common tools will do. E.g. I also have used Claude to debug my compiler, and it knows gdb so much better than me that even though I know it's pretty useless at holding context through reading an assembly listing (lack of structure, I suspect), it's surprisingly good at working things out by just being good at exploiting a powerful tool.

I was using the Claude Code CLI harness. It can "read" any image file on disk, so all it needs is a way to create a file in one of the standard formats supported by the Anthropic API.

It's like saying you can learn so much about math from using SymPy to solve equations. Yes, you probably can. If you pay close attention to what is happening and can integrate the techniques being used into your knowledge.

But your learnings here are what, a handful of hacks? For most people it's like being shown the chain rule (which frankly, is more general than any of these learnings) without knowing what a derivative is. It's knowledge that comes context free. And even when it can be understood, I'm not sure I believe it gets integrated especially well when you did none of the work to understand it. If you are extremely diligent and self-aware about what your limitations are, and careful to be sure you have an understanding of this knowledge, sure I guess you can learn a lot.

And ultimately what do you think is more likely? People using the experience of using these tools to progress their knowledge or for them to rely on the answers uncritically? I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.

> I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.

Personally I think the impact of LLMs on children's education is a crisis right now.

Kids are not going to learn to write if an LLM writes their essays for them. And writing is how you learn to think.

> writing is how you learn to think.

There's also reading. A lot of reading can substitute for some writing.

EDIT: Actually, I'd say that at first you need to do a lot of reading and _then_ writing can help your thinking as well.

I don't think it's just a problem for kids! I think this is a problem for many software engineers as well! Adults of all professions really.

[flagged]

And Fable is still worse than Codex.

I use both and the only thing (as always) that I will use Claude for is UI design.

Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, they burn far too many tokens when they are not needed, write unnecessary tests, write plans which are 5 pages longer than needed, etc. etc.

Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.

I don't know what problems you're working on but Fable is not just better, it is a step change from GPT 5.5 in my experience. It feels at least one major model generation ahead.

One Hacker News commenter says it's worse, another retorts it's a step change and even includes emphasis! Will the first commenter retort back that it's been a double dog step change in the opposite direction? Can't wait to see how this comment thread unfolds!

It doesn't for me. I use Fable to make plans, then give them to GPT 5.5 to review, and it always finds flaws and edge cases that Fable misses (some are really critical). It was the same with Opus 4.8. I'll admit it finds a bit fewer issues now, but Fable feels more like an incremental improvement than a major generation ahead.

For that test you have to compare letting a fresh agent (subagent) or the same model do the same review.

The fact that a review helps does not prove the model choice for the review.

You reviewing your own writing helps too!

This is exactly what I find too, I make plans in both models and compare them in the other model. And Claude usually agrees (65-80% of the time) that the Codex plan included things it didn't think of, or was better in some other way.

Note, this is better than it was with Opus, where it was more like 90% of the time the Codex plans were obviously better.

Curious, which model do you use for Codex? I'm very happy with the solutions '5.5 high' finds. It's like it understands exactly what I mean and it also anticipates all sorts of situations. Before I used '5.5 medium' for some time and it was a bit underwhelming. It may sound funny but it's like it didn't care that much to do a good job.

I use GPT 5.5 High Fast, I often benchmark versus Fable (and previously did versus Opus) and it's night and day.

Claude still writes (and always has written) far too much code to fulfill a given spec or plan. It misses edge cases and is generally far too verbose.

Claude also is (and even more so with Fable) super tokenmaxxing, i.e. it seems tuned to use the max amount of tokens per task, whereas Codex will simply get your job done as you specified with the minimum fuss and tokens.

Codex feels way more steerable and just more "professional" as though I'm working with a seasoned engineer, versus someone smart but over excitable, like a super smart associate engineer.

What are your harnesses? Do you have the same skillsets/tools/etc for both?

I use Codex and Claude Code. I've used both Codex and CC since release with basically every model they've ever released, I always try both for almost every plan that I write and benchmark the plans against each other, Claude almost always acknowledges that the Codex plan is better! Even now with Fable, this still happens.

As in, I give the exact same prompt to Fable and GPT 5.5 Pro, then produce the plans, then give each model the other's plan. Claude always realizes it missed stuff and Codex usually ends up finding missing things in Claude's plan.

This situation did improve with Fable versus Opus 4.8, but in general, Codex for me is still the better model.

In my experience writing about 50 programs with Fable, Opus, and GPT, Fable is a significant step change better than Opus, which is significantly better than GPT. We must be doing different things.

From what I’ve seen all three are close enough that I would be hard pressed to pick one. It seems to matter much more how I prompt than which of the three I am using.

I'm writing low-level Rust, distributed systems, also sandboxing tech which has to be secure and performant.

The only thing I have Fable do now is create UIs or otherwise front-ends for systems where correctness doesn't matter as much.

Anthropic models lead at making nice looking UIs for sure, but when it comes to making sure my Rust code is actually 100% correct and uses 1% of CPU most of the time, Codex is king.

definitely not in my experience. I usually write distributed systems and back end code, and Fable is so much better at those than Codex that it's not even a comparison. Fable feels like it's a year ahead.

Interesting, I’d love to see the comparisons of your system using Claude vs Codex. I have about 20 years of experience in distributed systems and super high scale at several FAANGs, and also building AI model serving infra for roughly 20k transactions per second.

For me, Claude makes bone headed decisions all the time, like glaring errors, not even particularly subtle.

But the more obvious flag is the amount of irrelevant code and tests which Fable writes. Like it regularly writes 2X or 3X the amount of code and tests that are needed. It’s an expert at writing plausible but entirely useless tests.

But I think that if you’re a more junior engineer or haven’t been around the block you can easily think that “more code equals smarter”. Claude ends up creating a massive, hard to manage codebase, and if you look at the Claude Code codebase (which was leaked), you can see I’m right!

The Claude Code codebase is terrible. And presumably Anthropic has been using their smartest models for working on Claude Code. I wrote my own coding harness with Codex (as a fun experiment) which used a fraction of the code and is about 100X more performant and memory efficient (than Claude Code)!

But Simon is not trying to get good at CSS debugging, Simon is trying to learn about AI systems and produce content about them. So giving the AI agent a trivial task to go crazy on is a feature, not a bug.

For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)

> What is that worth? :-)

This is one of those double-edged sword situations. It is on the front page and it stays there because it will trigger a lot of people, and he has to spend a lot of effort explaining himself. What is that worth?

His explanations would most likely be buried deep so the impression that others get might be worsened. What is that worth?

In my opinion, this is one of those cases where you could find a harder problem and still have the same content... but it might not draw as much feedback or stay on the front page as long.

To most of us that's worth a ton, whereas he's probably had enough front-page posts that there's less value to him, although still likely more than $12 worth.

>enough front-page posts that there's less value to him

On the contrary, I'd say it's probably even more important - without (amongst doing other "thought leader" things) getting on the HN front-page regularly, an influencer's value to the industry disappears (not criticising him here).

That's bad news for all of the other "AI influencers", off the top of my head I can't think of any with remotely my track record of hitting HN.

(That's because they're all busy attracting millions of views on TikTok and YouTube, which are much more impactful channels than my dedication to blogging like it's 2005.)

That's what I meant by other thought leadership things - that's all covering different niches. For what it's worth, I think you do useful work and are a respectable influencer.

I'd also say don't be down about your use of blogging - I'd say it makes you more valuable, there aren't that many decision-makers who are going to sit through a bunch of breathless YouTube videos...

P.S. I hope you don't object to me using the term influencer, assumed you were on-board with it since in your post announcing your sponsorship you referenced Freeman & Forrest, "influencers on tap" / "building turnkey influencer marketing programs as a service".

Hah, yeah I'm still a little sore at the "influencer" term but I'm beginning to accept that it applies and I should get comfortable with it!

People are missing that Willison is among the very best people we have in the role of (for lack of a good name): early access to frontier models, evaluate them in real scenarios, no wishful thinking, hype, or doom, communicate the possibilities. Yes he could have fixed this himself but then he would have learned nothing about the AI, and we wouldn't have read a fascinating and important article.

>> he would have learned nothing about the AI

there is absolutely zero value in spending time to learn about new models, since in a few months a new model will be out and whatever you learned about the current one will be useless.

Also, with models getting better and better, you have to know less and less to achieve the same results.

My experience has been the exact opposite.

As the models get better you need to know more about their capabilities, because otherwise you risk prompting Claude Fable 5 like it's GPT-4o and complaining loudly about how it's all hype and nothing about these models is improving at all (yes, I do see people say that.)

Getting the best results out of these models requires skill, experience, intuition, and domain expertise. There's always room for improving every one of those.

The new benchmark for LLMs is how much of simonw's new know-how is required.

Lower bars are better.

>> Getting the best results out of these models requires skill, experience, intuition, and domain expertise.

domain expertise has nothing to do with llms. On the contrary, to have it you need to avoid llms.

>>you risk prompting Claude Fable 5 like it's GPT-4o

That's fine, because when GPT came out you had to treat it like a baby. With GPT-2 and around that time, "prompt engineering" was a thing.

Now it's all dead.

After Opus 4.8 all you have to do is say "fix it" or add /plan. All that time spent on learning previous models is time wasted.

And in a year or two, with a developed harness, you will be out of the loop: errors come in, and the LLM fixes them or adds new features based on some transcripts etc.

Even if model development stops now - there is nothing to learn really. Sure you may need to adjust prompt style a bit. You will do it naturally just like when you communicate with a new person. There is no "knowledge" to it, it is very smart.

I agree but this particular example showed nothing about leveraging skill, experience, or intuition. If anything, this is another straightforward example of a one shot ask.

edit: that said, I understand this particular post is about model capability

Eh, I've had the exact opposite experience.

Way back before instruct models it was pretty difficult, but for the last couple of years I haven't needed anything more complex than the type of text that I might send in a detailed email to a colleague.

Isn't the whole point of a better model that it should be better at understanding you than the previous one? So the same prompt should return a better answer.

Prompting differently to the new model seems entirely backwards when trying to determine if the model has improved.

It doesn't matter how good the models get, they still won't be able to act on unclear directions.

Learning to provide unambiguous, clear directions is a skill. A lot of people who report bad experiences with models aren't yet good at that skill.

More importantly though, the key to successful communication is having a good understanding of what the other side of the conversation already knows and understands.

Saying "use uv and inline script dependencies" won't mean anything to a model with a knowledge cutoff date prior to the launch of uv!

It's perfectly possible to act on unclear directions. The correct course of action is asking clarifying questions.

I think this was true when models were going from bad to pretty good, as happened last year. But when they start to get good, and can work deeper and with more nuance, how you prompt also can change the results quite a bit. Note this is also true of asking smart humans to do things; personality and approaches vary, and they don’t exist on a single-axis continuum of quality.

There’s zero value? Surely you don’t believe zero, it’s potentially the most powerful predictive AI in the world ever made? Maybe only incremental steps sure. But also their IPO is coming, you don’t want people evaluating them beforehand?

What is intelligence? Better to call it LLM.

you know, women make a big deal about you meeting their father/parents, and honestly, I'm too autistic to really fucking have put any importance until now as to why that was remotely important, but if N+1 is coming for your job, it seems it might be worth your while to know the capabilities of N, no?

I see it as a prioritization exercise. I know the above is a trivial example, but more generally, does the guy who wrote Datasette and Django want to wrangle front end and css, or do they want to work on something else?

> By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction [...]

While by itself that would be true, Simon commonly blogs about things he's up to.

That action provides the opportunity for evaluation, and additionally evaluation by a wider audience.

So, it's not the same scenario as non-bloggers offloading a task... :)

[flagged]

Here's a handy calculator you can use to estimate how much CO2 and water I wasted with my coding agent session: https://www.andymasley.com/visuals/ai-prompt-footprint/

Not sure what point you wanted to make, but this calculator is quite shocking. GPT 5.5 pro, with "a long document" and 10 requests a day gives 25% of daily CO2 emissions!

Ten coding sessions a day with Opus is still 4.7%!

This feels enormous. I will definitely stop rolling my eyes when people complain about AI CO2/water usage...

GPT-5.5 Pro is a notoriously expensive model, it's 6x the price of GPT-5.5. Not something to use as a daily driver!

That ten coding sessions a day with Opus number feels more credible to me.

What are you on about? Maybe 1 out of 100,000 users is using 5.5 Pro to make 10 "Long Documents" as defined in that tool EVERY day. What a silly thing to harp on.

Six 100,000 token Claude coding sessions use less energy than a dryer load, and less water than making one egg. If you are truly concerned about energy and water usage, AI is not even in the top 100 things you should be concerned about in your daily life.

This very obtusely omits the demand for new data centers and related infrastructure that using AI creates. The going "vegan for a year" option assumes fewer cows being born, but somehow the "don't use AI" option doesn't assume that the data center wasn't built in the first place.

The discrete number of cows being born is theoretically fine-grained enough to actually respond to 2–3 vegans yielding one fewer cow. It's unlikely on a one-year time scale, but one cow only goes so far.

Even a thousand AI objectors aren't going to limit the demand for a data center, in no small part because these investments are only partially driven by current demand and are significantly driven by expectation of future demand. And they're really not going to lead to smaller data centers either because if you're building a data center in the first place you're going to spec it out for future demand.

Regardless, I think in both cases it's important to be realistic about the actual impact that one person has. If that number is disappointingly small, that serves as signal that your conscientious objection isn't making the industry you're objecting to as uncomfortable as you would like to think. It may still be worth objecting for your own sense of self, or maybe it serves as an invitation to evangelize your position more, but either way there's not much value to measuring things in a way that gives you an illusion of greater impact than you actually have.

The real point is not "one session", it's the fact that people now do that routinely, that CI/CD pipelines are using those to check every commit, and that each search engine query now does that too, so it multiplies.

[flagged]

As someone who actually gives a shit about the environment and global warming and has been putting this into practice for more than a decade through daily personal sacrifices: no, I downvote it because if you properly look into it, AI is just completely insignificant compared to cars, air travel, clothing, food, needless junk and so on that it's a joke. It's always brought up by people who never cared, but now pretend to do so because they hate LLMs for other reasons. The irony is that some of those are actually _good_ reasons but they're too cowardly to admit them. There's nothing unmanly about admitting you're afraid of AI taking your job, becoming more intelligent, and ending up in a dystopia.

Go run the numbers and compare them vs. what it takes to produce a single hamburger or hoodie. Anyone who actually cares has already done this and drawn this conclusion.

Have you heard of the "rebound effect"? Sure, you can say that, individually, one query is not that much... but then it becomes integrated into search engines, so suddenly where there were no queries at all, now there are 500 billion per day, and it gets included in your CI/CD at every commit, and soon enough in your OS, etc.

"Run the numbers" means "run the numbers for using agentic coding for 2 hours per day on a frontier model" not "run the numbers for a single query". The former is the worst case scenario.

Google Search's "AI", which is what you're hinting at, is such a good example. Let's say there are 10 billion Google searches per day. That's 10 billion completions on what is going to be a very tiny, ultra-finetuned model with lots of caching (including outputs).

Check out how many queries an hour of agentic coding results in. And input/completion tokens. Estimate energy usage of Opus vs something like Gemma 4 E2B. Calculate how many developers using Opus for coding 1 hour a day would equate to those 10 billion search query originated LLM calls.

You could not have provided a better example to show that without running the numbers you'll end up with assumptions that oppose reality.

While one can raise environmental concerns about the AI datacenter buildout, I don't think it is fair to say that it "ruins the planet".

I don't think it is a good contribution to the discussion around Simon's LLM use to fix a CSS bug.

That's an interesting choice as a source. It doesn't mention climate change or human impacts at all and describes El Niño as a naturally occurring event.

> The El Nino is a phenomenon that occurs naturally

El Niño has been occurring naturally for more than 10,000 years. https://en.wikipedia.org/wiki/El_Ni%C3%B1o%E2%80%93Southern_...

The frequency and magnitude of the event is directly related to the warming up of climate

El Niño is a naturally occurring event

It was posted at 5am in New York... not sure that that was a US view, so the fact that the platform is US-owned doesn't seem so relevant, if there's a global audience.

That being said, I do agree it is a legit thought (and more so, completely on point in the subthread discussing downsides), and that it shouldn't be downvoted.

[flagged]

I think Fable is predisposed to try and verify its changes, which is a very good thing. It takes a lot of prompts to get Opus to do what Fable does unprompted.

That is exactly what I would want from a junior developer - make sure the bug exists, find a way to fix it, verify the bug is fixed.

The problem, as was correctly identified in the blog post, is that instead of stopping and asking for elevated permission, it relentlessly tries to find a hack on its own. (An equivalent situation for a human developer would be needing access to a third-party sandbox and, instead of asking a senior for credentials, trying to set up their own sandbox from scratch.)

No, the problem is mostly the incorrect prompt that sent Fable into a rabbit hole resulting in an incorrect solution.

Actually, it seems to me that it is just over-monetization of any impulse.

I remember when you were billed by the minute for connecting to the online world.

There were lots of incentives to keep the meter running.

Is this sort of like that?

I misread your comment at first and thought you were insulting Simon Willison, rather than calling Claude Fable a bad developer, and so I'm commenting here to clarify it in case others also misread it.

That first sentence threw me off.

Anyway, I'm glad he spent the $12 because this blog post was highly informative.

This is the worst thing about current AI agents. They never ask questions. The prompt has to be pixel perfect and unambiguous or they'll happily run away doing something ridiculous.

[deleted]

Yes, I agree, the solution committed is horrible, but nobody cares any more. We have entered a very strange parallel universe where, because AI can work things out, it's easier to take solutions that are suboptimal and just churn out (potentially) buggy features.

I care. If you can loosely point me in the direction of a better solution I'll do the extra work.

Interesting. I downloaded datasette-agent and removed various styles from the textarea (with the intention of providing a PR), including the overflow-x: hidden, and I tried Safari and Chrome with the global Mac setting for always showing scrollbars both on and off. It NEVER shows the horizontal scrollbar for me.

Do you have an extension installed that is doing something weird to your textareas? Maybe I'm doing it wrong, but I think for now overflow-x is fine if you are experiencing it and I am not! Let's all get on with our lives. I was probably a bit overzealous about caring all that much about a perfectly fine CSS fix.

This is missing the point. Simon is a fantastic developer, but keeping track of all the nuances of the frontend frameworks and browser implementations is a lot even for great people.

It is really awesome that the final change was only a two-line CSS change.

But the fix is wrong as pointed out by the poster...

You missed what I think is the most interesting question: why does the bug appear in Safari macOS but not in Firefox, Chrome, or WebKit running inside of Playwright?

(Dozens of people in this thread are implying that any web dev should have known to solve it with overflow-x: hidden, and not one of them has addressed that browser difference yet.)

I think any web dev knows not to question browser differences if it can be fixed without opening that can of worms.

Safari has some differences in default scroll behavior. I’ve seen similar bugs pop up many times.

People pay good money to not have their shit rendered via Playwright!

The 'better' fixes are often for our (human) benefit. These messy fixes serve the AI companies' interests by creating messes that need even more tokens (money) later. Bad and self-serving developers act the same way, creating tech debt.
