IMO: It's unacceptable that Anthropic be allowed the final say in what "safety" means for their products, and it's extremely reasonable that the USG be allowed that say, for Americans. In other words: Anthropic cannot be allowed to distribute an unsafe product. It doesn't matter how hard they "tried" to make it safe by their own definition of safe.
That's separate from the question of whether Fable 5 and Mythos 5 are unsafe. I don't really know. Here are a few things that seem real, though: these models probably have some level of capability to assist with bioterrorism, and Anthropic has admitted that its own safety measures are imperfect [1]. So it should come as no surprise that jailbreaks seem far more possible than Anthropic is leading you to believe in this blog post [2].
[1] https://www.anthropic.com/news/fable-mythos-access: "We suspect that perfect jailbreak resistance is not currently possible for any model provider."
[2] https://x.com/elder_plinius/status/2064776322979676227
If Amazon sold a book that taught someone how to commit bioterrorism, would there be action against them to stop selling it? It's an imperfect analogy, but the parallels are there. LLMs don't get a free pass because they're also so good at writing typescript for beige CRUD apps and bedtime stories.
One thing I hope we align on: synthetic safeguards (steering, refusals, etc.) layered on top of models to block illegal or sensitive topics aren't good enough. Anthropic has admitted as much. We need the technology to lobotomize the capabilities the public deems too unsafe, removing them from the models at the most fundamental level. And we need to align on what the scope of these forbidden-fruit topics is. This is, actually, the only way open source continues to thrive. I want open source models to thrive, but they won't be allowed to thrive, nor should we want them to, if they're teaching people how to engineer novel viruses and other horrible stuff.
> LLMs don't get a free pass because they're also so good at writing typescript for beige CRUD apps and bedtime stories.
Plenty of useful things get free passes to be dangerous. Traffic accidents are the leading cause of accidental deaths in 11 states, but we don't ban cars because they're dangerous. There are plenty of safety features, but we acknowledge and accept that people will die. People like to pretend that they won't sacrifice safety for convenience, but they do it time and time again.
Yeah, and in those cases there is strong governmental regulation defining what a "safe car" is. We don't have that with AI. Another analogy is weapons manufacturing: American weapons manufacturers face some, but relatively little, regulation when selling weapons to Americans, yet they are subject to significant regulation when shipping those weapons overseas.
We need legislation that empowers a federal agency, adjacent to the CDC or FTC, to enumerate specific capabilities that models could exhibit and that we deem dangerous, and that requires model manufacturers to guarantee their models cannot exhibit those capabilities. The reality is that none of the safeguard systems invented by the frontier labs today are sophisticated enough to do this. The labs are extremely, extremely bad at safety, relative both to how much impact their products are having on the world and to how good other industries like medicine and manufacturing have gotten at safety. I'm talking about a total and complete culture shift.
Google had an "appoint Eric Schmidt as CEO" moment like this. There were kids running it before. The message is: make them rich and retire them to an island somewhere, because the impact your products are having on the world cannot possibly be allowed to continue with your current leadership at the helm. Dario is that problem now. I think Sam/OpenAI can adapt and mature. I have zero faith that Dario and the furry EA cult in Anthropic leadership will. So this is what being unable to mature looks like: the public will make your company safe, one way or another.