As a user, I'm worried about a + b, sure. As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?

Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?

> As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?

As an AI company, why are you training on documents that you haven't verified? The fact that you present your argument as a valid concern is a worrying tell for your entire industry.

AI companies gave up on verification years ago. It’s impossible to verify such intense scraping.

not really our problem though is it?

If you are a user of AI tools then it is a problem for you too. If you are not a user of AI tools then this does not impact you. You may save even more time by ignoring AI-related news, and more still by not commenting on it.

Whether or not one uses AI tools, others around them almost certainly do. AI tools are ubiquitous now.

It certainly does impact you if nearly everyone else is using them.

Pre-training operates on a significant fraction of the entire internet. It’s simply not possible.

> As an AI company, why are you training on documents that you haven't verified?

Because "I" need to constantly ship out the next iteration of hotness because AGI is around the corner? Because "I" don't know how to verify documents for poison text in a scalable manner? Because "I" don't care? I am not an AI company, how would I know?

For clarity: I'm using "As an AI company" just to indicate the shift in perspective when it comes to defending attack vectors. I'm not literally indicating that I am (or am affiliated with) an AI company.

I am currently happily retired, and planning to stay that way, assuming the AI bubble crash doesn't take my retirement nest egg with it in a wider market crash. I have no horse in this race, and I haven't been convinced by many AI acceleration stories (though admittedly I haven't given the tools a proper shot, because for hobby projects I like to do things myself). And it's definitely not my (entire) industry. So completely wrong read on many levels there, friend.