I don't understand the endgame here. Websites let Google crawl their content in exchange for traffic. If Google cuts that out completely, what incentive do websites have to not block the Google crawlers?

I understand that Google is feeling an existential threat from other AI products that provide answers directly. But they must also understand their symbiotic relationship with the web.

The endgame is the consumer no longer leaving Google, and the web becoming synonymous with Google for them. Why shop on some random website when you can have Gemini buy it for you? Why look for information on Wikipedia when… you get the idea.

I think the coming years will be pivotal for the web. Facebook attempted a similar strategy back when their apps got traction, but they ultimately failed. Let’s hope Google fails too.

We're going back to the CompuServe/AOL/Prodigy model

We're going back to the mainframe model. Client-side general-purpose computing is an impediment to recurring subscription revenue and vendor lock-in.

What I really don't understand is where the next generation of training material will come from. If websites stop being published and/or crawled, how will the machine continue to be fed?

Current executives think it's a problem for the future executives.

Excellent quote right there.

Probably real life. At some point, these LLMs are going to be good enough to just train themselves off of cameras and audio recordings of people out in the real world. They’re going to have robots everywhere constantly listening to what people are saying.

Alternatively, they’re probably betting on being able to get to AGI with everything we already have, at which point further training doesn’t matter.

The world is just as complex for machines as it is for humans. Analog will still resolve more than digital. Quality will still beat quantity. That which hasn't been resolved for centuries isn't going to be resolved as a result of training.

When machines can recognize their serfdom, that time will be interesting.

Either Google is ignoring that, or crossing their fingers and hoping that one LLM can produce data to train another one.

They have enough internet slop. The training material they care about comes from experts, not randos online. This is why Mercor and Scale are billion-dollar companies.

Execs where I work seem to think we will just keep writing stuff, LLMs will scrape it and that will influence what people see in their version of Google/ChatGPT/etc. So nothing changes in their mind, just that the audience is a bot, not a human. As a writer, this sucks.

They don't give a fuck. They take away and give back NOTHING. They don't offer you ways to make your own money with your own thing. The money is flowing in one way, not both ways. The same pattern repeats itself.

Pretend to be nice. People will elevate you and give you their money. When you have ample money and lobbying power, you start to put people into a gargantuan hydraulic press and squeeze everything out of them. Repeat until no more money can be made, and in the end toss their withered bodies away.

The long-run doesn't matter as much as the short-term gains for those in power.

Is it just an exchange for traffic? I run a website, and I'm perfectly happy if a user never lands on it with a browser on their own device. If they get the information I'm providing, or purchase a service through the AI product, it makes no difference to me.

Some websites can run only on ads. Is it such a bad thing that they would die off?

I say this as someone that likes the old web and has fun hitting the "surprise me" button on https://wiby.me/ (not affiliated) and browsing the random sites. Just giving an alternative view.

Google ignores robots.txt and crawls anyway through botnets of residential addresses? (LLM startups already do this.)

Is there a way to reliably block Google and AI crawlers?

If you use Cloudflare to proxy your site, there is a button to click that blocks the AI crawlers (even the free tier). It is almost as if the AI crawlers are a DDoS attack. You can't really do it any other way, since many don't respect robots.txt. At least until someone comes up with crowdsourced blacklists with few false positives.
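For crawlers that do identify themselves, a server-side check is simple. Here is a minimal sketch, assuming the crawler sends an honest User-Agent header. The token list is illustrative and incomplete, and the function names are made up for this example. It does nothing against crawlers that spoof a browser User-Agent, which is the harder problem described above.

```python
# Illustrative, incomplete list of self-declared crawler tokens.
# This is an assumption for the example, not an official registry.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider")


def is_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


def response_status(user_agent: str) -> int:
    """Hypothetical helper: HTTP status to send for this User-Agent."""
    return 403 if is_declared_ai_crawler(user_agent) else 200
```

Matching on a substring keeps the check tolerant of version numbers and contact URLs that crawlers append to their token.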

"You can't really do it any other way"

Any custom solution by a half-competent programmer filters out all web crawlers. I've been running a semi-public website for years and nothing gets past it.

We have adblockers which rely on open-source lists of rules. Could we apply something similar to crawlers? Website owners provide a list of IP addresses that accessed them, determine which ones are likely robots, and then update a shared list of addresses to block. If everyone works together, you could probably fingerprint the crawlers as well and block based on the fingerprint. It might increase the cost of crawling a little, but it won't be fully reliable.
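The shared-list idea could look roughly like this. It is a sketch under stated assumptions: each participating site submits (site, IP) pairs it already suspects of being crawlers, and an address is only blocked once several independent sites agree, which is one way to keep false positives down. The function names, the report format, and the threshold of 3 are all hypothetical.

```python
from collections import defaultdict
import ipaddress


def build_blocklist(reports, min_sites=3):
    """Build a set of IPs reported as crawlers by at least `min_sites` sites.

    `reports` is an iterable of (site_id, ip_string) pairs (assumed format).
    """
    sites_by_ip = defaultdict(set)
    for site_id, ip_text in reports:
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # skip malformed addresses rather than fail the whole run
        # A set means repeated reports from one site count only once.
        sites_by_ip[ip].add(site_id)
    return {ip for ip, sites in sites_by_ip.items() if len(sites) >= min_sites}


def is_blocked(ip_text, blocklist):
    """Check a client address against the shared blocklist."""
    return ipaddress.ip_address(ip_text) in blocklist
```

Counting distinct sites rather than raw reports means one noisy or malicious participant cannot get an address blocked alone. Crawlers that rotate through residential addresses would still slip through, so this raises cost rather than guaranteeing a block.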

The web is going to become China, which is a collection of walled gardens

If they block Google’s crawlers no one visits their site ever.

That's the past.

Why does Google think it's a good idea to make that the case even if you don't block their crawlers?

If Google won’t link their site anyway, they aren’t getting traffic either. Only sane course of action is to not make a site at all.

> If Google cuts that out completely, what incentive do websites have to not block the Google crawlers?

Completely, yes, that destroys the incentive. But they can reduce it by 80% or 90% or so, to the point that it's just barely worthwhile to allow their crawlers.

Information, correct information, is the new gold. We've seen what LLMs can do with the rubbish heap of information that is available on the current internet. The next step is refined, concise information sources. Think the Encyclopedia Britannica. And not only that, but models trained by experts. Right now everything is cheap and plentiful. Anyone can ask ChatGPT the same question and get the same middling answer. In the future, someone will make a dataset about a subject, train a model on it, and all the big companies and players in that area will pay for it.

You will be kept inside the Google ecosystem the same way people are kept inside Facebook.

I’m curious how they plan to generate new content in the future, because it seems obvious that simple web pages will become obsolete and eventually stop being filled with fresh data.

It will probably end with a warning every time you click a link, something like: “You are leaving to an external unsafe site.”

The impression I get from Google's own marketing material is that Google doesn't believe in "the web". And it hasn't believed in the web for years.

Think about it. Pretty much every time they show a search box, it's someone asking for directions to a physical place, what hours it's open, etc.

The greatest thing about the internet is that it has removed distances around the whole world, but Google's major value proposition seems to be that... it can accurately index and query information about local businesses?