An important distinction is that blogs have their own websites and they're not required to publish full articles in their RSS feed.

Bluesky doesn't normally work that way - everything in the PDS gets replicated. They are also encouraging people to put full blog posts in the PDS for easy replication. So, anyone who wants to index it gets a copy and you have no control over what they do.

You don't have to do it that way, though. You can publish your blog on your own website and just publish links to it on Bluesky.
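
As a rough sketch of what "just publish links" could look like, here is a post record that carries only a short blurb and a link card, so the full article never enters the PDS. The field names follow my understanding of the `app.bsky.feed.post` and `app.bsky.embed.external` lexicons and should be checked against the current schema; `build_link_post` and the example.com URL are made up for illustration, and nothing here talks to a server.

```python
from datetime import datetime, timezone


def build_link_post(url, title, description, text):
    """Build a link-only post record (hypothetical helper, not an SDK call).

    Only the blurb and the link card metadata go into the record;
    the article body stays on the author's own website.
    """
    created_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    return {
        "$type": "app.bsky.feed.post",
        "text": text,
        "createdAt": created_at,
        "embed": {
            # Assumed shape of an external link card embed.
            "$type": "app.bsky.embed.external",
            "external": {
                "uri": url,
                "title": title,
                "description": description,
            },
        },
    }


if __name__ == "__main__":
    record = build_link_post(
        url="https://example.com/posts/hello",  # placeholder URL
        title="Hello",
        description="First post on my own site",
        text="New post on my blog",
    )
    print(record)
```

What gets replicated in that case is the blurb and the URL. Anyone who wants the article itself still has to fetch it from your site.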

> So, anyone who wants to index it gets a copy and you have no control over what they do.

How does this differ from scrapers hitting the blog directly?

Web pages aren’t digitally signed, aren’t necessarily indexed by search engines, and there are ways to block bots with things like captchas. You also have much more control over the UI. If your blog has comments, you can moderate them, for better or worse.

With a PDS, the replication happens first, before anyone reads it, and the UI is out of your control.

Maybe that’s okay, but people should understand the tradeoffs.

> Web pages aren’t digitally signed, aren’t necessarily indexed by search engines

Neither of these prevents scraping, and the lack of the first one actually makes it worse: every scraper has to go to the original server and bog it down, instead of getting the data from anyone who has a copy and verifying it with the signature.

> there are ways to block bots with things like captchas

These don't work if you have anything resembling high-value content. AI can solve them now, or do the same proof of work as a real user, when all the scraper needs is to get a few hundred articles once. If they want it enough they can also pay someone in a low-income country to download them manually. Fundamentally, if you post something that any human can access, then someone can copy it. Public is public.

And if the content is the equivalent of blog comment posts, they can probably still get it, but in that case why even care if they do? Notice that this is the same thing that happens on the centralized services, e.g. Facebook uses your Facebook posts to train AI.

How can people control the comments about their blog post on Hacker News? I think my example is closer to what happens with a PDS and an App View.

Honestly that’s just as much because atproto is a raw data protocol. Putting an http frontend on an atproto account is something we encourage and a lot of folks do. I do that on pfrazee.com for instance, and my leaflet blogposts (which are canonically on atproto) render on my blog.