I don't know about this. This means I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions that are not caused directly by a user interaction (i.e. bots, rather than user agents).

[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....

To be more specific, if we assume good faith on the part of our fine congresspeople to craft this well... ok yeah, well, for the sake of the hypothetical I'll continue...

The legal teeth I would advocate would be targeted at crawlers (a subset of bots) and would not cover your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow server-side tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
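To make that concrete, here's a minimal sketch of the server-side part, assuming a hypothetical registry of declared crawler User-Agent patterns (the file name and its format are made up, since no such mandate exists today):

```python
# Hypothetical sketch: drop requests whose User-Agent matches a registry
# of declared crawlers. "registered_crawlers.txt" and its one-pattern-per-line
# format are invented for illustration.
import re

with open("registered_crawlers.txt") as f:
    CRAWLER_PATTERNS = [re.compile(line.strip(), re.I) for line in f if line.strip()]

def reject_registered_crawlers(app):
    """WSGI middleware: answer 403 before the request ever reaches the app."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(p.search(ua) for p in CRAWLER_PATTERNS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Registered crawlers are not allowed here.\n"]
        return app(environ, start_response)
    return wrapper
```

Cheap string matching like this is only viable if crawlers are legally required to identify themselves, which is the whole point of the mandate.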

Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.

Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.

If it's too loose, something like "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against ad blockers, or in favor of the very specific devices you mention. It might even give slightly more teeth to currently-unenforceable TOS.

If it's too strict, it's probably easier to find loopholes and technicalities that just let them say "technically it doesn't match the definition of unwanted traffic".

Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.

I know this is a mini-rant rather than a helpful comment that tries to come up with a solution; it's just that I'm pessimistic because it seems the internet gets a bit worse day by day no matter what we try to do :c

You don't get sued for using a service as it is meant to be used (using an RSS reader on their feed endpoint; cloning repositories that it is their mission to host). Firstly, it doesn't anger anyone, so they wouldn't bother trying to enforce a rule; and secondly, it would be a fruitless case, because the judge would say the claim they're making isn't reasonable.

Robots.txt is meant for crawlers, not user agents such as a feed reader or a git client.
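For what it's worth, crawlers are the ones expected to consume it; Python even ships a parser for exactly this, and a well-behaved crawler checks it before every fetch, while a feed reader or git client typically never looks at it. A tiny sketch (`ExampleCrawler/1.0` and the repository URL are made-up names):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://codeberg.org/robots.txt")
rp.read()  # fetch and parse the rules

# What a polite crawler does before each request; feed readers and
# git clients normally skip this step entirely.
print(rp.can_fetch("ExampleCrawler/1.0", "https://codeberg.org/some/repository"))
```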

I agree with you; generally you can expect good faith to be returned with good faith (but here I want to heavily emphasize that I only agree on the judge part iff good faith can be assumed and the judge is informed enough to actually make an informed decision).

But not everyone thinks that's the purpose of robots.txt. For example, quoting Wikipedia[1] (emphasis mine):

> indicate to visiting web crawlers and *other web robots* which portions of the website they are allowed to visit.

Quoting the linked `web robots` page[2]:

> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, *usually* with the intent to imitate human activity, such as messaging, on a large scale. [...] The *most extensive use* of bots is for web crawling, [...]

("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)

Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.

Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not, where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
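A rough sketch of the kind of script I mean, just to show the scrape-then-cache pattern (the `docs.example.com` URL is obviously made up):

```python
# Hypothetical "search this site from my terminal" script with a local cache.
import hashlib
import pathlib
import sys
import urllib.parse
import urllib.request

CACHE_DIR = pathlib.Path.home() / ".cache" / "sitesearch"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch(url: str) -> str:
    """Return the page body, hitting the network only on a cache miss."""
    cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_text()
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    cached.write_text(body)
    return body

if __name__ == "__main__":
    term = sys.argv[1]
    page = fetch("https://docs.example.com/search?q=" + urllib.parse.quote(term))
    for line in page.splitlines():
        if term.lower() in line.lower():
            print(line.strip())
```

Is that a bot that should obey robots.txt, or just a weird user agent? I genuinely don't know where people would draw the line.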

The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.

But yeah, like I said, I agree with your comment and your interpretation; it's just not the only interpretation of what robots.txt is meant for.

[1]: https://en.wikipedia.org/wiki/Robots.txt

[2]: https://en.wikipedia.org/wiki/Internet_bot

[3]: https://hn.algolia.com/?dateRange=all&query=automated%20robo...

[4]: I don't know how Nitter actually works or where it gets its data from; I just mention it so it's easier to explain what I mean by "alternative viewer".

> This means I'd get sued for using a feed reader on Codeberg

You think Codeberg would sue you?

Probably not.

But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.