I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!

I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.

As someone on the browsing end, I love Anubis. I've only seen it a couple of times, but it sparks joy. It's rather refreshing compared to Cloudflare, which will usually make me immediately close the page and not bother with whatever content was behind it.

It really reminds me of old Internet, when things were allowed to be fun. Not this tepid corporate-approved landscape we have now.

Anubis is simple; reCAPTCHA and the like are huge, opaque spaghetti.

Same here, really. That's why I started using it. I'd seen it pop up for a moment on a few sites I'd visited, and it was so quirky and so completely non-disruptive that I didn't mind routing my legit users through it.

So maybe there are more people who like the “anime catgirl” than there are who think it’s weird

*anime jackalgirl ;-)

Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.

Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.

The Digital Research Alliance of Canada (the organization unifying and handling the main HPC compute clusters in Canada) now uses Anubis for its wiki. Granted, this is not a business, but still!

https://docs.alliancecan.ca/wiki/Technical_documentation

For what it's worth, I think that a UN (UNICEF?) website (not sure which one) did use Anubis, so maybe you can put businesses behind it too :)

Anyone is free to replace the cat girl with an actual cat or a vintage computer logo or whatnot anyway.

My issue is that it blocks people using browsers without JavaScript.

How can one do this? I didn't find it in the docs.

It’s a feature in the paid version, or I guess you could recompile it if you didn’t want to pay (but my guess is that if you want to change the logo, you can probably pay).


The 3 images are in the repo; you can replace them and rebuild, or point to other ones in the templates.

As someone on the hosting end, I've found Anubis has unfortunately been overused, and thus scrapers, especially Huawei ones, now bypass it. I've gone for go-away instead, which is similar but more configurable in its challenges.

My experience with it is that it somehow took 20 seconds to load (the site might've been HN-hugged at the time), only to "protect" some fucking static page instead of just serving that shit in the first place, rather than wasting CPU on... whatever it was doing to cause the delay.

Same experience for me. I tried it on a low-end smartphone and the Anubis challenge took about 45 seconds to complete.
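
For context on why it can crawl on weak hardware: Anubis-style challenges are proof-of-work, conceptually like the Go sketch below (a generic illustration, not Anubis's actual scheme or parameters). The expected work doubles with each bit of difficulty, and a low-end phone grinding SHA-256 in JavaScript is far slower per hash than a desktop.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits of a SHA-256 digest.
func leadingZeroBits(h [32]byte) int {
	n := 0
	for _, b := range h {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve brute-forces a nonce so that hash(challenge || nonce) has at
// least `difficulty` leading zero bits. Expected cost is about
// 2^difficulty hashes, so each extra bit of difficulty doubles the work.
func solve(challenge []byte, difficulty int) uint64 {
	var buf [8]byte
	for nonce := uint64(0); ; nonce++ {
		binary.LittleEndian.PutUint64(buf[:], nonce)
		h := sha256.Sum256(append(challenge, buf[:]...))
		if leadingZeroBits(h) >= difficulty {
			return nonce
		}
	}
}

func main() {
	fmt.Println("nonce:", solve([]byte("example-challenge"), 20))
}
```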

Reminds me of weird furry porn; I can't say I like it.

Yes, very true! Anubis is a hell of a lot better than Cloudflare Turnstile or its older cousin, Google reCAPTCHA.

Yep, Anubis-chan is super cute! :)

That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search, but instead only do AI.

It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline

I try to share that article as much as possible, it’s interesting.

So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff.

My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.

Crazy

Or some anti-DDoS/bot companies using ultra-cheap scraping services to annoy you enough to get you onto their "free" anti-bot protection, so they can charge the few real AI scrapers for access to your site.

Is there any evidence that this has actually happened?

Even if there isn't (yet?), there's probably someone who's honestly thinking this is potentially a viable business model and at least napkin-mathing it out.

My napkin math says their ROI would be negative. That's a lot of compute and bandwidth they'd have to pay for, even if they were just throwing away the results.

So, it hasn't happened, and you're just making stuff up.

But there is a lot of search engine development going on, it's just that the results of the new search engines are fed straight into AI instead of displayed in the legacy 10-links-per-page view.

Just block all the big hosters' IP ranges when they ignore robots.txt.

For fun, add long timeouts and huge content sizes. No private individual will browse from those ranges, but all the scrapers will.
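
For what it's worth, a minimal sketch of that kind of tarpit as Go middleware; the CIDR ranges are RFC 5737 documentation placeholders (not a real hoster list), and the delays and sizes are arbitrary:

```go
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

// Placeholder CIDRs (RFC 5737 documentation ranges); substitute the
// real ranges of whichever hosters ignore your robots.txt.
var blockedNets []*net.IPNet

func init() {
	for _, cidr := range []string{"192.0.2.0/24", "198.51.100.0/24"} {
		_, n, err := net.ParseCIDR(cidr)
		if err != nil {
			panic(err)
		}
		blockedNets = append(blockedNets, n)
	}
}

// tarpit stalls requests from blocked ranges, then drips out a huge,
// worthless response; everyone else passes straight through.
func tarpit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
			ip := net.ParseIP(host)
			for _, n := range blockedNets {
				if n.Contains(ip) {
					time.Sleep(30 * time.Second) // long timeout first
					for i := 0; i < 10000; i++ { // then huge content, slowly
						w.Write([]byte("..........\n"))
						if f, ok := w.(http.Flusher); ok {
							f.Flush()
						}
						time.Sleep(time.Second)
					}
					return
				}
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", tarpit(mux)))
}
```

In practice you'd likely do this at the edge (firewall or reverse proxy) rather than in the app, but the idea is the same: make ignoring robots.txt cost the scraper wall-clock time.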

> There must be a ton of companies with very large document collections at this point

See, I don't think there is; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.

Where did Linus Torvalds express this philosophy? I have never seen it.

> Where did Linus Torvalds express this philosophy? I have never seen it.

https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...

Could be. Can you train a model without saving things though?

It's actually a well established concept: https://youtu.be/p9KeopXHcf8

*anime jackalgirl

Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!

Ohai! I'm working on dataset poisoning. The early prototype generates vapid LinkedIn posts, but future versions will be fully pluggable with WebAssembly.

Now I'm picturing an AI trained exclusively on LinkedIn posts. One could probably sell that model to an online ad agency for a pretty penny.

And thus AM was born. Woe to us.

Hi Xena! Your blog is amazing! Didn't realize you're working on Anubis - it's a really nice tool for the internet! Reminds me a bit of ye olde internet for some reason.

You've made one of the best solutions, one that matched what I'd thought of implementing myself, and it came at the time it was most needed. I think a couple of "thank you"s are sorely missing from this comment section.

Thank you!

That sounds fun; I look forward to reading a writeup about it.

So I can plan it, how much detail do you want? Here's what I have about the prototype: https://anubis.techaro.lol/docs/admin/honeypot/overview

I'd be interested in reading about any detail you think is cool. When in doubt, err on the side of too much detail.

That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also, “pseudoprofound anti-content”: what a great term! That’s hilarious!
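
For anyone else who hadn't run into it: spintax is the {option a|option b|option c} templating notation, where each group expands to one alternative at random, and groups can nest. A tiny sketch of an expander in Go (just the idea, not whatever Anubis actually does):

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// expand replaces each {a|b|c} group with one randomly chosen
// alternative, resolving innermost groups first so nesting works.
func expand(s string) string {
	for {
		end := strings.IndexByte(s, '}')
		if end < 0 {
			return s // no groups left
		}
		start := strings.LastIndexByte(s[:end], '{')
		if start < 0 {
			return s // stray '}', give up
		}
		options := strings.Split(s[start+1:end], "|")
		pick := options[rand.Intn(len(options))]
		s = s[:start] + pick + s[end+1:]
	}
}

func main() {
	tpl := "{Thrilled|Humbled|Blessed} to announce that {growth|synergy} is {a journey|{my|our} passion}."
	fmt.Println(expand(tpl))
}
```

Feed that enough templates and you get endless, cheap, vaguely plausible filler.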

This is amazing. I was just wondering whether it's possible to tie Anubis together with iocaine, but it seems you've already thought of that.

It's slightly different in subtle ways. If I recall correctly, iocaine makes you configure a subprocess that it executes to generate garbage. One rule I have for Anubis in the code is that fork()/exec() are banned. So the pluggable garbage generator is gonna be powered by CGI handlers compiled to WebAssembly. It should be fun!

As the owner of honeypot.net, I always appreciate seeing the name used as intended out in the wild.

What do people use to get keyword alerts on HN?

I think most people don't do this, and the ones that do have custom solutions. Xena's uses cron, but that's all I know. It's probably a custom shell script.
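
For the curious, a sketch of the cron-job flavor of this in Go, polling the public HN Algolia search API. The keyword and the alerting logic are placeholders; a real script would remember the last objectID it saw and only report newer hits. Xena's actual setup is unknown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// Shape of the public HN Algolia search API response (only the
// fields we care about).
type searchResult struct {
	Hits []struct {
		ObjectID string `json:"objectID"`
		Title    string `json:"title"` // empty for comments
		Author   string `json:"author"`
	} `json:"hits"`
}

func main() {
	// Hypothetical keyword; run this from cron and diff against the
	// previously seen objectIDs to get "new mention" alerts.
	query := url.QueryEscape("anubis")
	resp, err := http.Get("https://hn.algolia.com/api/v1/search_by_date?query=" + query)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var res searchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		panic(err)
	}
	for _, h := range res.Hits {
		fmt.Printf("https://news.ycombinator.com/item?id=%s %q (by %s)\n",
			h.ObjectID, h.Title, h.Author)
	}
}
```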

Correct; my bad!

And hey, Xena! (And thank you very much!)

I checked Xe's profile when I hadn't seen them post here for a while. According to that, they're not really using HN anymore.

See this thread from yesterday or so: https://news.ycombinator.com/item?id=46302496#46306025


> I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!

An even more insane idea -- minding that the premise here is that porn is radioactive to AI training scrapers -- is that there is something the powers that be view as far more disruptive and against-community-guidelines-ish than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. All you need to do is plop in a little diatribe staunchly opposing any such enforced views, and that AI bot will shoot away from your website at lightspeed.

I'm afraid an AI bot and a scraper are different things. It looks like poison is filtered out after scraping no matter where it comes from, so there's no need for them to stop scraping you; that would be extra work.

I like this better than the NSFW links; just include a (possibly LLM-generated) paragraph about not supporting transitions in minor children. Or perhaps one arguing that libraries that remove instructional booklets on how to have same-sex intercourse aren't actually banning books.

That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.

[flagged]

Which cartoon are you referring to? The version of Anubis I installed only has the G-rated default images.

[flagged]

I'm being sincere here: I genuinely don't know what you're talking about.

I'm referring to these default images: https://github.com/TecharoHQ/anubis/tree/main/docs/static/im.... Do you mean something different?

Similar, but yeah. Whatever it prompts with during the challenge. It's creepy, out of context, and inappropriate.

If you keep referring to non-explicit material as pornography, you will continue to confuse people.

If you have an objection to the image other than its pornographic status, please word it clearly.

I was clear on the issue