I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!

I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.

As someone on the browsing end, I love Anubis. I've only seen it a couple of times, but it sparks joy. It's rather refreshing compared to Cloudflare, which will usually make me immediately close the page and not bother with whatever content was behind it.

It really reminds me of old Internet, when things were allowed to be fun. Not this tepid corporate-approved landscape we have now.

Anubis is simple; reCAPTCHA and the like are huge, opaque spaghetti.

Same here, really. That's why I started using it. I'd seen it pop up for a moment on a few sites I'd visited, and it was so quirky and so completely non-disruptive that I didn't mind routing my legit users through it.

So maybe there are more people who like the “anime catgirl” than there are who think it’s weird

*anime jackalgirl ;-)

Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.

Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.

The Digital Research Alliance of Canada (the organization unifying and handling the main HPC compute clusters in Canada) now uses Anubis for its wiki. Granted, this is not a business, but still!

https://docs.alliancecan.ca/wiki/Technical_documentation

For what it's worth, I think that a UN (UNICEF?) website (not sure which one) did use Anubis, so maybe you can put businesses behind it too :)

Anyone is free to replace the cat girl with an actual cat or a vintage computer logo or whatnot anyway.

My issue is that it blocks people using browsers without JavaScript.

How can one do this? I didn't find it in the docs.

It’s a feature in the paid version, or I guess you could recompile it if you didn’t want to pay (but my guess is that if you want to change the logo, you can probably pay).


The 3 images are in the repo; you can replace them and rebuild, or point to other ones in the templates.

As someone on the hosting end, I've found Anubis has unfortunately been overused, and thus scrapers, especially Huawei ones, now bypass it. I've gone for go-away instead, which is similar but more configurable in its challenges.

My experience with it is that it somehow took 20 seconds to load (the site might've been HN-hugged at the time), only to "protect" some fucking static page instead of just serving that shit in the first place, rather than wasting CPU on... whatever it was doing to cause the delay.

Same experience for me. I tried it on a low-end smartphone and the Anubis challenge took about 45 seconds to complete.
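
For context on why it can crawl on weak hardware: Anubis-style challenges are proof-of-work, conceptually like the Go sketch below (a generic illustration, not Anubis's actual scheme or parameters). The expected work doubles with each bit of difficulty, and a low-end phone grinding SHA-256 in JavaScript is far slower per hash than a desktop.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits of a SHA-256 digest.
func leadingZeroBits(h [32]byte) int {
	n := 0
	for _, b := range h {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve brute-forces a nonce so that hash(challenge || nonce) has at
// least `difficulty` leading zero bits. Expected cost is about
// 2^difficulty hashes, so each extra bit of difficulty doubles the work.
func solve(challenge []byte, difficulty int) uint64 {
	var buf [8]byte
	for nonce := uint64(0); ; nonce++ {
		binary.LittleEndian.PutUint64(buf[:], nonce)
		h := sha256.Sum256(append(challenge, buf[:]...))
		if leadingZeroBits(h) >= difficulty {
			return nonce
		}
	}
}

func main() {
	fmt.Println("nonce:", solve([]byte("example-challenge"), 20))
}
```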

Reminds me of weird furry porn; I can't say I like it.

Yes, very true! Anubis is a hell of a lot better than Cloudflare Turnstile or its older cousin, Google reCAPTCHA.

Yep, Anubis-chan is super cute! :)

That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search, but instead only do AI.

It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline

I try to share that article as much as possible, it’s interesting.

So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff.

My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.

Crazy

Or some anti-DDoS/bot companies using ultra-cheap scraping services to annoy you enough to get you onto their "free" anti-bot protection, so they can charge the few real AI scrapers for access to your site.

Is there any evidence that this has actually happened?

Even if there isn't (yet?), there's probably someone who's honestly thinking this is potentially a viable business model and at least napkin-mathing it out.

My napkin math says their ROI would be negative. That's a lot of compute and bandwidth they'd have to pay for, even if they were just throwing away the results.

So, it hasn't happened, and you're just making stuff up.

But there is a lot of search engine development going on, it's just that the results of the new search engines are fed straight into AI instead of displayed in the legacy 10-links-per-page view.

Just block all the big hosters' IP ranges when they ignore robots.txt.

For fun, add long timeouts and huge content sizes. No private individual will browse from those ranges, but all the scrapers will.
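
For what it's worth, a minimal sketch of that kind of tarpit as Go middleware; the CIDR ranges are RFC 5737 documentation placeholders (not a real hoster list), and the delays and sizes are arbitrary:

```go
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

// Placeholder CIDRs (RFC 5737 documentation ranges); substitute the
// real ranges of whichever hosters ignore your robots.txt.
var blockedNets []*net.IPNet

func init() {
	for _, cidr := range []string{"192.0.2.0/24", "198.51.100.0/24"} {
		_, n, err := net.ParseCIDR(cidr)
		if err != nil {
			panic(err)
		}
		blockedNets = append(blockedNets, n)
	}
}

// tarpit stalls requests from blocked ranges, then drips out a huge,
// worthless response; everyone else passes straight through.
func tarpit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
			ip := net.ParseIP(host)
			for _, n := range blockedNets {
				if n.Contains(ip) {
					time.Sleep(30 * time.Second) // long timeout first
					for i := 0; i < 10000; i++ { // then huge content, slowly
						w.Write([]byte("..........\n"))
						if f, ok := w.(http.Flusher); ok {
							f.Flush()
						}
						time.Sleep(time.Second)
					}
					return
				}
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", tarpit(mux)))
}
```

In practice you'd likely do this at the edge (firewall or reverse proxy) rather than in the app, but the idea is the same: make ignoring robots.txt cost the scraper wall-clock time.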

> There must be a ton of companies with very large document collections at this point

See, I don't think there is; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.

Where did Linus Torvalds express this philosophy? I have never seen it.

> Where did Linus Torvalds express this philosophy? I have never seen it.

https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...

Could be. Can you train a model without saving things though?

It's actually a well established concept: https://youtu.be/p9KeopXHcf8

*anime jackalgirl

Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!

Ohai! I'm working on dataset poisoning. The early prototype generates vapid LinkedIn posts, but future versions will be fully pluggable with WebAssembly.

Now I'm picturing an AI trained exclusively on LinkedIn posts. One could probably sell that model to an online ad agency for a pretty penny.

And thus AM was born. Woe to us.

Hi Xena! Your blog is amazing! Didn't realize you're working on Anubis - it's a really nice tool for the internet! Reminds me a bit of ye olde internet for some reason.

You've made one of the best solutions, one that matched what I'd thought of implementing myself, and it came at the time it was most needed. I think a couple of "thank you"s are sorely missing from this comment section.

Thank you!

That sounds fun; I look forward to reading a writeup about it.

So I can plan it, how much detail do you want? Here's what I have about the prototype: https://anubis.techaro.lol/docs/admin/honeypot/overview

I'd be interested in reading about any detail you think is cool. When in doubt, err on the side of too much detail.

That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also, “pseudoprofound anti-content”: what a great term! That’s hilarious!
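
For anyone else who hadn't run into it: spintax is the {option a|option b|option c} templating notation, where each group expands to one alternative at random, and groups can nest. A tiny sketch of an expander in Go (just the idea, not whatever Anubis actually does):

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// expand replaces each {a|b|c} group with one randomly chosen
// alternative, resolving innermost groups first so nesting works.
func expand(s string) string {
	for {
		end := strings.IndexByte(s, '}')
		if end < 0 {
			return s // no groups left
		}
		start := strings.LastIndexByte(s[:end], '{')
		if start < 0 {
			return s // stray '}', give up
		}
		options := strings.Split(s[start+1:end], "|")
		pick := options[rand.Intn(len(options))]
		s = s[:start] + pick + s[end+1:]
	}
}

func main() {
	tpl := "{Thrilled|Humbled|Blessed} to announce that {growth|synergy} is {a journey|{my|our} passion}."
	fmt.Println(expand(tpl))
}
```

Feed that enough templates and you get endless, cheap, vaguely plausible filler.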

This is amazing. I was just wondering whether it's possible to tie Anubis together with iocaine, but it seems you've already thought of that.

It's slightly different in subtle ways. If I recall correctly, iocaine makes you configure a subprocess that it executes to generate garbage. One rule I have for Anubis in the code is that fork()/exec() are banned. So the pluggable garbage generator is gonna be powered by CGI handlers compiled to WebAssembly. It should be fun!

As the owner of honeypot.net, I always appreciate seeing the name used as intended out in the wild.

What do people use to get keyword alerts on HN?

I think most people don't do this, and the ones that do have custom solutions. Xena's uses cron, but that's all I know. It's probably a custom shell script.
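
For the curious, a sketch of the cron-job flavor of this in Go, polling the public HN Algolia search API. The keyword and the alerting logic are placeholders; a real script would remember the last objectID it saw and only report newer hits. Xena's actual setup is unknown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// Shape of the public HN Algolia search API response (only the
// fields we care about).
type searchResult struct {
	Hits []struct {
		ObjectID string `json:"objectID"`
		Title    string `json:"title"` // empty for comments
		Author   string `json:"author"`
	} `json:"hits"`
}

func main() {
	// Hypothetical keyword; run this from cron and diff against the
	// previously seen objectIDs to get "new mention" alerts.
	query := url.QueryEscape("anubis")
	resp, err := http.Get("https://hn.algolia.com/api/v1/search_by_date?query=" + query)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var res searchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		panic(err)
	}
	for _, h := range res.Hits {
		fmt.Printf("https://news.ycombinator.com/item?id=%s %q (by %s)\n",
			h.ObjectID, h.Title, h.Author)
	}
}
```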

Correct; my bad!

And hey, Xena! (And thank you very much!)

I checked Xe's profile when I hadn't seen them post here for a while. According to that, they're not really using HN anymore.

See this thread from yesterday or so: https://news.ycombinator.com/item?id=46302496#46306025


> I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!

An even more insane idea -- minding that the premise here is that porn is radioactive to AI training scrapers -- is that there is something the powers that be view as far more disruptive and against-community-guidelines-ish than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. All you need to do is plop in a little diatribe staunchly opposing any such enforced views, and that AI bot will shoot away from your website at lightspeed.

I'm afraid an AI bot and a scraper are different things. It looks like poison is filtered out after scraping no matter where it comes from, so there's no need for them to stop scraping you; that would be extra work.

I like this better than the NSFW links; just include a (possibly LLM-generated) paragraph about not supporting transitions in minor children. Or perhaps one arguing that libraries that remove instructional booklets on how to have same-sex intercourse aren't actually banning books.

That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.

[flagged]

Which cartoon are you referring to? The version of Anubis I installed only has the G-rated default images.

[flagged]

I'm being sincere here: I genuinely don't know what you're talking about.

I'm referring to these default images: https://github.com/TecharoHQ/anubis/tree/main/docs/static/im.... Do you mean something different?

Similar, but yeah. Whatever it prompts with during the challenge. It's creepy, out of context, and inappropriate.

If you keep referring to non-explicit material as pornography, you will continue to confuse people.

If you have an objection to the image other than its pornographic status, please word it clearly.

I was clear on the issue