Is it possible to host your website in such a way that it can't be found via search engines (and thus, I hope, wouldn't be crawlable)?
I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.
Sure, it depends on how accessible to people you want it to be.
Most legit search engines are going to honor robots.txt and you can disallow access.
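For anyone who wants to see what "honoring robots.txt" amounts to, here is a minimal sketch using Python's standard-library parser. The robots.txt body, the "ExampleBot" user agent and the example.com URL are made up for illustration, and the check only describes what a well-behaved crawler would do; nothing enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URL are placeholders, not real crawler names.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page"))
```

A crawler that ignores the file never runs a check like this, which is why the next levels exist.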
Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.
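To make the rate limiting idea concrete, here is a rough sketch of a per-client token bucket. This is one common approach, not how Cloudflare or any particular product does it; the names (TokenBucket, allow_request), the default rate and burst size, and the IP used as a key are all assumptions for the example.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Spend one token if available; `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client key (here an IP string); hypothetical in-memory store.
buckets: Dict[str, TokenBucket] = {}


def allow_request(client_ip: str, now: Optional[float] = None,
                  rate: float = 1.0, capacity: float = 5.0) -> bool:
    """Return True if this client is still under its limit."""
    bucket = buckets.get(client_ip)
    if bucket is None:
        bucket = buckets[client_ip] = TokenBucket(rate, capacity, now)
    return bucket.allow(now)
```

With these defaults a client gets five requests immediately and then one more per second. Tighter limits block more bots but also start hitting real visitors behind shared IPs.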
Next would be putting the content behind some form of auth.
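As a rough illustration of "some form of auth", this sketch checks an HTTP Basic Authorization header against a single expected username and password. The function name and the credentials are placeholders; a real site would more likely lean on its web server or framework and store hashed passwords rather than plain ones.

```python
import base64
import hmac
from typing import Optional


def check_basic_auth(header_value: Optional[str],
                     expected_user: str, expected_password: str) -> bool:
    """Return True only if the Authorization header carries the expected credentials."""
    if not header_value:
        return False
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparison avoids leaking how much of the value matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok
```

A crawler without the credentials gets nothing but a 401, which is about as uncrawlable as a public host gets.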
I don't know why we are trusting Cloudflare when they are the ones creating crawlers.
https://developers.cloudflare.com/browser-run/quick-actions/...
Possible, yes; likely, no. The moment you're issued a certificate, your domain will show up in the Certificate Transparency logs, which are constantly monitored by anyone who wants to find new sites.
...Yet another vector through which "security experts" have caused a waterbed problem. Let's secure the Internet, oh no! We made a centralized list of operating domains for hostile actors to guide attacks with!
robots.txt is a way of leaving the door unlocked but kindly asking bots to stay outside.
You might be interested to know that entering an unlocked door into a space you do not have permission to be in is still illegal.
You might be interested to know that the “illegality” depends on the intent. If I rest on your unlocked door handle, it opens, I enter, it’s an accident.
Sorry, what? In this scenario are you claiming that you accidentally fell inside the restricted area because you were leaning on the door? Or are you claiming that you accidentally opened the door and then walked through intentionally? In the former case, you are guilty of breaking and entering in most US jurisdictions if you don’t promptly get out. Any sane court would likely agree an accidental trespass is probably not a criminal act, but it’s not an accident if you stay. In the latter case, you’re clearly trespassing illegally.
Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.
Oops, I just accidentally fell into every website. Don't know how that happened ...
Which in a law-abiding society should be enough. It's also how we do things in the real world in many cases - i.e. here you can just write on your mailbox "no ads" and companies have to respect that.
Even when we do actually put physical locks on things, they are mostly there to show that someone breaking in did so intentionally, and are not at all designed to prevent motivated attackers.
> here you can just write on your mailbox "no ads" and companies have to respect that
Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.
You could just put your website content behind its own chat interface. The crawler would just see a form input for a prompt.
If you're really interested in doing this, and are happy with plain text and limited styling, I recommend trying other protocols, such as a Gemini or Gopher site. I don't think scraping happens on anywhere near the same scale there as on conventional websites.
That being said, your users would need to download a Gemini- or Gopher-compatible browser.