My website has a "public" directory for things I want to be publicly accessible, but there's an index.html in it, so you can't trivially discover what files are there.
I was mirroring this directory to another machine, and I'd foolishly already configured haproxy to serve from either machine. In the roughly 10 minutes (a guesstimate) between starting to populate the directory and closing off access, one bot scraped every single file in there; presumably the index.html hadn't been copied across yet, so the directory listing was briefly exposed. Worse, the bot had cached that listing and continued scraping the files from the other backend too, so I had to shut off access to the directory on the live server as well as the backup.
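For context, the haproxy side looked something like this. This is a minimal sketch with hypothetical backend names and addresses, not my actual config:

```
frontend www
    bind *:80
    default_backend site

backend site
    balance roundrobin
    # Both machines serve the same content, so once a scraper has
    # cached the directory listing it can keep fetching files from
    # whichever backend answers; blocking only one isn't enough.
    server live   192.0.2.10:8080 check
    server mirror 192.0.2.11:8080 check
```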
Whilst all of these files were technically "public", in that I wanted to share them, some of them weren't publicly linked anywhere (things like my CV), and I'd rather they hadn't been scraped by some random bot that, of course, neither identified itself nor respected robots.txt.
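For the record, asking crawlers to stay out of the directory only takes a couple of lines of robots.txt (the /public/ path here is illustrative), not that it would have made any difference to this particular bot:

```
User-agent: *
Disallow: /public/
```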