Hacker News

sReinwald 2 months ago

The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.

And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.

Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.

Look, I get the frustration that (likely) motivated this. Discord has become an information black hole for many communities, and the shift away from open, searchable forums for project support is a genuine problem I've been incredibly frustrated with myself. But this "solution" - creating a massive, non-consensual archive that tramples over user privacy (and platform terms) - creates far graver ethical and practical issues than the one it purports to solve.

xk_id a month ago

Honestly, maybe they should. Maybe we need more stuff like this, until people finally wake up about the privacy catastrophe. The now-defunct service spy.pet used to sell this kind of data with the stated purpose of doxxing people. There are black markets for this. And it’s the same kind of data the service providers themselves have full access to.

searchcord 2 months ago

> The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.

Not really, it is not free to host and serve this data. If they want to get the data for free, they can get it directly from Discord. I did that work for them.

> And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.

Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.

> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.

I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. This also happens on platforms like Telegram; look at the SangMata_BOT, for example. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.

Thanks for your input, though, I really do want to build a platform that balances privacy and usability.

belst 2 months ago

> I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.

And that makes it ok for you to do as well? Bots storing all the messages is also not ok, but they don't publish them, so it is way less problematic.

sReinwald a month ago

Okay, the "not really" and "I'll solve that problem if and when" responses are... something else. It feels like you're speedrunning how to get into a world of trouble while hand-waving away every legitimate concern. Let's try to unpack this again, because your justifications are frankly baffling.

> Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.

That's not quite correct, and frankly it borders on willful obfuscation. In your own words elsewhere in this thread, you're eager for search engines to index this archive. That "privacy preserving" barrier of needing to know both a user ID and a server/channel ID evaporates the moment Google or any other search engine hoovers up your pages. At that point, any combination of keywords, usernames, aliases, or snippets could reveal someone's posting history, across contexts and years. How is that "functionally identical" to Discord's walled-garden search or "privacy preserving"?

> I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized.

This is a disingenuous deflection.

  - Moderation bots operate within a specific server, for a specific purpose (moderation, utility) defined by the server admins. Their logs are typically for admin/moderator use, not for creating a global, publicly searchable archive.
  - Users joining a server often see these bots, understand their function, and server admins explicitly add these bots. It's a known quantity. What you're doing is orders of magnitude different - an external, uninvited entity scraping everything discoverable and making it universally public.
  - "Just a matter of time" is a lazy, fatalistic excuse for unethical data harvesting. Just because something can be technically scraped doesn't mean it should be, or that you doing so is fine.

Your "I really do want to build a platform that balances privacy and usability" line sounds utterly hollow when the entire foundation of the project demonstrates a profound misunderstanding, or disregard, for basic privacy, consent, and intellectual property.

Speaking of which... have you actually thought about the legal Pandora's Box you're prying open? Your casual "I'll deal with Discord's ToS issues if they arise" attitude is quaint, because Discord's ToS is likely the tip of a colossal iceberg of legal trouble.

You're not just 'breaking ToS', you're potentially looking at:

  - Data Protection Law Violations (GDPR, CCPA, etc.) because you're scraping personal data of EU/California (and other) residents without any lawful basis. The fines can be astronomical. "Opt-out" after the fact for data you had no right to take in the first place isn't how this works.
  - COPPA Violations if you scraped any messages from a 12-year-old on a "public, discoverable" server before their account was deleted by Discord. Guess who's holding that data now without parental consent? You.
  - Every original, creative message is copyrighted by its author. Roleplay, detailed discussions, code snippets, even well-crafted tirades – you're republishing millions of these. While not every "lol" is copyrightable, a massive volume of content on Discord absolutely is. "Fair use" for wholesale, non-transformative republication on this scale? Unlikely.
  - And last but not least, CSAM (Child Sexual Abuse Material): This is the nightmare scenario. You are scraping public Discord. Some public, poorly moderated Discords inevitably contain links to or text-based CSAM. Even if you don't intend to host it, if your scraper picks it up and it becomes accessible via your archive (even just a link), you are in profoundly serious trouble. "But I don't re-publish attachments" is irrelevant if you're archiving and re-publishing the links. This isn't just fines; this is potential prison time.

Good luck with all of this.

I hope you have a good lawyer, ideally multiple. You might need them.

IShowSlow a month ago

The COPPA part only applies if it was done knowingly.

areyourllySorry a month ago

did you type this?

deakam 2 months ago

Ridiculous take. If you're posting in a server that's intentionally open to the public, accessible to anyone with a link, or even indexed by server discovery, you shouldn't expect privacy. That's just the basic reality of the internet.

sReinwald a month ago

No, what's "ridiculous" is this simplistic, black-and-white framing that deliberately ignores any nuance, the concept of contextual integrity or reasonable user expectations.

Of course, no one expects absolute secrecy in a public-facing Discord server. That's a straw man. The issue isn't about some naive belief that messages are invisible. It's about the scope, permanence, and method of access and archiving.

People participating in public Discord spaces have reasonable contextual expectations about how their words will be accessed and by whom. They expect their messages to be seen by current and maybe future server members - not extracted, permanently archived, and made globally searchable by entirely unrelated third parties.

This is similar to how conversations in a public park are technically "public," but most people would be rightfully disturbed if someone recorded everything, transcribed it, published it online with their names attached, and made it all searchable forever. Just because something isn't strictly private doesn't mean any and all forms of collection, republication, and indexing are ethically justified.

If you can't see the distinction between "not perfectly private within this specific semi-public space" and "archived indefinitely, and globally searchable forever by anyone, anywhere, for any reason," then you're either arguing in bad faith or your understanding of these issues is so superficial that further engagement is pointless.

deakam a month ago

[flagged]

sReinwald a month ago

It seems the core concept of contextual integrity is still not landing.

It's not a question of surprise that public data can be scraped - I'm well aware of how the internet functions, thank you. The point, which you seem determined to evade, is about the fundamental ethics of systematically doing so and the vast difference in impact and expectation between, say, a server's own moderation logs or incidental screenshots, and a third-party, globally indexed, permanent archive. The former serves limited, often known functions within that specific community; the latter is a privacy-invasive data trawl weaponizing the 'public' label. Just because a thing is technically possible doesn't grant a free pass to ignore privacy implications or users' reasonable expectations of how their contributions will be used and disseminated.

Your attempt to dismantle the 'public park' analogy only underscores your misunderstanding of it. The scenario isn't about someone yelling (an exceptional event, often a public nuisance, that might indeed attract specific attention or recording). It's the equivalent of someone systematically planting listening devices by every park bench, transcribing every casual, low-expectation conversation - like my dinner plans with my girlfriend, or a vent about my boss - and then publishing it all online, forever, simply because the park itself is 'public' and it was a technically possible thing to do. The ethical chasm between observing a public spectacle and conducting mass, indiscriminate surveillance of everyday, semi-private interactions within a public space shouldn't be this difficult to grasp. One involves a specific event; the other is a dragnet.

As for flagging, I didn't touch your comment. I have never flagged a single comment on this site. Perhaps others simply disagreed with the quality, relevance or the dismissive tone of your contribution.

deakam a month ago

I won't continue a discussion with someone who relies on AI for writing. The response you posted shows the tells of someone using a language model to write it.

a month ago

[deleted]

deakam a month ago

[flagged]