The false positive rate you're describing matches what we see when running similarity detection on generated text rather than code: cosine similarity alone flags a lot of same-topic pairs that aren't actually duplicates. What helped was combining the embedding score with a structural signal (AST edit distance for code; overlapping headings and citations for text), so that no single metric makes the call. It's also worth surfacing the raw similarity score in the CLI output rather than only a binary duplicate flag, since people will want to tune the threshold per codebase.
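
Roughly what I have in mind, as a minimal Python sketch rather than anything tied to your implementation. Everything named here is hypothetical: `compare`, the two threshold parameters and their default values, and the output format. It also assumes the embeddings are already computed and passed in as plain lists of floats. To stay stdlib-only it uses a `difflib` ratio over AST node-type sequences as a cheap stand-in for a real AST edit distance, so treat that part as a placeholder.

```python
import ast
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def node_types(source):
    """Flatten a Python source string into its sequence of AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(src_a, src_b):
    """Stand-in for AST edit distance: sequence ratio over node types, in [0, 1]."""
    matcher = SequenceMatcher(None, node_types(src_a), node_types(src_b), autojunk=False)
    return matcher.ratio()


def compare(emb_a, emb_b, src_a, src_b, emb_threshold=0.9, struct_threshold=0.8):
    """Return both raw scores plus a flag that requires BOTH signals to agree.

    The threshold defaults are placeholders, meant to be tuned per codebase.
    """
    emb = cosine(emb_a, emb_b)
    struct = structural_similarity(src_a, src_b)
    return {
        "embedding": emb,
        "structural": struct,
        "duplicate": emb >= emb_threshold and struct >= struct_threshold,
    }


if __name__ == "__main__":
    # Two snippets whose (made-up) embeddings are close but whose structure differs.
    result = compare(
        [1.0, 0.0],
        [0.95, 0.3122],
        "def f(x):\n    return x + 1\n",
        "class A:\n    pass\n",
    )
    # Print the raw scores next to the flag so users can see why a pair was or wasn't flagged.
    print(
        f"embedding={result['embedding']:.3f} "
        f"structural={result['structural']:.3f} "
        f"duplicate={result['duplicate']}"
    )
```

Requiring both scores to clear their thresholds is only one way to combine them; a weighted sum would work too. The point is that a high cosine score on its own can't trigger the flag, and that both raw numbers are visible in the output.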