Have you considered a deterministic tier before the embedding pass? I feel that approach could be more efficient.

There are good, mature tools for deterministic duplication detection, so I intentionally focused on an embedding-based approach to fill the gap (I didn't find other tools that work this way).

If by "more efficient" you mean avoiding embedding the same code multiple times, that optimization is already implemented internally.
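
For readers wondering what that kind of optimization can look like, here is a minimal sketch of one common way to do it: cache embeddings by a hash of the code text. This is an illustration and not necessarily how the tool does it internally; `EmbeddingCache` and `embed_fn` are hypothetical names, and `embed_fn` stands in for whatever model call produces the vector.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by content hash (hypothetical helper, not the tool's API)."""

    def __init__(self, embed_fn):
        # embed_fn is an assumed callable: str -> list of floats.
        self._embed_fn = embed_fn
        self._cache = {}
        self.misses = 0  # number of times the model was actually called

    def embed(self, code):
        # Identical code text always maps to the same key.
        key = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(code)
        return self._cache[key]
```

With this shape, two byte-identical snippets cost one model call, while any textual difference (even whitespace) produces a new key.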

We did this using ASTs. You can go quite far without embeddings, and the result is easier to debug because you can follow what's going on.