Hacker News

NYCHMPAI 5 hours ago [ - ]

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.

rkochanowski an hour ago [ - ]

Generally, I chunk by function/method (not by whole class), but different languages have specific concepts and features. Nested code units, anonymous functions, lambdas, closures are extracted as separate chunks.

The chunk size has allowed range and those outside are simply ignored.

- Upper limit is hardcoded with a body size of 10k chars

- Lower limit is configurable with a default of 10 AST nodes inside the body

The chunking strategy is something that can be improved in future versions.