It uses a file tier system to prioritize what to analyze. Entry points, configs, and core source files get fetched fully. Tests and utilities get partial treatment. Generated code, lockfiles, and assets get skipped entirely. So even for large repos it focuses on the stuff that actually matters for understanding architecture.
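The tiering could be as simple as matching paths against ordered glob rules. This is just a sketch of the idea; the tier names and patterns here are my assumptions, not the tool's actual config:

```python
import fnmatch

# Hypothetical tier rules, checked in order. Skip rules come first so
# generated output never falls through to a "full" pattern.
TIER_RULES = [
    ("skip",    ["*.lock", "package-lock.json", "dist/*", "*.min.js", "assets/*"]),
    ("partial", ["tests/*", "test_*.py", "utils/*"]),
    ("full",    ["main.*", "index.*", "src/*", "*.config.*", "Dockerfile"]),
]

def classify(path: str) -> str:
    """Return 'full', 'partial', or 'skip' for a repo file path."""
    for tier, patterns in TIER_RULES:
        if any(fnmatch.fnmatch(path, p) for p in patterns):
            return tier
    return "partial"  # unknown files get the cheap truncated treatment
```

Defaulting unmatched files to partial keeps the token budget bounded without silently dropping anything.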
For really massive repos (100K+ files), the analysis runs as a resumable pipeline: each of the 5 passes saves its results to the database, so if the serverless function times out, the next connection picks up where it left off. Embeddings for chat are also generated incrementally, in batches of 50 chunks.
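The resume logic boils down to checkpointing each pass as it completes. Here's a minimal sketch using SQLite; the pass names, table schema, and helper are all hypothetical stand-ins for whatever the real pipeline does:

```python
import sqlite3

# Assumed pass names -- the source only says there are 5 passes.
PASSES = ["inventory", "structure", "dependencies", "summaries", "embeddings"]
EMBED_BATCH = 50  # chunks embedded per batch, per the description above

def run_pipeline(db: sqlite3.Connection, repo_id: str) -> list[str]:
    """Run the passes, skipping any already checkpointed for this repo.
    If a serverless timeout kills the process mid-run, the next
    invocation resumes at the first incomplete pass."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(repo_id TEXT, pass_name TEXT, PRIMARY KEY (repo_id, pass_name))"
    )
    done = {row[0] for row in db.execute(
        "SELECT pass_name FROM checkpoints WHERE repo_id = ?", (repo_id,))}
    executed = []
    for name in PASSES:
        if name in done:
            continue  # completed on a previous invocation
        # ... the real analysis work for this pass would happen here ...
        executed.append(name)
        db.execute("INSERT INTO checkpoints VALUES (?, ?)", (repo_id, name))
        db.commit()  # persist immediately: a timeout loses at most one pass
    return executed
```

Committing after every pass (rather than once at the end) is what makes the timeout recoverable; the same pattern applies inside the embedding pass, checkpointing after each batch of 50.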
That said, messy codebases are honestly where it's most useful. Clean, well-documented repos don't need a tool like this. The ones with zero docs and 500 files with no clear structure are where it saves the most time.