Interesting. A few questions about the search layer: are you using dense retrieval, sparse, or hybrid? At 2M papers, how do you handle the drift between how engineers describe problems ("my RAG pipeline hallucinates on long docs") vs how papers describe solutions ("cross-document coherence in retrieval-augmented generation")? That query-document vocabulary gap is the hard part of academic search.

we've been working on search for over a year - it's a complex hybrid system now. it does use primitives like word-based search and embeddings, but its power comes from a unique combination of these and other techniques.

yes, the gap between engineer descriptions and paper descriptions is real - we had to work on that. we use a combination of LLMs, vectors and a few more techniques to create a good mapping between the two. the vocab gap didn't hurt us too much because we aren't relying only on word overlap.
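
For readers wondering what "combining word-based search and embeddings" can look like in its simplest form, here is a minimal sketch using reciprocal rank fusion (RRF). This is a generic, well-known technique swapped in for illustration, not the system described above, whose details aren't given. The function name, the `k=60` constant, and the paper ids are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each doc gets 1 / (k + rank) from every list it appears in,
    so docs ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two retrievers for the same query.
keyword_hits = ["paper_a", "paper_b", "paper_c"]    # word-based search
embedding_hits = ["paper_c", "paper_a", "paper_d"]  # vector search

fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])
print(fused)
```

RRF only looks at rank positions, so it avoids having to calibrate keyword scores against embedding similarities. A paper that appears in both lists (like `paper_a` and `paper_c` here) outranks one that appears in only one, which is one way a system can surface papers that share little vocabulary with the query.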