I built an open-source RAG API in Rust to see how low I could push latency without a GPU.
RustyRAG v0.2 hits sub-200ms end-to-end on localhost and sub-600ms from Azure North Central US to a browser in Brazil. The corpus: 977 PDFs chunked into 56K vectors in Milvus, with 3 sources cited per response.
Key changes in v0.2: switched to Cerebras/Groq for LLM inference, replaced Cohere with Jina AI local embeddings (v5-text-nano-retrieval), and added optional contextual retrieval, where an LLM generates a short prefix situating each chunk within its source document before the chunk is embedded.
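The contextual-retrieval step can be sketched roughly like this. This is a minimal illustration, not RustyRAG's actual code: the function name `contextualize_chunk` and the prompt wording are my own, and the `llm` parameter stands in for whatever inference call (Cerebras/Groq here) produces the prefix.

```rust
/// Contextual retrieval sketch: before embedding, each chunk gets an
/// LLM-generated prefix that situates it within its document.
/// `llm` is any text-completion function; a real deployment would wire
/// this to an inference API (hypothetical, not RustyRAG's actual API).
fn contextualize_chunk(document: &str, chunk: &str, llm: impl Fn(&str) -> String) -> String {
    let prompt = format!(
        "Document:\n{document}\n\nChunk:\n{chunk}\n\n\
         Write one sentence situating this chunk within the document."
    );
    let prefix = llm(&prompt);
    // The prefixed text is what gets embedded and stored in the vector DB;
    // the original chunk is kept separately for display at answer time.
    format!("{prefix}\n{chunk}")
}

fn main() {
    // Mock LLM for demonstration purposes only.
    let mock_llm = |_prompt: &str| "This chunk covers v0.2 latency results.".to_string();
    let contextualized = contextualize_chunk(
        "RustyRAG release notes",
        "Sub-200ms on localhost; sub-600ms to Brazil.",
        mock_llm,
    );
    println!("{contextualized}");
}
```

The trade-off is one extra LLM call per chunk at ingest time in exchange for better retrieval precision, which is why the post describes it as optional.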