The fix for machine unlearning in vector databases turns out to be conceptually simple, but it requires changing the semantics of retrieval.
Standard FAISS-style indices store vectors and compute:
argmaxᵢ ⟨q, vᵢ⟩
If you insert −v, nothing happens: it's just another point, and a max over points can't be cancelled by adding more points. The original vector is still maximally similar to itself and remains rank-1.
This isn’t a bug—it’s a consequence of selection-based retrieval.
If instead you store (vector, weight) pairs and evaluate:
φ(q) = Σᵢ wᵢ · K(q, vᵢ)
you get a different object entirely: a field, not a selection. Now inserting the same vector with w = −1 causes destructive interference. The contribution cancels. The attractor disappears.
Deletion becomes O(1) append-only (add the inverse), not a structural rebuild.
FAISS-style: Vec<Vec<f32>> → argmax (selection)
Weighted form: Vec<(Vec<f32>, f32)> → Σ (field)
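To make the comparison concrete, here is a minimal Rust sketch of the weighted form. This is my illustration, not the paper's code: the names FieldStore and sigma, and the choice of a Gaussian kernel, are assumptions.

  // Field store: (vector, weight) pairs evaluated as a superposition.
  struct FieldStore {
      entries: Vec<(Vec<f32>, f32)>, // (vector, weight)
      sigma: f32,                    // kernel bandwidth
  }

  impl FieldStore {
      fn insert(&mut self, v: Vec<f32>) {
          self.entries.push((v, 1.0));
      }

      // Deletion is an O(1) append: the same vector with weight -1.
      fn delete(&mut self, v: Vec<f32>) {
          self.entries.push((v, -1.0));
      }

      // phi(q) = sum_i w_i * exp(-||q - v_i||^2 / (2*sigma^2))
      fn eval(&self, q: &[f32]) -> f32 {
          self.entries.iter().map(|(v, w)| {
              let d2: f32 = q.iter().zip(v).map(|(a, b)| (a - b).powi(2)).sum();
              w * (-d2 / (2.0 * self.sigma * self.sigma)).exp()
          }).sum()
      }
  }

Insert v and later delete it, and its +1 and −1 terms are identical for every q, so they cancel exactly and the attractor at v disappears.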
We validated this on 100k vectors:
• FAISS: target stays rank-1 after “deletion”
• Field-based model: exact cancellation (φ → 0), target unretrievable
The deeper point is that this isn't a trick: it's a semantic separation.
• FAISS implements a selection operator over discrete points.
• The weighted version implements a field operator where vectors act as kernels in a continuous potential.
• Retrieval becomes gradient ascent to local maxima (sketched below).
• Deletion becomes destructive interference that removes attractors.
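A sketch of that gradient-ascent reading, again illustrative: the step size and iteration count are arbitrary, and it assumes the Gaussian-kernel FieldStore above.

  // For the Gaussian kernel:
  //   grad phi(q) = sum_i w_i * K(q, v_i) * (v_i - q) / sigma^2
  // Ascending this gradient moves q toward the nearest surviving attractor.
  fn retrieve(store: &FieldStore, q0: &[f32], steps: usize, lr: f32) -> Vec<f32> {
      let mut q = q0.to_vec();
      let s2 = store.sigma * store.sigma;
      for _ in 0..steps {
          let mut grad = vec![0.0f32; q.len()];
          for (v, w) in &store.entries {
              let d2: f32 = q.iter().zip(v).map(|(a, b)| (a - b).powi(2)).sum();
              let k = w * (-d2 / (2.0 * s2)).exp();
              for (g, (vi, qi)) in grad.iter_mut().zip(v.iter().zip(&q)) {
                  *g += k * (vi - qi) / s2;
              }
          }
          for (qi, g) in q.iter_mut().zip(&grad) {
              *qi += lr * g;
          }
      }
      q // a local maximum of phi; a cancelled vector is not an attractor
  }

A deleted vector contributes zero to both φ and its gradient, so ascent can never terminate at it.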
This shifts deletion from structural (modify index, rebuild, filter) to algebraic (append an inverse element). You get append-only logs, reversible unlearning, and auditable deletion records. The negative weight is the proof.
Implication: current vector DBs can't guarantee GDPR/CCPA erasure without reconstructing the index. Field-based retrieval can, provably.
Paper with proofs: https://github.com/nikitph/bloomin/blob/master/negative-vect...
That makes sense, but how do you efficiently evaluate the Gaussian-kernel-based approach (“operator-based data structures (OBDS)”)? Presumably you want to do it in a way that keeps a dynamically updating data structure, instead of computing a low-rank approximation to the kernel, etc.? In my understanding, the upside of the kNN-based approaches is fast querying and the ability to dynamically insert additional vectors...?
Thank you for the thoughtful comment. Your questions are valid given the title, which I used to make the post more accessible to a general HN audience. To clarify: the core distinction here is not kernelization vs kNN, but field evaluation vs point selection (or selection vs superposition as retrieval semantics). The kernel is just a concrete example.
FAISS implements selection (argmaxᵢ ⟨q, vᵢ⟩), so vectors are discrete atoms and deletion must be structural. The weighted formulation represents a field: vectors act as sources whose influence superposes into a potential. Retrieval evaluates that field (or follows its gradient), not a point identity. In this regime, deletion is algebraic (append (v, −1) for cancellation), evaluation is sparse/local (one way to get this is sketched below), and no index rebuild is required.
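On the efficiency question: one plausible way to keep it dynamic (my reading, not necessarily what the paper does) is to exploit the kernel's fast decay: truncate K at a radius r where it is numerically negligible, keep any dynamic ANN structure purely as a candidate generator, and evaluate the field only over candidates. A sketch, where candidates_within is a naive stand-in for that range query (in practice it would be an HNSW/IVF-style structure that supports inserts):

  // Naive stand-in for a dynamic ANN range query. It is only used to
  // find candidates; scoring is still the weighted kernel sum, so exact
  // cancellation of (v, +1) / (v, -1) pairs is preserved.
  fn candidates_within(store: &FieldStore, q: &[f32], r: f32) -> Vec<usize> {
      let r2 = r * r;
      store.entries.iter().enumerate()
          .filter(|(_, (v, _))| {
              let d2: f32 = q.iter().zip(v).map(|(a, b)| (a - b).powi(2)).sum();
              d2 <= r2
          })
          .map(|(i, _)| i)
          .collect()
  }

  // Evaluate phi(q) over nearby sources only; distant terms are ~0 anyway.
  fn eval_local(store: &FieldStore, q: &[f32], r: f32) -> f32 {
      candidates_within(store, q, r).iter().map(|&i| {
          let (v, w) = &store.entries[i];
          let d2: f32 = q.iter().zip(v).map(|(a, b)| (a - b).powi(2)).sum();
          w * (-d2 / (2.0 * store.sigma * store.sigma)).exp()
      }).sum()
  }

Since a vector and its negative copy sit at the same point, they are always both in or both out of the candidate set, so truncation never breaks cancellation.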
The paper goes into this in more detail.
This isn’t just X, it’s Y.
Hey man, nice slop