We use mocking to replace actual LLM calls when testing the correctness of the ThalamusDB code. For performance benchmarking, we ran quite a few experiments measuring time, costs (fees for LLM calls), and result accuracy. The latter is the hardest to evaluate since we need to compare the ThalamusDB results to the ground truth. Often, we used data sets from Kaggle that come with manual labels (e.g., camera-trap pictures labeled with the animal species, which gives us ground truth for test queries that count the number of pictures showing specific animals).
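To make that concrete, here is a minimal sketch of the mocking idea (the names classify_image and count_matching are hypothetical, not the actual ThalamusDB test suite): the real LLM call is replaced with canned responses so the surrounding query logic can be verified deterministically.

```python
# Minimal sketch (hypothetical names): swap the LLM call for canned answers
# so the query-processing logic around it can be tested deterministically.
from unittest.mock import patch


def classify_image(path: str) -> str:
    """Stand-in for a function that would normally call an LLM."""
    raise NotImplementedError("would call the LLM API here")


def count_matching(paths, species):
    """Toy query operator: count images the (mocked) LLM labels as `species`."""
    return sum(1 for p in paths if classify_image(p) == species)


def test_count_matching_with_mocked_llm():
    canned = {"img1.jpg": "deer", "img2.jpg": "fox", "img3.jpg": "deer"}
    # Replace the module-level classify_image with the canned lookup.
    with patch(f"{__name__}.classify_image", side_effect=lambda p: canned[p]):
        assert count_matching(canned.keys(), "deer") == 2


if __name__ == "__main__":
    test_count_matching_with_mocked_llm()
    print("ok")
```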

When someone claims that a system can search “approximately” or “semantically”, that means there is some sort of statistical behavior. There will be error. That error can be systematically characterized with enough data. But if it can’t be, or isn’t, then it’s a toy.
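For concreteness, a sketch of what such a characterization could look like (illustrative only, not tied to any particular system): measure the error rate against labeled ground truth and attach a confidence interval.

```python
# Illustrative sketch: estimate the error rate of a semantic filter against
# labeled ground truth and attach a 95% confidence interval (Wilson score).
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """95% Wilson score interval for the true error rate."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Example: 37 mistakes on 1,000 labeled camera-trap images.
low, high = wilson_interval(errors=37, n=1000)
print(f"observed error rate 3.7%, 95% CI [{low:.1%}, {high:.1%}]")
```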

A problem I have with LLMs and the way they are marketed is that they are being treated as, and offered as if they were, toys.

You’ve given a few tantalizing details, but what I would really appreciate is a link to the full details of exactly what you did to collect sufficient evidence that this system can be trusted, and in what ways it can be trusted.

The approximation in ThalamusDB is relative to the best accuracy that can be achieved using the associated language models (LLMs). E.g., if ThalamusDB has processed only a subset of rows using LLMs, it can reason about the possible query results that could arise from applying LLMs to the remaining rows (taking all possible outcomes into account).
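A toy sketch of that bounding idea (count_bounds is a made-up name; this is not ThalamusDB's actual implementation): for a COUNT query with an LLM-evaluated predicate, the rows not yet processed can still go either way, which yields guaranteed lower and upper bounds on the result.

```python
# Toy sketch (not ThalamusDB's code): bound a COUNT query when only some
# rows have been evaluated by the LLM; the rest could still go either way.
def count_bounds(llm_results, total_rows):
    """
    llm_results: list of booleans for rows already evaluated by the LLM.
    total_rows:  total number of rows the predicate applies to.
    Returns (lower_bound, upper_bound) on the exact count.
    """
    evaluated = len(llm_results)
    matches = sum(llm_results)
    unprocessed = total_rows - evaluated
    # Lower bound: no remaining row matches; upper bound: all of them do.
    return matches, matches + unprocessed


# Example: 120 of 200 processed rows match, 800 rows still unprocessed.
print(count_bounds([True] * 120 + [False] * 80, total_rows=1000))  # (120, 920)
```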

In general, when using LLMs, there are no formal guarantees on output quality anymore (but the same applies when using, e.g., human crowd workers for comparable tasks such as image classification).

Having said that, we did run experiments evaluating output accuracy for a prior version of ThalamusDB; the results are here: https://dl.acm.org/doi/pdf/10.1145/3654989. We will publish more results for the new version within the next few months as well. But, again, no formal guarantees.

With humans we don’t need guarantees, because we have something called accountability and reputation. We also understand a lot about how and why humans make errors, and so human errors make sense to us.

But LLMs routinely make errors that, if made by a human, would cause us to believe that human is utterly incompetent, acting in bad faith, or dangerously delusional. So we should never just shrug and say that nobody’s perfect. I have to be responsible for what my product does.

Thanks for the link!