I am working on https://www.accessmrf.com, a catalog of all health care prices published by insurers under the Transparency in Coverage rule.

I recently wrote a blog post that uses MinHash to estimate that at least 90% of the 1.17-petabyte dataset is duplicate data, and I keep investigating new ways to make this dataset manageable. https://www.felixhaba.com/writing/simplifying-healthcare-pri...
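
For anyone unfamiliar with MinHash, here is a minimal sketch of the general idea. It is not the code from the blog post: the shingle size, the choice of BLAKE2b as the hash, the number of hash functions, and the sample records are all assumptions made up for illustration. The fraction of positions where two signatures agree approximates the Jaccard similarity of the underlying sets, so near-duplicate files can be detected by comparing short signatures rather than full contents.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=128):
    """Build a signature: for each seeded hash, keep the minimum over all items."""
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives an independent-looking hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical records, invented for this example.
    rec_a = "provider 123 | code 99213 | negotiated rate 84.50"
    rec_b = "provider 123 | code 99213 | negotiated rate 84.75"
    set_a, set_b = shingles(rec_a), shingles(rec_b)
    exact = len(set_a & set_b) / len(set_a | set_b)
    approx = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
    print(f"exact Jaccard: {exact:.3f}, MinHash estimate: {approx:.3f}")
```

The estimate gets tighter as `num_hashes` grows, at the cost of a longer signature per file.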