I think they do guess a priori what to query...
Later in the article, for a different (but similar) feature:
> To curate a representative set of synthetic emails, we start by creating a large set of synthetic messages on a variety of topics... We then derive a representation, called an embedding, of each synthetic message that captures some of the key dimensions of the message like language, topic, and length. These embeddings are then sent to a small number of user devices that have opted in to Device Analytics.
It's crazy to think Apple is constantly asking my iPhone if I ever write emails similar to emails about tennis lessons (their example). This feels like the least efficient way to understand users in this context. Especially considering they host an email server!
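As I read the quoted bit, the "asking" is roughly: the device embeds a sample of my recent emails, finds which pushed synthetic candidate is most similar, and reports back only a noised index, never content. A minimal sketch of what that device-side step might look like (the cosine-similarity scoring and the randomized-response mechanism here are my assumptions, not Apple's published details):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def on_device_report(user_embeddings, synthetic_embeddings, epsilon=1.0):
    """Hypothetical device-side step: find which pushed synthetic message best
    matches the user's own mail and report only a (noisy) index, never content."""
    # Score each synthetic candidate by its best match against any local email.
    scores = [max(cosine_sim(s, u) for u in user_embeddings)
              for s in synthetic_embeddings]
    best = int(np.argmax(scores))

    # k-ary randomized response: report the true index with probability
    # e^eps / (e^eps + k - 1), otherwise a uniformly random *other* index.
    k = len(synthetic_embeddings)
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    if np.random.rand() < p_truth:
        return best
    others = [i for i in range(k) if i != best]
    return int(np.random.choice(others))
```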
Yeah, the linked paper [1] has more detail. Basically they seem to start with a seed set of "class labels" and subcategories (e.g. "restaurant review" + "steak house"), then ask an LLM to generate lots of random texts incorporating those labels. They build a differentially private histogram of embedding similarities between those texts and the private data, then use that histogram to resample the texts, which become the seeds for the next iteration, sort of like a particle filter.
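In (hypothetical) code, I picture that loop roughly like this; `llm_generate` and `embed` are stand-ins, and the Laplace noise and resampling details are my guesses at the mechanism, not necessarily what the paper actually does:

```python
import numpy as np

def dp_synthetic_loop(seed_labels, private_embeddings, llm_generate, embed,
                      n_candidates=1000, n_iters=3, epsilon=1.0):
    """Sketch of the iterative scheme: generate candidates from class labels,
    score them against private data with a DP histogram, resample, repeat."""
    prompts = list(seed_labels)  # e.g. ["restaurant review: steak house", ...]
    for _ in range(n_iters):
        # 1. LLM generates many random texts conditioned on the current prompts.
        texts = [llm_generate(np.random.choice(prompts)) for _ in range(n_candidates)]
        cand = np.stack([embed(t) for t in texts])            # (n_candidates, d)

        # 2. Each private record "votes" for its nearest candidate
        #    (dot product ~ cosine similarity if rows are unit-normalized).
        sims = private_embeddings @ cand.T                    # (n_private, n_candidates)
        votes = np.bincount(sims.argmax(axis=1), minlength=n_candidates)

        # 3. Differentially private histogram: Laplace noise with sensitivity 1,
        #    since each private record contributes exactly one vote.
        noisy = votes + np.random.laplace(scale=1.0 / epsilon, size=n_candidates)
        weights = np.clip(noisy, 1e-9, None)
        weights /= weights.sum()

        # 4. Resample candidates in proportion to the noisy histogram; the
        #    survivors seed the next round (the particle-filter-like step).
        prompts = list(np.random.choice(texts, size=len(seed_labels), p=weights))
    return prompts
```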
I'm still unclear on how you create that initial set of class labels used to generate the random seed texts, and how sensitive the method is to that initial corpus.
[1] https://arxiv.org/abs/2403.01749