Might also be useful for dataset curation, or even just prompt engineering. For example when training a classification task and picking a diverse set of examples for training or evaluation.
Might also be useful for dataset curation, or even just prompt engineering. For example when training a classification task and picking a diverse set of examples for training or evaluation.
True, I think that's also a great usecase! Though these algorithms likely won't scale to very large datasets (e.g. millions of samples), but for smaller datasets, like fine-tuning sets, I think this would work very well. I've worked on something similar in the past that works for larger datasets (semantic deduplication: https://github.com/MinishLab/semhash).