exactly i am actually working on complicated it can be if the model has to reason around the data, we use a combination of regex patterns, named entity recognition, and context aware detection to identify and tag PHI, PCI, and PII before tokenization. AWS doesn't do this for us that's actually the gap we built around.
Our approach is that tokens are format preserving and semantically typed. The model knows it's handling a name, a dollar amount, a diagnosis code it just never sees the real value.