You can see their general approach to guardrail classifiers in these posts:
https://www.anthropic.com/research/constitutional-classifier...
https://www.anthropic.com/research/next-generation-constitut...
It's not just keyword matching, but I'm sure they tuned the Fable classifiers pretty hard to avoid false negatives.
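To make "tuned to avoid false negatives" concrete, here's a toy sketch of the usual threshold tradeoff. Nothing in it comes from the linked posts: the scores, labels, and the `confusion_counts` helper are all made up for illustration, and I'm only assuming the classifier emits a score that gets compared against a cutoff.

```python
# Hypothetical classifier scores and ground-truth labels (1 = harmful, 0 = benign).
# These numbers are invented for illustration, not taken from any real system.
scores = [0.05, 0.20, 0.35, 0.55, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 1]


def confusion_counts(threshold):
    """Return (false_negatives, false_positives) when flagging score >= threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn, fp


# A middling cutoff misses one harmful item and flags one benign item.
print(confusion_counts(0.5))
# A much lower cutoff catches every harmful item but flags more benign ones.
print(confusion_counts(0.15))
```

On this toy data, lowering the cutoff removes the missed harmful item but blocks an extra benign one, which is the kind of tradeoff I'd expect from a classifier tuned that way.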