If they game the pelican benchmark, it’d be pretty obvious.
Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.
If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
[deleted]