Can you prove that it beats models of different architectures trained under the same limited resources?