Yes, that's a pain, especially because different papers often use different test sets, test settings, or metrics. What we do is find benchmarks across papers, aggregate them, and highlight what is comparable and what is not.
Actually, we also give you a short set of bullet points that summarizes all of it, so the output is easy for you to take in.