I'm developing a pipeline runner for matching decompilation: https://github.com/macabeus/mizuchi

The initial motivation was running benchmarks, but the foundation is flexible enough to support many other use cases over time.

It's already proving useful. For example, I can run a benchmark, view the results in a dashboard, and even feed the report into Claude Code to answer questions like "How did changing X affect the results?" or "What could be improved in the next run?"