I'm developing a pipeline runner for matching decompilation: https://github.com/macabeus/mizuchi
The initial motivation is to run benchmarks, though the foundation is flexible and can support many other use cases over time.
It's already proving useful. For example, I can run a benchmark, view the results in a dashboard, and even feed the report into Claude Code to answer questions like: "How did changing X affect the results?" or "What could be improved in the next run?"
Curating a benchmark for reverse-engineering functions doesn't seem like a bad idea, actually.