This is really cool, and definitely needed.
Do you track resource consumption over the runtime of a job? We have many jobs that use the requested memory for only a portion of their runtime and are otherwise compute bound. It would be nice to learn each job's resource profile over time and layer jobs together to get better resource utilization.
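To make the layering idea concrete, here is a minimal sketch. It assumes each job's memory usage has already been sampled at a fixed interval; the function names and the numbers are made up for illustration and do not come from any particular scheduler.

```python
def sum_of_peaks(profiles):
    """Memory to reserve if every job is given its own peak for its whole runtime."""
    return sum(max(p) for p in profiles)


def peak_of_sum(profiles):
    """Memory actually needed if the jobs share a node and their profiles are layered."""
    length = max(len(p) for p in profiles)
    combined = [0.0] * length
    for p in profiles:
        for t, used in enumerate(p):
            combined[t] += used
    return max(combined)


# Hypothetical memory profiles in GB, one sample per time step.
job_a = [60, 60, 10, 10, 10, 10]  # memory-heavy at the start
job_b = [10, 10, 10, 60, 60, 10]  # memory-heavy later on

reserved = sum_of_peaks([job_a, job_b])
needed = peak_of_sum([job_a, job_b])
print(f"reserved by peak: {reserved} GB, needed when layered: {needed} GB")
```

With these two profiles the peaks never overlap, so layering needs far less memory than reserving each job's peak. A real scheduler would also have to deal with profiles that vary from run to run.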
Yes :)
This is actually a really cool feature of the platform. We ingest metrics from DCGM, CUPTI, and cgroups to give users granular telemetry on exactly what the hardware they allocated is doing while their jobs run on it.
We also have a profiler with single-digit overhead that correlates stack frames with hardware metrics. This means you will not only be able to see whether your job was compute bound or memory bound at time x, but also be able to correlate that with specific areas of your code [currently only supported in Python - other languages coming soon :) ]
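As a rough illustration of what correlating stack frames with hardware metrics can look like (a simplified sketch, not the platform's actual implementation): assume you have timestamped stack samples and timestamped GPU utilization samples, and attribute each metric sample to the most recent stack sample. The function name and the data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def attribute_metrics(stack_samples, metric_samples):
    """Return the mean metric value per function.

    stack_samples: list of (timestamp, function_name), sorted by timestamp.
    metric_samples: list of (timestamp, value), sorted by timestamp.
    Each metric sample is attributed to the latest stack sample at or before it.
    """
    stack_times = [t for t, _ in stack_samples]
    totals = defaultdict(lambda: [0.0, 0])  # function -> [sum, count]
    for t, value in metric_samples:
        i = bisect_right(stack_times, t) - 1
        if i < 0:
            continue  # metric sample predates the first stack sample
        func = stack_samples[i][1]
        totals[func][0] += value
        totals[func][1] += 1
    return {func: total / count for func, (total, count) in totals.items()}


# Hypothetical samples: (seconds, function) and (seconds, GPU utilization %).
stacks = [(0.0, "load_batch"), (1.0, "train_step"), (3.0, "load_batch")]
metrics = [(0.5, 5.0), (1.5, 90.0), (2.5, 94.0), (3.5, 7.0)]

result = attribute_metrics(stacks, metrics)
print(result)
```

In this toy data the GPU sits nearly idle during `load_batch` and is busy during `train_step`, which is the kind of signal that points you at the part of the code to optimize.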
Would love to show you a demo of this live. Feel free to email me at ismaeel@expanse.org.uk