Hacker News

Yes :)

This is actually a really cool feature of the platform. We ingest DCGM, CUPTI, and cgroups data to give users granular telemetry on exactly what the hardware they allocated is doing while their jobs run on it.

We also have a profiler with single-digit overhead that correlates stack frames with hardware metrics. This means you can see not only whether your job was compute bound or memory bound at time x, but also which areas of your code that maps to [currently only supported in Python, other languages coming soon :)]
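
To make the idea concrete, here is a minimal sketch of what correlating stack frames with hardware metrics could look like. It is not the platform's actual implementation: the sample format, the field names (`t`, `frame`, `sm_util`, `mem_bw_util`), and the nearest-timestamp join are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def nearest_metric(metric_times, metrics, t):
    """Return the metric sample whose timestamp is closest to t.

    metric_times must be sorted and aligned index-for-index with metrics.
    """
    i = bisect_left(metric_times, t)
    if i == 0:
        return metrics[0]
    if i == len(metrics):
        return metrics[-1]
    before, after = metrics[i - 1], metrics[i]
    return before if t - before["t"] <= after["t"] - t else after


def correlate(stack_samples, metrics):
    """Average hypothetical hardware metrics per stack frame.

    stack_samples: list of {"t": float, "frame": str}
    metrics:       list of {"t": float, "sm_util": float, "mem_bw_util": float}
    """
    metrics = sorted(metrics, key=lambda m: m["t"])
    times = [m["t"] for m in metrics]

    # Attach the closest-in-time metric sample to each stack sample.
    per_frame = defaultdict(list)
    for sample in stack_samples:
        per_frame[sample["frame"]].append(nearest_metric(times, metrics, sample["t"]))

    # Reduce to one averaged reading per frame.
    return {
        frame: {
            "sm_util": sum(m["sm_util"] for m in ms) / len(ms),
            "mem_bw_util": sum(m["mem_bw_util"] for m in ms) / len(ms),
        }
        for frame, ms in per_frame.items()
    }
```

A frame whose samples line up with high `sm_util` and low `mem_bw_util` would look compute bound under this toy model, and the reverse would look memory bound. The cutoffs for either label are left out on purpose, since they depend on the hardware and workload.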

Would love to show you a demo of this live. Feel free to email me at ismaeel@expanse.org.uk