Any thoughts on layering on-GPU work stealing or cudf on top?

For gfql (a graph query language that maps down to cudf calls), we're trying to jettison the python->cpu->gpu hot loop, so we've been loosely watching cuTile evolve!
