The framework used in the book, malt[0], is currently not GPU-accelerated, but GPU support is being worked on.

Maybe of interest: I used it for a toy implementation of the GPT architecture[1] in about 500 lines.

(I studied with one of the authors, Dr. Daniel Friedman; I wasn't heavily involved with the book itself, but I proofread a late draft and TA'd for a course based on it.)

[0]: https://github.com/themetaschemer/malt

[1]: https://github.com/sporkl/malt-transformer