I understand the frustration with AI-written posts lately, but this was the opposite of that. It took months of hard work and many late nights. The hardware's Technical Reference Manual (TRM) is public, but it doesn't explain how to work within the strict 4 KB memory bank limits. I had to figure out how to shard and tile the model myself, because the hardware crashes if you try to store data across those banks. Getting that 15x speedup was a long battle with memory constraints.
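
To give a rough idea of what tiling around a 4 KB bank limit involves, here is a minimal Python sketch. It is an illustration under assumptions, not the actual implementation: the names `BANK_BYTES`, `tile_shape`, and `iter_tiles` are hypothetical, it treats the model as a plain 2D weight matrix, and it only handles the size constraint (each tile fits in one bank), not alignment or any other hardware rule.

```python
BANK_BYTES = 4 * 1024  # the 4 KB bank limit mentioned above


def tile_shape(rows, cols, elem_bytes, bank_bytes=BANK_BYTES):
    """Pick a (tile_rows, tile_cols) shape whose byte size fits in one bank.

    Assumption: prefer full-width rows, then stack as many rows as fit.
    """
    max_elems = bank_bytes // elem_bytes
    if max_elems < 1:
        raise ValueError("a single element does not fit in one bank")
    tile_cols = min(cols, max_elems)
    tile_rows = max(1, min(rows, max_elems // tile_cols))
    return tile_rows, tile_cols


def iter_tiles(rows, cols, elem_bytes, bank_bytes=BANK_BYTES):
    """Yield (row_start, col_start, height, width) for each tile.

    Every tile is at most bank_bytes in size, so no tile spans two banks.
    Edge tiles are smaller when the matrix is not an exact multiple.
    """
    tile_rows, tile_cols = tile_shape(rows, cols, elem_bytes, bank_bytes)
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            yield r, c, min(tile_rows, rows - r), min(tile_cols, cols - c)


if __name__ == "__main__":
    # Hypothetical example: a 128 x 128 matrix of 1-byte (int8) weights.
    for tile in iter_tiles(128, 128, elem_bytes=1):
        print(tile)
```

Each tile would then be placed in its own bank, which is the "shard" half of the problem. How tiles map to physical banks depends on the hardware and is not covered here.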