I often struggle with this when deviating from the vendor's "happy path." I mostly use the STM32 chips, and I don't particularly care for their HAL library. I find it overcomplicated, and it often has bugs that I have to track down and fix. But boy is it nice to use their STM32CubeMX program to generate all the low-level code so I can just get to work. I tend to end up building my own low-level libraries in my free time because I enjoy it and it gives me a better idea of how the hardware actually works, but I use the STM32 HAL library to write my actual client code at work.

Same experience here. What worked for me was using CubeMX purely for pin and clock config, then dropping down to the LL (low-layer) drivers or direct CMSIS register access for anything in a hot path. The HAL interrupt handlers in particular add a surprising amount of overhead: on a tight DMA transfer loop I measured ~40% of cycles wasted just on HAL callback dispatch.

The LL API is basically thin inline wrappers around register writes, so you still get the CubeMX-generated init code but without the HAL abstraction tax at runtime.
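
To make "thin inline wrappers" concrete, here is a rough host-side sketch of what that style looks like. It is not the actual ST code: `fake_gpio_t` and the `fake_*` functions are made-up stand-ins so it compiles without the device headers, and it assumes a family where the BSRR register handles both set (low 16 bits) and reset (high 16 bits). On a real part you would use `GPIO_TypeDef` from the CMSIS device header and the generated LL functions instead.

```c
#include <stdint.h>

/* Hypothetical stand-in for a GPIO port's register block. On real hardware
 * this would be GPIO_TypeDef from the device's CMSIS header; only the two
 * registers used below are modeled. */
typedef struct {
    volatile uint32_t ODR;   /* output data register */
    volatile uint32_t BSRR;  /* bit set/reset register */
} fake_gpio_t;

/* LL-style wrapper: a single register write. No handle struct, no state
 * machine, no callback dispatch. */
static inline void fake_ll_gpio_set_pin(fake_gpio_t *port, uint32_t pin_mask)
{
    port->BSRR = pin_mask;          /* low 16 bits request "set" */
}

static inline void fake_ll_gpio_reset_pin(fake_gpio_t *port, uint32_t pin_mask)
{
    port->BSRR = pin_mask << 16;    /* high 16 bits request "reset" */
}

/* Host-only helper (assumption): real silicon applies BSRR to ODR by itself.
 * This models that effect so the sketch can be exercised on a PC. */
static inline void fake_gpio_apply_bsrr(fake_gpio_t *port)
{
    uint32_t bsrr  = port->BSRR;
    uint32_t set   = bsrr & 0xFFFFu;
    uint32_t reset = bsrr >> 16;
    port->ODR  = (port->ODR & ~reset) | set;
    port->BSRR = 0;
}
```

Since each wrapper is a `static inline` function around a single store, the compiler can usually fold a call down to the register write itself, which is why there is so little runtime cost compared with going through a HAL handle and its callbacks.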