For sure. For what it's worth though, I have run across several references to ARM also implementing uop caches as a power optimization versus just running the decoders, so I'm inclined to say that whatever its cost, it pays for itself. I am not a chip designer though!
Apple never used a uop cache in their designs. ARM dropped uop caches when they removed 32-bit support. Qualcomm also skipped the uop cache.
A uop cache made sense with 32-bit support because the 32-bit ISA was so complex to decode (though still simple compared to x86). Once they went to the simpler 64-bit-only instruction set, the cost of decoding every instruction every time was lower than the cost of maintaining the uop cache.
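A back-of-the-envelope way to see that trade-off, as a toy Python sketch. The energy numbers, the 80% hit rate, and the function names are all made up for illustration; they are not measurements from any real core, and a real design has many more factors (area, front-end width, branch behavior).

```python
# Toy model: average front-end energy per instruction, in arbitrary units.
# Every number below is a hypothetical placeholder, not data from a real chip.

def energy_with_uop_cache(decode, hit, fill, overhead, hit_rate):
    """Average energy per instruction when a uop cache sits in front of the decoders.

    decode   -- cost of running the full decoder on one instruction
    hit      -- cost of reading already-decoded uops from the cache
    fill     -- extra cost of writing decoded uops into the cache on a miss
    overhead -- fixed per-instruction cost of keeping the cache around (tags, leakage)
    hit_rate -- fraction of instructions served from the uop cache
    """
    miss = decode + fill  # a miss still pays for decode, plus the fill
    return hit_rate * hit + (1.0 - hit_rate) * miss + overhead


def uop_cache_pays_off(decode, hit, fill, overhead, hit_rate):
    """True if the uop cache is cheaper than simply decoding every time."""
    return energy_with_uop_cache(decode, hit, fill, overhead, hit_rate) < decode


# Hypothetical "expensive decode" ISA vs. hypothetical "cheap decode" ISA.
complex_isa = dict(decode=10.0, hit=2.0, fill=1.0, overhead=1.0)
simple_isa = dict(decode=3.0, hit=2.0, fill=1.0, overhead=1.0)
HIT_RATE = 0.8  # assumed, same for both cases

if __name__ == "__main__":
    for name, isa in (("complex", complex_isa), ("simple", simple_isa)):
        verdict = uop_cache_pays_off(hit_rate=HIT_RATE, **isa)
        print(f"{name} decode: uop cache pays off = {verdict}")
```

With these invented numbers the cache wins when decode is expensive and loses when decode is cheap, which is the shape of the argument above. Change the placeholders and the answer can flip either way.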