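I want to look at this from a different perspective… a single-precision floating-point multiply is pretty simple, no? The core is a 24x24-bit significand multiply, which is a bit over half as many gates as a 32x32-bit multiply (576 vs. 1024 partial-product bits, assuming a plain array multiplier). The rest is an 8-bit exponent add, a sign XOR, and some normalize/round logic.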
Maybe I would prefer to rip out the integer multiplication unit first, before ripping out the FPU.
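
To make the "it's mostly a 24x24 multiply" point concrete, here is a rough software sketch of a single-precision multiply on raw bit patterns. The helper names (`f32_bits`, `bits_f32`, `f32_mul_normal`) are made up for illustration, and it assumes both inputs and the result are normal numbers: no zeros, subnormals, infinities, or NaNs, which real hardware also has to handle.

```python
import struct

def f32_bits(x: float) -> int:
    """Round a Python float to float32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32_mul_normal(a_bits: int, b_bits: int) -> int:
    """Multiply two normal float32 bit patterns (illustrative sketch only)."""
    sign = (a_bits ^ b_bits) & 0x80000000      # sign is a single XOR
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000        # 24-bit significand, implicit 1 restored
    mb = (b_bits & 0x7FFFFF) | 0x800000
    exp = ea + eb - 127                        # add exponents, remove one bias

    prod = ma * mb                             # the 24x24 -> 48-bit multiply

    # The product lies in [2^46, 2^48); normalize back to a 24-bit significand.
    if prod & (1 << 47):
        shift = 24
        exp += 1
    else:
        shift = 23
    mant = prod >> shift
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)

    # Round to nearest, ties to even.
    if rem > half or (rem == half and (mant & 1)):
        mant += 1
        if mant == 1 << 24:                    # rounding carried out of the top bit
            mant >>= 1
            exp += 1

    if not 0 < exp < 255:
        raise ValueError("result is not a normal float32; not handled in this sketch")
    return sign | (exp << 23) | (mant & 0x7FFFFF)
```

For example, `bits_f32(f32_mul_normal(f32_bits(1.5), f32_bits(2.5)))` gives `3.75`. Everything other than the `ma * mb` line is shifts, adds, and compares on narrow fields.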