> Is the wearable accurate enough to be sure that 3bpm is not a measurement fluke

If the statistical tests show significance (and are valid), the answer to this question is yes. With enough data you can draw strong conclusions even with imperfect hardware.
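A rough sketch of why more data helps, assuming the wearable's error is random and unbiased, so that it averages out. The 5 bpm per-reading noise and the group sizes are made-up numbers for illustration, not figures from the study, and `standard_error` and `detectable` are hypothetical helper names:

```python
import math

def standard_error(noise_sd: float, n: int) -> float:
    """Standard error of the mean of n independent readings."""
    return noise_sd / math.sqrt(n)

def detectable(effect: float, noise_sd: float, n_per_group: int, z: float = 1.96) -> bool:
    """Crude check: does the effect exceed z standard errors of the
    difference between two independent group means?"""
    se_diff = math.sqrt(2) * standard_error(noise_sd, n_per_group)
    return abs(effect) > z * se_diff

# Assumed numbers: 3 bpm effect, 5 bpm random noise per reading.
for n in (10, 100):
    print(n, round(standard_error(5.0, n), 2), detectable(3.0, 5.0, n))
```

Under these assumptions the 3 bpm difference is lost in the noise with 10 readings per group but clears the threshold with 100. This only covers random error; it says nothing about an error that is consistently different between the two groups.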

Unless the effect they're measuring is that the wearable measures differently on sauna days.

Strong conclusion that the hardware is precisely imperfect?