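It is annoying how often our offline metrics look perfect while the actual A/B test shows zero lift. The article explains that gap really well by framing recommendation as an interventional problem rather than a purely observational one: the logs only record feedback on items the old system chose to show, so scoring a new model against them doesn't tell you what happens once the new model is the one making the choices. I guess we really need to start looking at counterfactual evaluation if we want our offline tests to actually mean something.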
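
For anyone who hasn't run into it, here is a minimal sketch of about the simplest counterfactual estimator, inverse propensity scoring (IPS). This is not from the article; the toy data, the click rates, and the names `ips_estimate`, `simulate_logs`, and `new_policy` are all made up for illustration, and it assumes you logged the probability with which the old policy showed each item.

```python
import random


def ips_estimate(logs, target_prob):
    """IPS estimate of the average reward the target policy would get.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy showed `action`.
    target_prob: function (context, action) -> probability that the new
          policy would show `action` in that context.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def simulate_logs(n, seed=0):
    """Toy logs (assumption): the old policy mostly shows item A,
    but item B has the higher true click rate."""
    rng = random.Random(seed)
    click_rate = {"A": 0.1, "B": 0.3}
    logging_policy = {"A": 0.8, "B": 0.2}
    logs = []
    for _ in range(n):
        action = "A" if rng.random() < logging_policy["A"] else "B"
        reward = 1.0 if rng.random() < click_rate[action] else 0.0
        # No context in this toy example, so it is stored as None.
        logs.append((None, action, reward, logging_policy[action]))
    return logs


def new_policy(context, action):
    """Hypothetical new policy: always show item B."""
    return 1.0 if action == "B" else 0.0


logs = simulate_logs(50_000)
naive = sum(reward for _, _, reward, _ in logs) / len(logs)
ips = ips_estimate(logs, new_policy)
print(f"average logged reward: {naive:.3f}")
print(f"IPS estimate for new policy: {ips:.3f}")
```

By construction the old policy's true click rate is 0.14 and the always-B policy's is 0.30, so the IPS number should land near the latter even though B only shows up in a fifth of the logs. The catch is variance: when the logging probabilities get small the weights blow up, which is why clipped or self-normalized variants tend to be used in practice.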