When you are operating at scale, you are likely to use a small draft model during the autoregressive phase to generate tokens sequentially, and only involve the large model once you've drafted several tokens. The large model then checks the draft, and whenever the two models predict the same token you effectively generate more than one token per large-model pass. The idea is that the models will agree often enough to significantly reduce the cost per output token. Does anyone know how effective that is in practice?
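
For concreteness, here's a toy Python sketch of the mechanism I mean. The model functions are hypothetical stand-ins, and I'm using simple exact-match (greedy) verification; real implementations verify all draft positions in one batched forward pass and use a probabilistic accept/reject rule rather than exact match:

```python
import random

random.seed(0)
VOCAB = list(range(100))

def small_model(context):
    # Hypothetical cheap draft model: usually repeats the last token.
    return context[-1] if context and random.random() < 0.7 else random.choice(VOCAB)

def large_model(context):
    # Hypothetical target model with similar, but not identical, behavior.
    return context[-1] if context and random.random() < 0.8 else random.choice(VOCAB)

def speculative_decode(prompt, n_tokens, draft_len=4):
    out = list(prompt)
    large_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft: the small model proposes draft_len tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            t = small_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: the large model scores every draft position. Simulated
        #    token-by-token here; a real transformer does this in one
        #    batched forward pass, which is where the speedup comes from.
        large_passes += 1
        ctx = list(out)
        for t in draft:
            target = large_model(ctx)
            if target != t:
                out.append(target)  # keep the correction, discard the rest
                break
            out.append(t)           # token accepted "for free"
            ctx.append(t)
    return out, large_passes

tokens, passes = speculative_decode([1, 2, 3], n_tokens=40)
print(f"generated {len(tokens) - 3} tokens in {passes} large-model passes")
```

Every large-model pass yields at least one token (the correction on a mismatch), so in the worst case you're no slower than plain decoding, and when the draft is accepted you get up to `draft_len` tokens per pass.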