I don't see why the transformer architecture can't be designed and trained with separate inputs for control data and content data.

Give it a shot.
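
For anyone who wants to try it, here is a rough sketch of one way separate control and content inputs could be wired up. It is a toy under stated assumptions, not a tested design. It assumes two things that are not in the original message: each token carries a channel tag that gets its own learned embedding, and the attention mask stops control positions from attending to content positions. All names (`CONTROL`, `CONTENT`, `build_mask`, `encode`, `make_tables`) are hypothetical, the projections are identity for brevity, and no training is shown.

```python
import math
import random

# Hypothetical channel ids: instructions vs. the data being processed.
CONTROL, CONTENT = 0, 1


def make_tables(vocab=10, d=4, seed=0):
    """Random stand-ins for learned token and channel embedding tables."""
    rng = random.Random(seed)
    tok = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vocab)]
    chan = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(2)]
    return tok, chan


def build_mask(channels):
    """mask[i][j] is True when position i may attend to position j.

    Assumption: control positions see only control positions, while
    content positions see everything.
    """
    return [[not (ci == CONTROL and cj == CONTENT) for cj in channels]
            for ci in channels]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, mask):
    """Single-head self-attention with identity Q/K/V projections."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(x)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * x[j][c] for j in range(len(x)))
                    for c in range(d)])
    return out


def encode(control_ids, content_ids, tok_table, chan_table):
    """Embed both inputs with their channel tag, then apply masked attention."""
    ids = list(control_ids) + list(content_ids)
    channels = [CONTROL] * len(control_ids) + [CONTENT] * len(content_ids)
    x = [[t + c for t, c in zip(tok_table[tid], chan_table[ch])]
         for tid, ch in zip(ids, channels)]
    return attention(x, build_mask(channels))


if __name__ == "__main__":
    tok, chan = make_tables()
    a = encode([1, 2], [3, 4, 5], tok, chan)
    b = encode([1, 2], [7, 8, 9], tok, chan)
    # Control-position outputs do not depend on the content tokens.
    print(a[:2] == b[:2])
```

In this toy, swapping the content tokens leaves the control positions' outputs unchanged, which is the structural half of the idea. Whether a trained model would then actually refuse to treat content as instructions is a separate question that depends on the training data and objective, and this sketch says nothing about it.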