There are good options between 2-grams and 600M-parameter models. I don't expect a 2-gram to do very well here. Also, I'm not sure why this model isn't a fine choice if it solves their problem.

What would you suggest instead?

A non-autoregressive transformer trained with a classification objective.

These are absurdly effective for this kind of task. Training is fast and straightforward. Packaging for deployment as ONNX is pretty simple as well.

As a follow-up to the original article, I added a new experiment using logistic regression, and the results are very good. It actually improves accuracy by a few points.

More details here: https://www.teachmecoolstuff.com/viewarticle/using-logistic-...
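
For anyone who wants a feel for how small a logistic regression text classifier can be, here is a rough sketch. It is not the code from the linked article. It assumes bag-of-words features and binary labels, and the toy examples, label meanings, and function names are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data (not from the article): 1 = refund request, 0 = praise.
TOY_EXAMPLES = [
    ("refund my order please", 1),
    ("where is this refund", 1),
    ("great product love it", 0),
    ("love that fast shipping", 0),
]


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would do more.
    return text.lower().split()


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, text):
    # Linear score over token counts; unseen tokens contribute 0.
    counts = Counter(tokenize(text))
    return bias + sum(weights.get(tok, 0.0) * n for tok, n in counts.items())


def train(examples, epochs=200, lr=0.5):
    # Plain stochastic gradient descent on the log loss, no regularization.
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            err = sigmoid(score(weights, bias, text)) - label
            bias -= lr * err
            for tok, n in Counter(tokenize(text)).items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * n
    return weights, bias


def predict_proba(model, text):
    # Probability that the text belongs to class 1.
    weights, bias = model
    return sigmoid(score(weights, bias, text))
```

With `model = train(TOY_EXAMPLES)`, calling `predict_proba(model, "i want a refund")` should come out above 0.5 on this toy data. A real setup would more likely use a library implementation with regularization and a held-out test set.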