If you are going to go to the bother of fine tuning for trivial problems like subject classification then I think you'll find Scikit Learn with a SGDClassifier on 2-grams will do probably just as well and be under 1MB for the trained classifier.
You can train it in under a minute, and it will work perfectly well on embedded devices.
Small LLMs are good choices for text classification in two cases:
- If you next to provide in-context examples and classifier based on them.
- Your classification goes beyond simple subject-type classifiers. For example, multiple choice question answering is classification where small LLM will work but traditional ML methods won't/
I would also recommend the approach of using an llm to create the examples, and then train from there.
You can even get fancy and do things like active learning with the llm taking the role of the human annotator and sending in trial statements (and you can use a cheap one for larger gen and a more expensive one for the classification).
I’d be interested in seeing how well LLMs work with writing things like code for what snorkel AI used to have (there was open source code a while back that I assume is still around somewhere, you wrote code that was a low quality set of classifiers and it trained a model around those)
there are models between 2-grams and 600m param models that would be good options. i don't expect a 2-gram to do very well here. also i'm not sure why this model isn't a fine choice if it solves their problem
may I ask where did you get the list? I am looking for ways to get involved in going little more deeper on LLMs (I have very high level understanding, but my direct work doesn't involve them, hence I am not familiar with deeper details)
I'd been working with language models for several years before LLMs were a solution to this kind of problem. These are some ideas "off the top of my head" about how you can do classification in various ways. There's really a lot of ways to tackle it now, and a lot of trade-offs you can learn by experimenting with them.
There's even more options still, especially if you go further back toward more traditional methods. Static word vectors like GloVe or fasttext (optionally more modern equivalents like WordLlama or Model2Vec). Then there's sklearn-style stuff too. Those can be really small/fast but have more accuracy-level tradeoffs.
> The model invents new categories (e.g. apartments) and doesn’t stick to the provided list of allowed categories
Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature, it's used for eg. to ensure the output is parseable json)
Yes, you can use constrained decoding like logit masking to force all invalid tokens in the vocabulary to -inf, and effectively be removed from selection. I believe llama.cpp exposes this by accepting a formatted grammar.
If you're gonna fine-tune for a closed set classification problem like this, you could just fine-tune BERT and get a faster model with better performance.
I have! I recently compared Gemma 1b to ModernBERT Large for a binary classification task and ModernBERT was the clear winner. It learned faster and performed the task better by a significant margin by the end of training. It seems the bidirectional encoder only architecture works really well for classification tasks, and I think it is related to being bidirectional whereas decoder only models like Gemma (or Qwen) can only “look backwards”. I used a mixture of FFT and LoRA as well as a mixture of CE Loss and SupCon Loss.
In a general sense, you see this crop up in SKILLS.md files and other places, as LLMs have to deal with broad contexts. People try and drill into some taxonomy in a naive fashion using plaintext as a directive, which is not particularly optimal.
I wonder if one could build a 'mixture of experts' at the model level that leveraged a variety of small models "within" a larger model...
existing embedding models like alibaba's modernbert tune or one of the jina v5s would probably map query to category automatically. (i.e. store embeddings of each category and calculate cosine sim for each incoming query vs. categories and pick the closest)
also, you could stick a classifier head on a BERT model as another option.
Anything below one billion parameters you can run on the CPU at acceptable speed
For larger sizes you still can, it just becomes slower and slower. For a simple classification task (small input, tiny output, and you can constrain output to a couple tokens) you could even run something like a 4B or 8B model on the CPU
I guess that technically depends on the software used to run the model, but in general it's always been possible to run on a CPU (and may even be possible to run on TPU or something else). It's just been slower. Likewise GPU RAM vs system RAM and the bandwidths involved can make hard bottlenecks.
GPU and VRAM (or fast unified RAM) is generally the option that is both available and performant, but especially really small models also run quite well on CPU and system RAM.
The looping may be due to quantization -- I've seen it on locally quantized Q6_K Qwen 3.5/3.6 models. I recall seeing somewhere (here or r/LocalLlama) that Qwen models are sensitive to quantization of the keys, though I haven't yet experimented with/looked into fixing this. (I've been building up my promptfoo tests/infrastructure to detect looping, etc. on Qwen and other models.)
A fun thing I do with Qwen 3.5 0.8b is to take a screenshot of the Hackernews homepage and ask it to give me a JSON representation of the data and it does surprisingly well. With a well structured prompt I think it could be made to be pretty reliable tool for that type of task out of the box.
Yes apologies, Hackernews was just an example, you can do this with any website - it’s just a simple benchmark I like to use for testing vision models.
I mean it's always nice to play around with sLLM finetuning, but for practical purposes I would always start with a lazy learner using embeddings (something like a small Stella model), pre-embed the topics/categories, embed the question, perform a kNN using cosine distance. You can use an LLM to "expand" the topics before embedding to make them more contextual. This is usually super fast and super simple and gives you a nice baseline. Then I would add a classification head after embedding layer (with maybe some dropout + 2-3 MLP layers) and train my own classifier, and compare that to lazy learner. Only after that would I start finetuning an LLM.
If you are going to go to the bother of fine tuning for trivial problems like subject classification then I think you'll find Scikit Learn with a SGDClassifier on 2-grams will do probably just as well and be under 1MB for the trained classifier.
You can train it in under a minute, and it will work perfectly well on embedded devices.
Small LLMs are good choices for text classification in two cases:
- If you next to provide in-context examples and classifier based on them.
- Your classification goes beyond simple subject-type classifiers. For example, multiple choice question answering is classification where small LLM will work but traditional ML methods won't/
Not with 800 examples. If you are going to consider an ngram model, I think you are better off getting a frontier llm to write you an absurd regex.
Hmm maybe. Turns out the author trained a logistic-regression classifier on the embeddings too, but didn't report the results:
https://github.com/thelgevold/fine-tuned-classifier/blob/mai...
Expanding on this experiment using logistic regression is an interesting continuation, detailed here: https://www.teachmecoolstuff.com/viewarticle/using-logistic-...
In summary: Using logistic regression actually improves accuracy, but also performance during both runtime and during training.
I would also recommend the approach of using an llm to create the examples, and then train from there.
You can even get fancy and do things like active learning with the llm taking the role of the human annotator and sending in trial statements (and you can use a cheap one for larger gen and a more expensive one for the classification).
I’d be interested in seeing how well LLMs work with writing things like code for what snorkel AI used to have (there was open source code a while back that I assume is still around somewhere, you wrote code that was a low quality set of classifiers and it trained a model around those)
A small transformer like BERT or variants is a better fit. It only takes a few examples, which can be generated synthetically using an LLM.
Trains quickly and classifies speedily on modern hardware.
Had a lot of fun doing stuff like this years ago, before LLMs were a thing.
there are models between 2-grams and 600m param models that would be good options. i don't expect a 2-gram to do very well here. also i'm not sure why this model isn't a fine choice if it solves their problem
What would you suggest instead?
A non-autoregressive transformer trained with a classification objective.
These are absurdly effective for this kind of task. Training is fast and straight forward. Packaging for deployment as ONNX is pretty simple as well.
If you want to go deeper on language models, try these project ideas:
- Zero-shot encoders like tasksource or GliNER
- Natural language inference: https://huggingface.co/blog/dleemiller/nli-xenc-ways-to-use
- GRPO training
- GEPA prompt tuning Qwen 0.6B (or GEPA, then GRPO)
- Use an embedding model and train a classifier (MLP, logistic, svm)
- Use a larger LLM to generate a synthetic dataset (beware of lack of diversity, mine "seed text" from real sources first)
- Synthetically generate "hard examples" where more than one category may be valid and DPO tune your preferred responses
may I ask where did you get the list? I am looking for ways to get involved in going little more deeper on LLMs (I have very high level understanding, but my direct work doesn't involve them, hence I am not familiar with deeper details)
I'd been working with language models for several years before LLMs were a solution to this kind of problem. These are some ideas "off the top of my head" about how you can do classification in various ways. There's really a lot of ways to tackle it now, and a lot of trade-offs you can learn by experimenting with them.
There's even more options still, especially if you go further back toward more traditional methods. Static word vectors like GloVe or fasttext (optionally more modern equivalents like WordLlama or Model2Vec). Then there's sklearn-style stuff too. Those can be really small/fast but have more accuracy-level tradeoffs.
If you are interested in small language model to fine tune, gemma3:270m is quite interesting for its size
> The model invents new categories (e.g. apartments) and doesn’t stick to the provided list of allowed categories
Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature, it's used for eg. to ensure the output is parseable json)
It can.
It's something that is implemented by the thing that runs the model - eg Llama.cpp - rather than the model itself.
Note that it is hard to make work if you turn thinking on because the grammar gets complicated quickly (I don't recall if Qwen 0.6B can do thinking).
Thinking shouldn't be too hard to deal with---just let the model generate freely until it hits a </think> token, then do constrained decoding, right?
Sure, but does llama-cpp support that?
This was my thought as well. I'm surprised that it's not being used here (afaict)
Yes, you can use constrained decoding like logit masking to force all invalid tokens in the vocabulary to -inf, and effectively be removed from selection. I believe llama.cpp exposes this by accepting a formatted grammar.
But why using an encoder model instead of a BERT based model? For a pure classification that should be easier to train and work quite well
If you're gonna fine-tune for a closed set classification problem like this, you could just fine-tune BERT and get a faster model with better performance.
Has anyone compared recently doing something like ModernBERT plus classifier vs. full or lora FT of a small LM like qwen?
I have! I recently compared Gemma 1b to ModernBERT Large for a binary classification task and ModernBERT was the clear winner. It learned faster and performed the task better by a significant margin by the end of training. It seems the bidirectional encoder only architecture works really well for classification tasks, and I think it is related to being bidirectional whereas decoder only models like Gemma (or Qwen) can only “look backwards”. I used a mixture of FFT and LoRA as well as a mixture of CE Loss and SupCon Loss.
“As an example, the question “When did we replace our pool pump?” will be mapped to a category called “pool” before querying the Index database.”
Cool write up! Really appreciate it but incidentally how does this categorization help you get better retrieval results?
Categorization allows for retrieval strategy
In a general sense, you see this crop up in SKILLS.md files and other places, as LLMs have to deal with broad contexts. People try and drill into some taxonomy in a naive fashion using plaintext as a directive, which is not particularly optimal.
I wonder if one could build a 'mixture of experts' at the model level that leveraged a variety of small models "within" a larger model...
because you have a different vector store for each categorization?
What if the question crosses categories?
I think the Qwen 0.6B is so cool. It is super fast and as illustrated here it has a clear niche, esp. when fine-tuned.
I'm also interested in it as a student for distillation.
existing embedding models like alibaba's modernbert tune or one of the jina v5s would probably map query to category automatically. (i.e. store embeddings of each category and calculate cosine sim for each incoming query vs. categories and pick the closest)
also, you could stick a classifier head on a BERT model as another option.
Do small language models run on cpus or you still need a gpus to run them?
Anything below one billion parameters you can run on the CPU at acceptable speed
For larger sizes you still can, it just becomes slower and slower. For a simple classification task (small input, tiny output, and you can constrain output to a couple tokens) you could even run something like a 4B or 8B model on the CPU
I guess that technically depends on the software used to run the model, but in general it's always been possible to run on a CPU (and may even be possible to run on TPU or something else). It's just been slower. Likewise GPU RAM vs system RAM and the bandwidths involved can make hard bottlenecks.
GPU and VRAM (or fast unified RAM) is generally the option that is both available and performant, but especially really small models also run quite well on CPU and system RAM.
iGPUs are often slower or only as fast as CPUs when it comes to LLM text generation.
The advantage is mainly in memory bandwidth. External GPUs' internal memory is slightly faster than DDR attached to your CPU.
Other types of "AI" models do make use of the extra compute in GPUs but not LLMs.
Are 0.6b models useful without fine tuning?
Half of the times I ask qwen 0.6b "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."
If you don't want the thinking, you can pass `enable_thinking: false` to the `chat_template_kwargs`. If using promptfoo, this can be done via:
The looping may be due to quantization -- I've seen it on locally quantized Q6_K Qwen 3.5/3.6 models. I recall seeing somewhere (here or r/LocalLlama) that Qwen models are sensitive to quantization of the keys, though I haven't yet experimented with/looked into fixing this. (I've been building up my promptfoo tests/infrastructure to detect looping, etc. on Qwen and other models.)A fun thing I do with Qwen 3.5 0.8b is to take a screenshot of the Hackernews homepage and ask it to give me a JSON representation of the data and it does surprisingly well. With a well structured prompt I think it could be made to be pretty reliable tool for that type of task out of the box.
While a fun poc, surely it would be better to just use the API (see the footer)? Or just `curl | x2j | jq` and map the HTML directly to JSON?
Yes apologies, Hackernews was just an example, you can do this with any website - it’s just a simple benchmark I like to use for testing vision models.
I mean it's always nice to play around with sLLM finetuning, but for practical purposes I would always start with a lazy learner using embeddings (something like a small Stella model), pre-embed the topics/categories, embed the question, perform a kNN using cosine distance. You can use an LLM to "expand" the topics before embedding to make them more contextual. This is usually super fast and super simple and gives you a nice baseline. Then I would add a classification head after embedding layer (with maybe some dropout + 2-3 MLP layers) and train my own classifier, and compare that to lazy learner. Only after that would I start finetuning an LLM.
Very cool write-up and GitHub repo!
Is it just me or half these comments read like AI
Tangentially related, but the UK Gov Incubator for AI has quite a nifty LLM driven classification pipeline for survey answers.
https://github.com/i-dot-ai/consult
[flagged]
[flagged]
[dead]
[dead]