If you are going to go to the bother of fine-tuning for trivial problems like subject classification, then I think you'll find scikit-learn with an SGDClassifier on 2-grams will probably do just as well, and the trained classifier will be under 1MB.
You can train it in under a minute, and it will work perfectly well on embedded devices.
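For anyone who wants to try that baseline, here is a minimal sketch, assuming scikit-learn is installed. The toy questions and category names are made up for illustration, and `ngram_range=(1, 2)` (unigrams plus bigrams) is one reading of "on 2-grams"; use `(2, 2)` for bigrams only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy training data; the questions and category names are invented for this sketch.
train_texts = [
    "how much is rent for a two bedroom flat",
    "can my landlord raise the rent mid lease",
    "when does the next bus to the airport leave",
    "is there a train from here to the city centre",
    "how do I file my income tax return",
    "what is the deadline for paying property tax",
]
train_labels = ["housing", "housing", "transport", "transport", "tax", "tax"]

# TF-IDF over unigrams and bigrams, feeding a linear model trained with SGD.
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(random_state=0),
)
classifier.fit(train_texts, train_labels)

# predict() can only return labels seen during training, so no invented categories.
predictions = classifier.predict(["is the rent too high for this flat"])
print(predictions)
```

A real run would need far more than six examples; the point is only that the whole pipeline is a few lines and the output is restricted to the training labels by construction.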
Small LLMs are good choices for text classification in two cases:
- If you need to provide in-context examples and classify based on them.
- Your classification goes beyond simple subject-type classifiers. For example, multiple-choice question answering is classification where a small LLM will work but traditional ML methods won't.
> The model invents new categories (e.g. apartments) and doesn’t stick to the provided list of allowed categories
Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature; it's used, e.g., to ensure the output is parseable JSON.)
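As a rough illustration of the grammar idea: constrained decoding is usually a feature of the inference runtime rather than of the model itself, and llama.cpp, for one, accepts grammars in its GBNF format. The sketch below only builds such a grammar from a list of allowed categories. The helper name `build_choice_grammar` and the category list are hypothetical, and how the grammar is passed to the runtime is left out.

```python
def build_choice_grammar(categories):
    """Return a GBNF grammar whose only valid outputs are the given category names."""
    if not categories:
        raise ValueError("at least one category is required")
    alternatives = []
    for name in categories:
        # Escape backslashes and double quotes so each name is a valid string literal.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        alternatives.append(f'"{escaped}"')
    return "root ::= " + " | ".join(alternatives)


allowed = ["housing", "transport", "tax"]  # hypothetical category list
grammar = build_choice_grammar(allowed)
print(grammar)
# root ::= "housing" | "transport" | "tax"
```

With a grammar like this the sampler can only emit one of the listed strings, so a made-up category such as "apartments" is ruled out, though the model can still pick the wrong allowed category.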
existing embedding models like alibaba's modernbert tune or one of the jina v5s would probably map query to category automatically. (i.e. store embeddings of each category and calculate cosine sim for each incoming query vs. categories and pick the closest)
also, you could stick a classifier head on a BERT model as another option.
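The embedding lookup described above (embed each category once, then pick the category closest to the query by cosine similarity) can be sketched as follows. Note that `embed` here is a stand-in bag-of-words counter so the example runs with the standard library alone; in practice it would be replaced by a real embedding model. The categories and their descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector. Swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical categories, each with a short description that is embedded once up front.
category_descriptions = {
    "housing": "rent landlord lease flat apartment house",
    "transport": "bus train airport ticket route",
    "tax": "income tax return deadline property tax",
}
category_vectors = {name: embed(desc) for name, desc in category_descriptions.items()}


def classify(query):
    """Return the category whose vector has the highest cosine similarity to the query."""
    query_vector = embed(query)
    return max(
        category_vectors,
        key=lambda name: cosine_similarity(query_vector, category_vectors[name]),
    )


print(classify("when does the next train leave"))
# transport
```

Because the result is always a key of `category_vectors`, this approach also cannot produce a category outside the allowed list.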
If you're gonna fine-tune for a closed set classification problem like this, you could just fine-tune BERT and get a faster model with better performance.
I mean it's always nice to play around with sLLM finetuning, but for practical purposes I would always start with a lazy learner using embeddings (something like a small Stella model): pre-embed the topics/categories, embed the question, and perform a kNN lookup using cosine distance. You can use an LLM to "expand" the topics before embedding to make them more contextual. This is usually super fast and super simple and gives you a nice baseline. Then I would add a classification head after the embedding layer (with maybe some dropout + 2-3 MLP layers), train my own classifier, and compare that to the lazy learner. Only after that would I start finetuning an LLM.
Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions
(teachmecoolstuff.com) | 210 points by dev-experiments | 21 June 2026 | 49 comments
Comments
- Zero-shot encoders like tasksource or GliNER
- Natural language inference: https://huggingface.co/blog/dleemiller/nli-xenc-ways-to-use
- GRPO training
- GEPA prompt tuning Qwen 0.6B (or GEPA, then GRPO)
- Use an embedding model and train a classifier (MLP, logistic, svm)
- Use a larger LLM to generate a synthetic dataset (beware of lack of diversity, mine "seed text" from real sources first)
- Synthetically generate "hard examples" where more than one category may be valid and DPO tune your preferred responses
The whole reason why embeddings work so well is because they encode the underlying meaning of the texts
Cool write up! Really appreciate it but incidentally how does this categorization help you get better retrieval results?
Half the time I ask Qwen 0.6B "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."
I'm also interested in it as a student for distillation.
https://github.com/i-dot-ai/consult