Params Is All You Need: Why We Killed Our 0.8B and What a Red Bull Taught Us
We spent a week and six different approaches making a 0.8B parameter model marginally better. Then a quantized 2B made the entire effort irrelevant in 30 seconds. This is the story of how a Red Bull can broke our model, how quantization saved it, and why parameter count will always beat bit precision for on-device AI.
The Setup: Six Ways to Squeeze a Small Model
It started with a reasonable hypothesis: if our 0.8B parameter model could reach a calorie MAE (mean absolute error) of 112.3 on Nutrition5k with basic LoRA fine-tuning, surely with enough engineering we could push it under 90. We had a constrained memory budget -- roughly 1.6 GB of on-device RAM for the model -- and the 0.8B at fp16 precision fit perfectly within that envelope.
So we tried everything. Over the course of a week, we ran six distinct approaches to improve the 0.8B model, each more creative than the last.
| # | Approach | Cal MAE | Notes |
|---|---|---|---|
| 1 | LoRA fine-tune (baseline) | 112.3 | Rank-16 LoRA on Qwen-0.8B, 4K samples |
| 2 | LoRA rank-64 + longer training | 108.1 | More adapter capacity, 2x epochs |
| 3 | Full fine-tune (0.8B) | 104.6 | All params trainable, same data |
| 4 | Data augmentation + curriculum | 101.2 | Synthetic samples, easy-to-hard schedule |
| 5 | Knowledge distillation from 7B | 99.1 | Teacher-student with soft labels |
| 6 | Merged pipeline knowledge | 97.3 | Best 0.8B result -- but fatally flawed |
Six approaches. A week of compute. We went from 112.3 to 97.3 MAE -- a 13% improvement that felt hard-won. The merged knowledge approach from our 5-model pipeline gave us the best number, baking food classification knowledge directly into the model weights before fine-tuning for calorie estimation.
On paper, 97.3 MAE looked great. We were celebrating. Then we tested it on a Red Bull.
The Red Bull Test
The merged pipeline approach had given us great benchmark numbers by baking food knowledge into the 0.8B's weights. But the merge had a catastrophic side effect: it destroyed Qwen's pre-trained vision knowledge. The model's ability to see -- to identify brands, read labels, recognize packaging -- was gone.
We ran more tests. The model couldn't identify muffins. It confused ginger with turmeric. It had no concept of portion sizes for packaged foods. Every benchmark improvement we'd fought for over the past week was a mirage -- the model performed well on Nutrition5k's curated test set but catastrophically failed on anything that required real-world visual understanding.
The 0.8B model didn't have enough capacity to hold both food knowledge and general vision understanding. We weren't dealing with a training problem. We were dealing with a capacity problem. And no amount of clever engineering could fix that.
The Quantization Epiphany
The breakthrough came from frustration, not strategy. After the Red Bull debacle, someone on the team asked a simple question: what if we stopped trying to make the small model smarter and instead made a bigger model smaller?
Our memory budget was ~1.6 GB. The 0.8B model at fp16 (16-bit floating point) uses about 1.6 GB of RAM. But a 2B model at 4-bit quantization uses... also about 1.6 GB of RAM. Same memory footprint. 2.5x as many parameters.
We fine-tuned Qwen-2B on the same Nutrition5k data, quantized to 4-bit, and ran inference. The result arrived in 30 seconds.
87.7 MAE. Not only better than every single one of our six 0.8B approaches -- to our knowledge, better than any published on-device nutrition model. And it passed the Red Bull test. It passed the muffin test. It identified ginger correctly. It could read brand labels. The model had enough capacity to hold everything.
Params > Precision: The Math
The intuition is straightforward once you see it. A neural network's capacity -- its ability to represent complex functions -- scales with parameter count. Precision affects the granularity of each individual weight, but the network's expressiveness is fundamentally determined by how many weights it has.
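To make "granularity of each individual weight" concrete, here is a minimal sketch of symmetric round-to-nearest quantization using only the Python standard library. It is an illustration, not the quantizer we used: the function names are hypothetical, the weights are synthetic Gaussian samples rather than values from any real model, and a single scale is shared by the whole list for simplicity.

```python
import random

def quantize_dequantize(weights, bits):
    """Round each weight to a signed integer code, then map it back to a float.

    Uses one absmax scale for the whole list (a simplifying assumption).
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # float value of one integer step
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code in [-qmax, qmax]
        restored.append(q * scale)
    return restored

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic stand-in for a weight tensor (assumption: small, zero-centred values).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err_8bit = mean_abs_error(weights, quantize_dequantize(weights, 8))
err_4bit = mean_abs_error(weights, quantize_dequantize(weights, 4))
print(f"8-bit mean abs error: {err_8bit:.6f}")
print(f"4-bit mean abs error: {err_4bit:.6f}")
```

Each weight gets coarser as the bit width drops (at 4 bits there are at most 15 distinct values here), but the number of weights is untouched. That is the trade the rest of this section is about.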
Consider the memory equation:
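One weights-only way to write that relation, assuming the footprint is dominated by the stored weights (the symbols $M$, $N$, and $b$ are introduced here for illustration):

$$
M_{\text{bytes}} \approx \frac{N \cdot b}{8}
\qquad\Longleftrightarrow\qquad
N_{\max} \approx \frac{8\, M_{\text{bytes}}}{b}
$$

Here $N$ is the parameter count, $b$ is the number of bits per weight, $M_{\text{bytes}}$ is the memory used by the weights, and $N_{\max}$ is the largest parameter count that fits in a given budget. Activations, quantization scales, and runtime buffers are left out of this approximation.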
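The math reveals a large opportunity. For a fixed memory budget, the number of parameters you can fit is inversely proportional to the bits per weight: halving the precision doubles the parameter count, so in weight storage alone, going from 16-bit to 4-bit allows up to 4x as many parameters. The question is: does reducing precision destroy model quality faster than adding parameters improves it?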
The empirical answer, at least for our task, is a resounding no. The 2B 4-bit model outperformed the 0.8B fp16 model by nearly 10 MAE points. The loss from quantization was far smaller than the gain from additional parameters.
| Configuration | Params | Precision | Memory | Cal MAE |
|---|---|---|---|---|
| 0.8B fp16 | 0.8B | 16-bit | ~1.6 GB | 97.3 |
| 2B 4-bit | 2.0B | 4-bit | ~1.6 GB | 87.7 |
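The arithmetic behind the Memory column can be sketched in a few lines. This is a weights-only estimate with hypothetical helper names, and it assumes 1 GB = 1e9 bytes. Note that it gives about 1.0 GB of raw weights for the 2B 4-bit row; the ~1.6 GB in the table is the figure reported above, and the difference presumably comes from overhead this sketch does not model (for example quantization scales, any layers kept at higher precision, and runtime buffers). That explanation is an assumption, not something we measured here.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); ignores all runtime overhead."""
    return num_params * bits_per_param / 8 / 1e9

def max_params_for_budget(budget_gb, bits_per_param):
    """Largest parameter count whose raw weights fit in the budget."""
    return budget_gb * 1e9 * 8 / bits_per_param

for name, params, bits in [("0.8B fp16", 0.8e9, 16), ("2B 4-bit", 2.0e9, 4)]:
    print(f"{name}: {weight_memory_gb(params, bits):.2f} GB of weights")

for bits in (16, 8, 4):
    ceiling = max_params_for_budget(1.6, bits) / 1e9
    print(f"{bits}-bit: up to {ceiling:.1f}B params in a 1.6 GB budget")
```

Either way the conclusion is the same: at 4-bit, the 2B model fits inside a budget that the 0.8B model fills completely at fp16.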
This isn't just about our model. This is a general principle that the broader ML community has been converging on. Research from Microsoft (BitNet), Meta (LLaMA quantization studies), and others consistently shows that parameter count is the dominant factor in model capability, and that aggressive quantization is a surprisingly efficient way to trade precision for capacity.
Implications for On-Device AI
For anyone building models that need to run on phones, tablets, or edge devices, this insight changes the entire optimization strategy. The conventional approach -- start with a small model and engineer it to be better -- has a hard ceiling. You are fighting against the fundamental capacity of the architecture.
The better approach is to start with the largest model that could conceivably fit in your memory budget at low precision, then work backward to find the optimal quantization level.
1. Define your memory budget (e.g., 1.6 GB)
2. Find the largest base model that fits at aggressive quantization
3. Fine-tune at full precision, then quantize post-training
4. Test on real-world inputs, not just benchmarks
5. Only drop to a smaller model if quantization artifacts are unacceptable
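Steps 1 and 2 of the list above can be sketched as a small search. This is a hedged illustration only: `pick_configuration` and the candidate sizes are hypothetical, the footprint is the weights-only estimate (1 GB = 1e9 bytes), and steps 3 to 5 (fine-tuning, quantizing, and real-world testing) still have to happen outside this function.

```python
def weight_memory_gb(num_params, bits):
    """Weights-only footprint in GB; runtime overhead is not modeled."""
    return num_params * bits / 8 / 1e9

def pick_configuration(budget_gb, model_sizes, bit_widths):
    """Return (num_params, bits) for the largest model that fits the budget.

    Larger models are tried first; for each model, higher precision is tried
    first, so precision is only reduced as far as the budget requires.
    Returns None if nothing fits.
    """
    for params in sorted(model_sizes, reverse=True):
        for bits in sorted(bit_widths, reverse=True):
            if weight_memory_gb(params, bits) <= budget_gb:
                return params, bits
    return None

# Hypothetical candidate model sizes and bit widths, for illustration only.
candidates = [0.8e9, 2.0e9, 8.0e9]
choice = pick_configuration(1.6, candidates, bit_widths=[16, 8, 4])
print(choice)
```

The ordering of the two loops encodes the thesis of this post: parameter count is the outer preference, and precision is the inner one.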
This also reframes the value of techniques like LoRA, knowledge distillation, and data augmentation. These approaches are still valuable -- but they should be applied to the largest feasible model, not used to compensate for a model that is fundamentally too small.
We found that the 2B model responded far better to the same training techniques we'd applied to the 0.8B. The larger model had enough representational capacity to actually use the additional training signal, whereas the 0.8B was bottlenecked by its capacity regardless of how good the training data was.
What's Next: The Ternary Frontier
If 4-bit quantization let us fit 2.5x more parameters in the same budget, what happens at even lower bit widths?
This is the question that led us to 1.58-bit ternary quantization -- the BitNet approach, where every weight is one of three values: {-1, 0, +1}. At 1.58 bits per parameter, you can fit roughly 10x more parameters in the same memory as an fp16 model.
A 1.58-bit 8B model fits in the same memory as an fp16 0.8B. If the "params > precision" thesis holds -- and our 4-bit results strongly suggest it does -- then a ternary 8B model should dramatically outperform everything we've built so far, while still running entirely on-device.
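The 1.58 figure and the roughly 10x ratio both come from simple arithmetic, sketched below. A weight with three possible values carries log2(3) bits of information. The helper name is hypothetical and the footprint is a weights-only, information-theoretic estimate; practical packing schemes may spend somewhat more than this per weight.

```python
import math

# Information content of one weight drawn from {-1, 0, +1}.
bits_per_ternary_weight = math.log2(3)

# How many ternary parameters occupy the space of one fp16 parameter.
capacity_ratio = 16 / bits_per_ternary_weight

def weight_memory_gb(num_params, bits):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

ternary_8b_gb = weight_memory_gb(8e9, bits_per_ternary_weight)
fp16_08b_gb = weight_memory_gb(0.8e9, 16)

print(f"bits per ternary weight: {bits_per_ternary_weight:.3f}")
print(f"parameters vs fp16 at equal memory: {capacity_ratio:.1f}x")
print(f"8B ternary weights: {ternary_8b_gb:.2f} GB")
print(f"0.8B fp16 weights:  {fp16_08b_gb:.2f} GB")
```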
We're actively exploring this direction. The challenges are real: ternary quantization requires quantization-aware training (you can't simply apply post-training quantization down to 1.58 bits), and the hardware support for ternary arithmetic is still maturing. But the potential is enormous.
The 0.8B model is dead. Long live the quantized billions.