Insight On-Device

March 2026 · 6 min read

Params Is All You Need: Why We Killed Our 0.8B and What a Red Bull Taught Us

We spent a week and six different approaches making a 0.8B parameter model marginally better. Then a quantized 2B made the entire effort irrelevant in 30 seconds. This is the story of how a Red Bull can broke our model, how quantization saved it, and why parameter count will always beat bit precision for on-device AI.

6 approaches tried · 97.3 MAE (0.8B fp16) · 87.7 MAE (2B 4-bit) · 2.5x the parameters · same memory budget

The Setup: Six Ways to Squeeze a Small Model

It started with a reasonable hypothesis: if our 0.8B parameter model could reach 112.3 MAE on Nutrition5k with basic LoRA fine-tuning, surely with enough engineering we could push it under 90. We had a constrained memory budget -- roughly 1.6GB of on-device RAM for the model -- and the 0.8B at fp16 precision fit perfectly within that envelope.

So we tried everything. Over the course of a week, we ran six distinct approaches to improve the 0.8B model, each more creative than the last.

#	Approach	Calorie MAE	Notes
1	LoRA fine-tune (baseline)	112.3	Rank-16 LoRA on Qwen-0.8B, 4K samples
2	LoRA rank-64 + longer training	108.1	More adapter capacity, 2x epochs
3	Full fine-tune (0.8B)	104.6	All params trainable, same data
4	Data augmentation + curriculum	101.2	Synthetic samples, easy-to-hard schedule
5	Knowledge distillation from 7B	99.1	Teacher-student with soft labels
6	Merged pipeline knowledge	97.3	Best 0.8B result -- but fatally flawed

Six approaches. A week of compute. We went from 112.3 to 97.3 MAE -- a 13% improvement that felt hard-won. The merged knowledge approach from our 5-model pipeline gave us the best number, baking food classification knowledge directly into the model weights before fine-tuning for calorie estimation.
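
The metric behind these numbers is mean absolute error on predicted calories. A minimal sketch of how the score and the 13% figure can be computed follows; the function names and the toy predictions are hypothetical, and only the 112.3 and 97.3 values come from the table above.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and ground truth."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_improvement(baseline_mae, new_mae):
    """Fractional reduction in MAE relative to the baseline."""
    return (baseline_mae - new_mae) / baseline_mae


# Hypothetical calorie predictions vs. labels, for illustration only.
toy_mae = mean_absolute_error([250.0, 410.0, 95.0], [300.0, 400.0, 110.0])

# Figures from the table above: approach 1 (112.3) vs. approach 6 (97.3).
gain = relative_improvement(112.3, 97.3)
print(f"toy MAE: {toy_mae:.1f}, 0.8B gain: {gain:.1%}")
```

A lower MAE only says the average error shrank on the benchmark; it says nothing about which inputs the model still fails on.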

On paper, 97.3 MAE looked great. We were celebrating. Then we tested it on a Red Bull.

The Red Bull Test

"I pointed the model at a Red Bull can on my desk. It told me it was looking at a 'cylindrical yellow-green beverage container' with 340 calories, 45g protein, and 12g fat. A Red Bull has 110 calories, zero protein, and zero fat. The model didn't just get the numbers wrong -- it had no idea what it was looking at."

The merged pipeline approach had given us great benchmark numbers by baking food knowledge into the 0.8B's weights. But the merge had a catastrophic side effect: it destroyed Qwen's pre-trained vision knowledge. The model's ability to see -- to identify brands, read labels, recognize packaging -- was gone.

We ran more tests. The model couldn't identify muffins. It confused ginger with turmeric. It had no concept of portion sizes for packaged foods. Every benchmark improvement we'd fought for over the past week was a mirage -- the model performed well on Nutrition5k's curated test set but catastrophically failed on anything that required real-world visual understanding.

The 0.8B model didn't have enough capacity to hold both food knowledge and general vision understanding. We weren't dealing with a training problem. We were dealing with a capacity problem. And no amount of clever engineering could fix that.

The lesson was brutal: benchmark MAE and real-world performance are not the same thing. A model can score well on a curated test set while being completely useless in production. The Red Bull test became our internal quality gate -- if the model can't identify a branded product on a desk, it doesn't ship.
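
A quality gate like this can be written as a small check that runs alongside the benchmark. The sketch below is one possible shape, not our actual harness: the `Prediction` and `GateCase` types, the keyword match, and the tolerances are all assumptions chosen for illustration. Only the Red Bull reference values and the failing output come from the account above.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    description: str   # free-text description produced by the model
    calories: float
    protein_g: float
    fat_g: float


@dataclass
class GateCase:
    name: str
    keywords: tuple    # at least one must appear in the description
    calories: float
    protein_g: float
    fat_g: float


def passes_gate(pred, case, cal_tol=0.25, gram_tol=2.0):
    """True if the model recognizes the item and its macros are roughly right.

    cal_tol is a relative tolerance on calories; gram_tol is an absolute
    tolerance in grams. Both values are assumed, not taken from the article.
    """
    text = pred.description.lower()
    recognized = any(k.lower() in text for k in case.keywords)
    calories_ok = abs(pred.calories - case.calories) <= cal_tol * case.calories
    protein_ok = abs(pred.protein_g - case.protein_g) <= gram_tol
    fat_ok = abs(pred.fat_g - case.fat_g) <= gram_tol
    return recognized and calories_ok and protein_ok and fat_ok


# Reference values quoted above: 110 calories, 0 g protein, 0 g fat.
RED_BULL = GateCase("red bull can", ("red bull",), 110.0, 0.0, 0.0)

# The failing output described above.
bad = Prediction("cylindrical yellow-green beverage container", 340.0, 45.0, 12.0)
# A hypothetical passing output.
good = Prediction("a can of Red Bull energy drink", 115.0, 0.0, 0.0)

print(passes_gate(bad, RED_BULL), passes_gate(good, RED_BULL))
```

The point of checking recognition separately from the numbers is that a model can land near the right calorie count by accident while still not knowing what it is looking at.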

The Quantization Epiphany

The breakthrough came from frustration, not strategy. After the Red Bull debacle, someone on the team asked a simple question: what if we stopped trying to make the small model smarter and instead made a bigger model smaller?

Our memory budget was ~1.6GB. The 0.8B model at fp16 (16-bit floating point) uses about 1.6GB of RAM. But a 2B model at 4-bit quantization needs only about 1.0GB for its weights and, with runtime overhead, lands at roughly the same 1.6GB. Same memory footprint. 2.5x the parameters.

0.8B at fp16: ~1.6 GB · 2B at 4-bit: ~1.6 GB · Parameter increase: 2.5x · Extra memory: zero

We fine-tuned Qwen-2B on the same Nutrition5k data, quantized to 4-bit, and ran inference. The result arrived in 30 seconds.

87.7 MAE. Not only better than every single one of our six 0.8B approaches -- better, as far as we know, than any published on-device nutrition model. And it passed the Red Bull test. It passed the muffin test. It identified ginger correctly. It could read brand labels. The model had enough capacity to hold everything.

"We spent a week making a 0.8B model marginally better. Then a quantized 2B made the entire effort irrelevant in 30 seconds. I stared at the terminal for a full minute before I could speak."

[Figure: 0.8B fp16 approaches vs. 2B 4-bit, calorie MAE (lower is better)]

Params > Precision: The Math

The intuition is straightforward once you see it. A neural network's capacity -- its ability to represent complex functions -- scales with parameter count. Precision affects the granularity of each individual weight, but the network's expressiveness is fundamentally determined by how many weights it has.
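
To make "granularity of each individual weight" concrete, here is a minimal sketch of symmetric 4-bit quantization, where every weight is snapped to one of 16 integer levels and restored with a single scale. This is a simplified illustration with made-up weights, not the scheme we used; production 4-bit formats typically use per-group scales and other refinements.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization: map floats to integer codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.42, -0.17, 0.03, -0.70, 0.56, 0.08]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.3f}, worst-case error={max_error:.3f}")
```

Each restored weight is off by at most half a quantization step. The network keeps every weight it had, just stored more coarsely, which is why the capacity argument still applies after quantization.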

Consider the memory equation:

Memory = Parameters x Bits per Parameter / 8

0.8B at fp16:     0.8B x 16   / 8 = 1.6 GB
2.0B at 4-bit:    2.0B x 4    / 8 = 1.0 GB   (~1.6 GB with overhead)
3.0B at 2-bit:    3.0B x 2    / 8 = 0.75 GB  (~1.6 GB with overhead)
8.0B at 1.58-bit: 8.0B x 1.58 / 8 = 1.58 GB
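
The same arithmetic as a short script, for checking other configurations. This is a sketch of the raw weight-storage formula only; the function name is hypothetical and runtime overhead is deliberately left out.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Raw weight storage in GB: parameters x bits / 8 (no runtime overhead)."""
    return params_billions * bits_per_param / 8


configs = [
    ("0.8B fp16", 0.8, 16),
    ("2B 4-bit", 2.0, 4),
    ("3B 2-bit", 3.0, 2),
    ("8B 1.58-bit", 8.0, 1.58),
]

for name, params, bits in configs:
    print(f"{name:>12}: {weight_memory_gb(params, bits):.2f} GB")
```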

The math reveals a large opportunity. The number of parameters that fit in a fixed memory budget scales inversely with bit width: halve the bits per weight and, overhead aside, the parameter count doubles. The question is: does reducing precision destroy model quality faster than adding parameters improves it?

The empirical answer, at least for our task, is a resounding no. The 2B 4-bit model outperformed the 0.8B fp16 model by nearly 10 MAE points. The loss from quantization was far smaller than the gain from additional parameters.

Configuration	Params	Precision	Memory	Calorie MAE
0.8B fp16	0.8B	16-bit	~1.6 GB	97.3
2B 4-bit	2.0B	4-bit	~1.6 GB	87.7

This isn't just about our model. This is a general principle that the broader ML community has been converging on. Research from Microsoft (BitNet), Meta (LLaMA quantization studies), and others consistently shows that parameter count is the dominant factor in model capability, and aggressive quantization is a surprisingly efficient way to trade precision for capacity.

Implications for On-Device AI

For anyone building models that need to run on phones, tablets, or edge devices, this insight changes the entire optimization strategy. The conventional approach -- start with a small model and engineer it to be better -- has a hard ceiling. You are fighting against the fundamental capacity of the architecture.

The better approach is to start with the largest model that could conceivably fit in your memory budget at low precision, then work backward to find the optimal quantization level.

The new playbook for on-device models:

1. Define your memory budget (e.g., 1.6 GB)
2. Find the largest base model that fits at aggressive quantization
3. Fine-tune at full precision, then quantize post-training
4. Test on real-world inputs, not just benchmarks
5. Only drop to a smaller model if quantization artifacts are unacceptable
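
Steps 1 and 2 of the playbook can be sketched as a small search over candidate sizes and bit widths. Everything here is illustrative: the function names, the candidate lists, and the flat 0.6 GB overhead (inferred from the gap between 1.0 GB of 4-bit weights and the ~1.6 GB footprint of our 2B model) are assumptions, and real overhead depends on the runtime, context length, and activation memory.

```python
def estimated_footprint_gb(params_billions, bits, overhead_gb):
    """Weight storage plus an assumed flat runtime overhead, in GB."""
    return params_billions * bits / 8 + overhead_gb


def largest_fitting(candidates, bit_widths, budget_gb, overhead_gb):
    """Return (params_billions, bits) for the largest model within budget.

    Among configurations that fit, prefer more parameters first and
    higher precision second. Returns None if nothing fits.
    """
    best = None
    for params in candidates:
        for bits in bit_widths:
            footprint = estimated_footprint_gb(params, bits, overhead_gb)
            # Small tolerance so a footprint equal to the budget still counts.
            if footprint <= budget_gb + 1e-9:
                if best is None or (params, bits) > best:
                    best = (params, bits)
    return best


choice = largest_fitting(
    candidates=[0.8, 2.0, 3.0, 8.0],   # model sizes in billions (hypothetical list)
    bit_widths=[4, 8, 16],
    budget_gb=1.6,
    overhead_gb=0.6,                   # assumed, see note above
)
print(choice)
```

Under these assumptions the search lands on the 2B model at 4 bits. Steps 3 to 5 still apply afterward: the arithmetic only tells you what fits, not whether the quantized model holds up on real inputs.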

This also reframes the value of techniques like LoRA, knowledge distillation, and data augmentation. These approaches are still valuable -- but they should be applied to the largest feasible model, not used to compensate for a model that is fundamentally too small.

We found that the 2B model responded far better to the same training techniques we'd applied to the 0.8B. The larger model had enough representational capacity to actually use the additional training signal, whereas the 0.8B was bottlenecked by its architecture regardless of how good the training data was.

What's Next: The Ternary Frontier

If 4-bit quantization let us fit 2.5x more parameters in the same budget, what happens at even lower bit widths?

This is the question that led us to 1.58-bit ternary quantization -- the BitNet approach, where every weight is one of three values: {-1, 0, +1}. At 1.58 bits per parameter, you can fit roughly 10x more parameters in the same memory as an fp16 model.
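
The 1.58 figure is just log2(3), the information content of a three-valued weight. The sketch below shows what a ternary weight set looks like using absmean scaling (divide by the mean absolute weight, round, clip), in the style described for BitNet b1.58. It is an illustration of the representation with made-up weights, not a training recipe, and not something we have shipped.

```python
import math

# Three possible values per weight carry log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


# Hypothetical weights, for illustration only.
weights = [0.9, -0.05, 0.4, -1.3, 0.1, -0.6]
codes, scale = ternarize(weights)

print(codes)
print(f"bits per weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
print(f"params per fp16 param: {16 / BITS_PER_TERNARY_WEIGHT:.1f}x")
```

Small weights collapse to 0 and everything else keeps only its sign, which is a far coarser approximation than 4-bit and the reason this regime is usually trained with quantization in the loop rather than applied after the fact.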

fp16 (baseline): 1x params · 4-bit quantized: 4x params · 2-bit quantized: 8x params · 1.58-bit (BitNet): ~10x params

A 1.58-bit 8B model fits in the same memory as an fp16 0.8B. If the "params > precision" thesis holds -- and our 4-bit results strongly suggest it does -- then a ternary 8B model should dramatically outperform everything we've built so far, while still running entirely on-device.

We're actively exploring this direction. The challenges are real: ternary quantization requires quantization-aware training (you can't just apply post-training quantization down to 1.58 bits), and hardware support for ternary arithmetic is still maturing. But the potential is enormous.

"The Red Bull moment taught us to stop polishing small models and start thinking about how to pack more parameters into the same space. That insight is now the foundation of everything we're building."

The 0.8B model is dead. Long live the quantized billions.

Doses AI Research · March 2026
