Boba: On-Device Food Nutrition Estimation at Scale
We present Boba-2B, a 2-billion-parameter vision-language model for on-device food nutrition estimation. At 87.7 calorie MAE on Nutrition5k, it is the first on-device model under 90 MAE. It is designed to compete with cloud-based systems such as GPT-4o, whose Nutrition5k evaluations are still pending, while running inference entirely on-device with no cloud dependency.
Interactive food segmentation visualization. Individual food items are isolated for independent nutrition estimation by Boba-2B.
The Challenge
On-device nutrition estimation is hard for a fundamental reason: the models that are accurate enough to be useful are too large to run on a phone, and the models small enough to deploy on-device lack the capacity to generalize across cuisines, brands, and preparation methods.
Before this work, no published on-device model reported Nutrition5k results. State-of-the-art systems relied on cloud APIs (GPT-4V, GPT-4o) or on large models that demand server infrastructure. For a privacy-first nutrition app where photos never leave the device, none of these options works.
We needed a model under 2GB that could match cloud-based systems on accuracy while running entirely on an iPhone. Boba-2B is that model.
Our Journey
The path to 87.7 MAE was not linear. We explored four distinct approaches over three weeks, each teaching us something that led to the final architecture.
0.8B LoRA baseline (112.3 MAE). We started with a Qwen3.5-0.8B LoRA fine-tune on 4,051 Nutrition5k samples. Promising for a first attempt, but the model struggled with compound meals and had no brand recognition.
5-model pipeline (125.3 MAE). We decomposed the problem: a VLM for classification, SegFormer for segmentation, GTE-small for database matching, R10b for weight estimation, and a USDA lookup for calories. The pipeline was trained on 298K food images, yet it scored worse than direct estimation trained on roughly 75x less data (4,051 samples), because errors compounded across the five steps.
Merged knowledge + direct estimation (97.3 MAE). We merged the pipeline model's food knowledge into the base weights to create sift1-0.8B, then fine-tuned for direct calorie estimation. This broke the sub-100 barrier for the first time on a sub-1B model. But in real-world testing, the merge had destroyed Qwen's base vision knowledge -- the model couldn't identify muffins, Red Bull, or ginger.
2B full fine-tune (87.7 MAE). The breakthrough: don't optimize a small model when a bigger model fits in the same deployment budget. We trained Qwen3.5-2B from the base checkpoint (no merge, no baked-in pipeline knowledge) with a full fine-tune on Nutrition5k. With far more capacity than the 0.8B model, the 2B model reached 87.7 MAE, making it the first on-device model under 90 MAE on Nutrition5k.
State-of-the-Art Comparison
Boba-2B achieves 87.7 calorie MAE on Nutrition5k -- the first on-device model under 90 MAE. Frontier model evaluations are pending.
| Model | Calorie MAE (Nutrition5k) | On-Device? | Size |
|---|---|---|---|
| GPT-4o | Pending | No | Cloud |
| Gemini Pro | Pending | No | Cloud |
| Claude Opus | Pending | No | Cloud |
| GPT-4V | Pending | No | Cloud |
| Boba-2B | 87.7 | Yes | 2B |
| Boba-0.8B | 97.3 | Yes | 0.8B |
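For reference, calorie MAE is the mean absolute difference between predicted and ground-truth calories per dish. The sketch below shows the standard definition of the metric; the function name and the sample values are illustrative assumptions, not Nutrition5k data or the evaluation script behind the table.

```python
def mean_absolute_error(predicted, actual):
    """Mean absolute error between two equal-length sequences of calorie values."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical per-dish calorie totals (kcal), invented for illustration.
predicted_kcal = [420.0, 310.0, 150.0]
actual_kcal = [400.0, 350.0, 145.0]

mae = mean_absolute_error(predicted_kcal, actual_kcal)
print(f"calorie MAE: {mae:.1f} kcal")
```

Lower is better: an MAE of 87.7 means the estimate is off by 87.7 calories per dish on average.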
Architecture
Base model: Qwen3.5-2B VLM. Full fine-tune (not LoRA) -- all 1,882M LLM parameters trained, 331M vision encoder parameters frozen. This preserves pre-trained visual features while teaching the language model to reason about nutrition.
Training data: 4,051 Nutrition5k samples with ground-truth per-ingredient calories. Each sample includes a photo and structured JSON with item-level nutrition: name, portion weight in grams, calories, protein, carbs, and fat.
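To make the structured target concrete, here is a minimal sketch of what one item-level record might look like and how a response could be validated. The key names (`items`, `weight_g`, and so on) and the nutrition values are assumptions for illustration; the source lists the fields but not their exact schema.

```python
import json

# Hypothetical key names: the fields come from the text, the spelling is assumed.
REQUIRED_FIELDS = ("name", "weight_g", "calories", "protein", "carbs", "fat")


def parse_items(raw):
    """Parse a JSON response and return its list of item dicts, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    items = data.get("items") if isinstance(data, dict) else None
    if not isinstance(items, list) or not items:
        raise ValueError("expected a non-empty 'items' list")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each item must be a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise ValueError(f"item missing fields: {missing}")
    return items


def total_calories(items):
    """Sum item-level calories into a dish-level total."""
    return sum(float(item["calories"]) for item in items)


# Illustrative record, not taken from Nutrition5k.
example = json.dumps({"items": [
    {"name": "rice", "weight_g": 150, "calories": 195,
     "protein": 4, "carbs": 42, "fat": 0.5},
    {"name": "chicken breast", "weight_g": 100, "calories": 165,
     "protein": 31, "carbs": 0, "fat": 3.6},
]})

items = parse_items(example)
dish_kcal = total_calories(items)
```

Summing item-level calories gives the dish-level total that a calorie MAE would be computed against.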
Training config: Batch size 32 (2 per GPU x 8 GPUs x 2 gradient accumulation), learning rate 2e-5 with cosine schedule, 10 epochs, 1,260 total steps. Trained on 8x NVIDIA H100 80GB GPUs.
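The step count follows from the numbers above. The sketch below reproduces the arithmetic and a plain cosine decay; it assumes the final partial batch of each epoch is dropped (which yields exactly 1,260 steps) and that there is no warmup or non-zero floor, neither of which the source states.

```python
import math

NUM_SAMPLES = 4051
PER_GPU_BATCH, NUM_GPUS, GRAD_ACCUM = 2, 8, 2
EPOCHS = 10
PEAK_LR = 2e-5

effective_batch = PER_GPU_BATCH * NUM_GPUS * GRAD_ACCUM
# Assumption: the last partial batch of each epoch is dropped.
steps_per_epoch = NUM_SAMPLES // effective_batch
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step, total=total_steps, peak=PEAK_LR, floor=0.0):
    """Cosine decay from `peak` at step 0 to `floor` at step `total` (no warmup assumed)."""
    progress = min(max(step / total, 0.0), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps)
print(cosine_lr(0), cosine_lr(total_steps // 2), cosine_lr(total_steps))
```

Under these assumptions the learning rate starts at 2e-5, passes through half that value at the midpoint, and decays to zero at the final step.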
Deployment: The 2B model runs on iPhone via llama.cpp with no cloud dependency.
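As a rough back-of-the-envelope check on the deployment budget, weight storage scales with parameter count times bits per weight. The sketch below uses the parameter counts given above; the bit widths are hypothetical, since the source does not say which quantization is used, and the estimate ignores per-block quantization overhead and any runtime memory such as the KV cache.

```python
LLM_PARAMS = 1_882_000_000     # trained language-model parameters (from the text)
VISION_PARAMS = 331_000_000    # frozen vision-encoder parameters (from the text)
TOTAL_PARAMS = LLM_PARAMS + VISION_PARAMS


def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes, ignoring quantization overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical precisions, for illustration only.
for bits in (16, 8, 4):
    print(f"{bits:>2} bits/weight: {weight_footprint_gb(TOTAL_PARAMS, bits):.2f} GB")
```

Under these assumptions, a budget below 2GB implies storing weights at fewer than 8 bits each on average.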
Key Insight
Component isolation testing showed that database matching, not weight estimation, was the pipeline's weakest link. When we gave R10b perfect database lookup, it achieved 65.3 calorie MAE. When we supplied perfect weight estimation but used GTE-small for matching, the MAE was 153.1. The matching step was not just bad -- it was actively destroying information from other pipeline components. A 10-agent swarm rebuilt 21,000 food classification labels overnight, reducing matching MAE from 153.1 to 58.0.
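Component isolation can be sketched as oracle substitution: replace one stage's output with ground truth and measure the error that remains from the other stages. The toy pipeline below (calories = weight x energy density) and its records are invented for illustration; it is not the actual R10b or GTE-small code, and the names are hypothetical.

```python
def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def pipeline_calories(weight_g, kcal_per_g):
    """Final stage of the toy pipeline: portion weight times energy density."""
    return weight_g * kcal_per_g


# Toy records: ground truth and each component's prediction for one food item.
# "d" is the energy density (kcal/g) returned by the database match.
records = [
    {"true_w": 100.0, "pred_w": 110.0, "true_d": 1.3, "pred_d": 1.3},
    {"true_w": 200.0, "pred_w": 180.0, "true_d": 0.5, "pred_d": 2.0},  # bad match
    {"true_w": 50.0, "pred_w": 55.0, "true_d": 4.0, "pred_d": 4.0},
]


def isolate(records, oracle_weight=False, oracle_density=False):
    """MAE of the pipeline with the selected components replaced by ground truth."""
    preds, truths = [], []
    for r in records:
        w = r["true_w"] if oracle_weight else r["pred_w"]
        d = r["true_d"] if oracle_density else r["pred_d"]
        preds.append(pipeline_calories(w, d))
        truths.append(pipeline_calories(r["true_w"], r["true_d"]))
    return mae(preds, truths)


weight_only_error = isolate(records, oracle_density=True)    # matching is perfect
matching_only_error = isolate(records, oracle_weight=True)   # weights are perfect
end_to_end_error = isolate(records)
```

In this toy data a single bad match dominates the error, mirroring the pattern reported above: the run with perfect weights but real matching scores far worse than the run with perfect matching.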
Model Variants
We evaluated multiple model sizes. The 2B model achieves 87.7 calorie MAE on Nutrition5k, while the 0.8B variant trades accuracy (97.3 MAE) for a smaller footprint.
| Variant | MAE | Size | Runs on iPhone? |
|---|---|---|---|
| Boba-2B | 87.7 | 2B | Yes |
| Boba-0.8B | 97.3 | 0.8B | Yes |
The 2B model pairs the lower error with stable, reliable on-device inference. For a production nutrition app, it offers the best trade-off between accuracy and deployability among the variants we evaluated.
What's Next
Open-sourcing Boba-0.8B. We are releasing the 0.8B model weights publicly on Hugging Face. While the 2B model remains private for our production app, the 0.8B model demonstrates that on-device food nutrition estimation is possible and reproducible.
Shipping in Sift. Boba-2B is the production model for Sift, our privacy-first nutrition tracking app. Users point their camera at food, and the model estimates per-item calories, protein, carbs, and fat in structured JSON -- entirely on-device, with zero cloud dependency.
Multi-cuisine expansion. Nutrition5k skews Western. We are curating additional training data from Indian, Middle Eastern, and East Asian cuisines to reduce systematic bias in under-represented food categories.
Parser improvements. 2% of dishes (10/506) fail to parse due to complex multi-item responses exceeding the token window. Constrained decoding and smarter stop criteria will close this gap.