March 31, 2026 · 8 min read

Boba: On-Device Food Nutrition Estimation at Scale

We present Boba-2B, a 2-billion-parameter vision-language model for on-device food nutrition estimation. It reaches a calorie MAE of 87.7 on Nutrition5k, making it the first on-device model under 90 MAE. The goal is accuracy competitive with cloud-based systems such as GPT-4o, although head-to-head evaluations against those systems are still pending. Inference runs entirely on-device, with no cloud dependency.

Calorie MAE: 87.7 · Model size: 2B parameters · Inference: on-device · HF downloads: 438+

Interactive food segmentation visualization. Individual food items are isolated for independent nutrition estimation by Boba-2B.

The Challenge

On-device nutrition estimation is hard for a fundamental reason: the models that are accurate enough to be useful are too large to run on a phone, and the models small enough to deploy on-device lack the capacity to generalize across cuisines, brands, and preparation methods.

Before this work, we found no published Nutrition5k results for an on-device model. State-of-the-art results relied on cloud APIs (GPT-4V, GPT-4o) or on models large enough to need server infrastructure. Neither option works for a privacy-first nutrition app in which photos never leave the device.

We needed a model under 2GB that could match cloud-based systems on accuracy while running entirely on an iPhone. Boba-2B is that model.

Our Journey

The path to 87.7 MAE was not linear. We explored four distinct approaches over three weeks, each teaching us something that led to the final architecture.

Model iteration progression: Calorie MAE over development approaches

0.8B LoRA baseline (112.3 MAE). We started with a Qwen3.5-0.8B LoRA fine-tune on 4,051 Nutrition5k samples. Promising for a first attempt, but the model struggled with compound meals and had no brand recognition.

5-model pipeline (125.3 MAE). We decomposed the problem into five stages: a VLM for classification, SegFormer for segmentation, GTE-small for database matching, R10b for weight estimation, and a USDA lookup for calories. The pipeline was trained on 298K food images, roughly 74x the 4,051 samples used for direct estimation, yet it scored worse because errors compound across the five stages.

Merged knowledge + direct estimation (97.3 MAE). We merged the pipeline model's food knowledge into the base weights to create sift1-0.8B, then fine-tuned for direct calorie estimation. This was our first sub-1B model to score below 100 MAE. In real-world testing, however, the merge turned out to have destroyed Qwen's base vision knowledge: the model could no longer identify muffins, Red Bull, or ginger.

2B full fine-tune (87.7 MAE). The breakthrough was to stop optimizing a small model when a bigger one fits in the same deployment budget. We started from the base Qwen3.5-2B (no merge, no baked-in pipeline knowledge) and ran a full fine-tune on Nutrition5k. With substantially more capacity than the 0.8B model, the 2B model reached 87.7 MAE, the first on-device result under 90 MAE on Nutrition5k.

State-of-the-Art Comparison

Boba-2B achieves 87.7 calorie MAE on Nutrition5k -- the first on-device model under 90 MAE. Frontier model evaluations are pending.

Model	Cal MAE	On-Device?	Size
GPT-4o	—	No	Cloud
Gemini Pro	—	No	Cloud
Claude Opus	—	No	Cloud
GPT-4V	—	No	Cloud
Boba-2B	87.7	Yes	2B
Boba-0.8B	97.3	Yes	0.8B

Calorie MAE comparison across models (lower is better). Boba highlighted in purple.

Boba-2B reaches its 87.7 calorie MAE with zero cloud dependency, since inference runs entirely on-device. Until the frontier model evaluations (GPT-4o, Gemini Pro, Claude Opus, GPT-4V) are complete, the cloud rows in the table stay empty and no direct accuracy comparison can be drawn.
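For readers less familiar with the metric, calorie MAE is the mean absolute difference between predicted and ground-truth calories. The sketch below assumes the error is taken over per-dish totals in kcal; the exact evaluation protocol is not spelled out here, and the function name and sample values are illustrative only.

```python
from typing import Sequence


def calorie_mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error, in kcal, between predicted and true dish calories."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one dish")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up values for illustration only (not Nutrition5k data).
predicted_kcal = [250.0, 400.0, 120.0]
actual_kcal = [300.0, 380.0, 150.0]
print(calorie_mae(predicted_kcal, actual_kcal))
```

Lower is better: an MAE of 87.7 means predictions are off by 87.7 kcal per dish on average under this definition.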

Architecture

Base model: Qwen3.5-2B · Training: full fine-tune · Vision encoder: 331M (frozen) · LLM params: 1,882M (trained)

Base model: Qwen3.5-2B VLM. Full fine-tune (not LoRA) -- all 1,882M LLM parameters trained, 331M vision encoder parameters frozen. This preserves pre-trained visual features while teaching the language model to reason about nutrition.

Training data: 4,051 Nutrition5k samples with ground-truth per-ingredient calories. Each sample includes a photo and structured JSON with item-level nutrition: name, portion weight in grams, calories, protein, carbs, and fat.
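To make the target format concrete, here is a minimal sketch of how one training label could be serialized. The key names (`items`, `weight_g`, `protein_g`, and so on) and the sample values are assumptions for illustration; the fields are listed above, but the actual JSON schema is not.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FoodItem:
    # Hypothetical key names: the source lists these fields but not how they are spelled.
    name: str
    weight_g: float
    calories: float
    protein_g: float
    carbs_g: float
    fat_g: float


def build_target(items: List[FoodItem]) -> str:
    """Serialize item-level nutrition into a JSON string usable as a training target."""
    return json.dumps({"items": [asdict(item) for item in items]})


# Illustrative values only, not taken from Nutrition5k.
sample = build_target([
    FoodItem("scrambled eggs", 120.0, 178.0, 12.0, 2.0, 13.0),
    FoodItem("toast", 30.0, 80.0, 3.0, 15.0, 1.0),
])
total_kcal = sum(item["calories"] for item in json.loads(sample)["items"])
print(sample)
print(total_kcal)
```

Keeping the label item-level means a dish total can always be recovered by summing, while per-item errors stay inspectable.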

Training config: Batch size 32 (2 per GPU x 8 GPUs x 2 gradient accumulation), learning rate 2e-5 with cosine schedule, 10 epochs, 1,260 total steps. Trained on 8x NVIDIA H100 80GB GPUs.
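The step count follows from the numbers above, as the short check below shows. Two things are assumptions rather than stated facts: that the last partial batch of each epoch is dropped (which is what makes the total come out to exactly 1,260), and that the cosine schedule decays to zero with no warmup.

```python
import math

per_gpu_batch = 2
num_gpus = 8
grad_accum_steps = 2
num_samples = 4051
epochs = 10

effective_batch = per_gpu_batch * num_gpus * grad_accum_steps  # 2 * 8 * 2 = 32
# Assumption: the final partial batch of each epoch is dropped (floor division).
steps_per_epoch = num_samples // effective_batch  # 4051 // 32 = 126
total_steps = steps_per_epoch * epochs  # 126 * 10 = 1260


def cosine_lr(step: int, total: int, peak_lr: float = 2e-5) -> float:
    """Cosine decay from peak_lr to 0, assuming no warmup phase."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total))


print(effective_batch, steps_per_epoch, total_steps)
print(cosine_lr(0, total_steps), cosine_lr(total_steps // 2, total_steps))
```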

Deployment: The 2B model runs on iPhone via llama.cpp with no cloud dependency.
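As a back-of-the-envelope check on the deployment budget, the sketch below estimates the size of the weights alone at a few bit widths. The quantization level actually shipped is not stated, so the bit widths are assumptions, and the estimate ignores file metadata, the KV cache, and any layers kept at higher precision.

```python
# Parameter counts from the architecture description: LLM + vision encoder.
TOTAL_PARAMS = 1_882_000_000 + 331_000_000


def weight_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical bit widths; the source does not say which one is used.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_size_gb(TOTAL_PARAMS, bits):.2f} GB")
```

Under these assumptions, 16-bit and 8-bit weights would both exceed 2 GB, so a sub-8-bit quantization would be needed to fit that budget.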

Key Insight

"The biggest breakthrough wasn't a better model -- it was discovering that our embedding lookup was silently corrupting 40% of our training signal. GTE-small matched 'egg whites' to 'egg, white, dried' (382 cal/100g) instead of 'egg, white, raw' (52 cal/100g). Once we isolated and fixed the lookup, everything else fell into place."

This discovery came from component isolation testing. When we gave R10b perfect database lookup, it achieved 65.3 calorie MAE. When we gave perfect weight estimation but used GTE-small for matching, the MAE was 153.1. The matching step was not just bad -- it was actively destroying information from other pipeline components. A 10-agent swarm rebuilt 21,000 food classification labels overnight, reducing matching MAE from 153.1 to 58.0.
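The egg-white example shows why a wrong database match is so costly even when weight estimation is perfect. The sketch below assumes the quoted densities are kcal per 100 g and uses a hypothetical 100 g portion; the function name is illustrative.

```python
def calories_from_lookup(weight_g: float, kcal_per_100g: float) -> float:
    """Calories implied by a portion weight and a database energy density."""
    return weight_g * kcal_per_100g / 100.0


weight_g = 100.0  # hypothetical portion, assumed to be estimated perfectly
correct_kcal = calories_from_lookup(weight_g, 52.0)   # 'egg, white, raw'
wrong_kcal = calories_from_lookup(weight_g, 382.0)    # 'egg, white, dried'
error_kcal = abs(wrong_kcal - correct_kcal)

print(correct_kcal, wrong_kcal, error_kcal)
```

With the weight held fixed, the mismatched entry inflates the estimate by a factor of more than seven, which is consistent with matching errors dominating everything else in the pipeline.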

Model Variants

We evaluated multiple model sizes. The 2B model achieves 87.7 calorie MAE -- the first on-device model under 90 MAE on Nutrition5k. The 0.8B variant trades accuracy for a smaller footprint.

Variant	MAE	Size	Runs on iPhone?
Boba-2B	87.7	2B	Yes
Boba-0.8B	97.3	0.8B	Yes

For a production nutrition app, the 2B model offers the best trade-off we found between accuracy and deployability: it is 9.6 MAE more accurate than the 0.8B variant while still providing stable, reliable on-device inference.

What's Next

Open-sourcing Boba-0.8B. We are releasing the 0.8B model weights publicly on HuggingFace. While the 2B model remains private for our production app, the 0.8B demonstrates that on-device food nutrition estimation is possible and reproducible.

Shipping in Sift. Boba-2B is the production model for Sift, our privacy-first nutrition tracking app. Users point their camera at food, and the model estimates per-item calories, protein, carbs, and fat in structured JSON -- entirely on-device, with zero cloud dependency.

Multi-cuisine expansion. Nutrition5k skews Western. We are curating additional training data from Indian, Middle Eastern, and East Asian cuisines to reduce systematic bias in under-represented food categories.

Parser improvements. About 2% of dishes (10 of 506) fail to parse because complex multi-item responses exceed the token window. We expect constrained decoding and smarter stop criteria to close this gap.
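A truncated response is the typical failure mode here: the JSON is cut off mid-object and no longer parses. The sketch below shows one way to detect and count such failures; the `items` key and both helper functions are hypothetical, not the production parser.

```python
import json
from typing import List, Optional


def parse_items(response: str) -> Optional[list]:
    """Return the item list from a model response, or None if it does not parse."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None
    items = data.get("items") if isinstance(data, dict) else None
    return items if isinstance(items, list) else None


def parse_failure_rate(responses: List[str]) -> float:
    """Fraction of responses that fail to parse."""
    failures = sum(1 for r in responses if parse_items(r) is None)
    return failures / len(responses)


complete = '{"items": [{"name": "rice", "calories": 200}]}'
# Cut off mid-string, as happens when generation hits the token limit.
truncated = '{"items": [{"name": "rice", "calories": 200}, {"name": "chi'

print(parse_items(complete))
print(parse_items(truncated))
print(parse_failure_rate([complete] * 496 + [truncated] * 10))
```

Tracking this rate separately from MAE makes it clear whether a change improves the estimates themselves or only how often they can be read.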

Doses AI Research · March 2026
