
March 30, 2026

R10b: Predicting Food Weight from a Single Image

R10b is a lightweight model of roughly 200MB that predicts food portion weight from a single image. Combined with accurate food classification and a nutrition database, it achieves 65.3 calorie MAE -- evidence that weight estimation, not classification, is the tractable problem in food nutrition AI.

Calorie MAE: 65.3
Model Size: ~200MB
Training Dishes: 3,125
Error Reduction: 97%

The Insight

"Everyone was trying to make VLMs predict calories directly. We asked: what if the model just predicts weight, and we look up the rest?"

Most food AI research focuses on food classification -- identifying what's on the plate. But in our pipeline experiments, we discovered something counterintuitive: classification was not the bottleneck. Weight estimation was the tractable problem.

When we isolated each component of our 5-model pipeline, R10b with perfect database lookup achieved 65.3 calorie MAE. Meanwhile, database lookup with perfect weight estimation (but using GTE-small for matching) produced 153.1 MAE. The gap was clear: weight estimation was already good enough. The database matching was the real problem.

This insight fundamentally changed our approach. Instead of trying to build a better classifier, we focused on fixing the lookup. A 10-agent swarm rebuilt 21,000 food labels, dropping matching MAE from 153.1 to 58.0. And it led us to ask: if we can predict weight accurately, and we can look up nutrition accurately, why not skip the pipeline entirely and train an end-to-end model? That question produced Boba-2B.

Pipeline Component Isolation

To understand where the pipeline was failing, we tested each component in isolation by replacing every other component with perfect (ground truth) data. This diagnostic technique -- holding all variables constant except one -- revealed the true accuracy of each piece.

Component	Cal MAE	What it proves
R10b weight only (perfect DB)	65.3	Weight estimation works
DB lookup only -- GTE-small (perfect weight)	153.1	GTE matching is terrible
DB lookup only -- our mapping (perfect weight)	58.0	Mapping fixes matching
R10b + our mapping combined	92.9	Pipeline ceiling

Pipeline component isolation: Calorie MAE with all other components held at ground truth
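
The isolation protocol can be sketched as follows. This is an illustrative sketch, not the evaluation code behind the table: the record fields, the function name `calorie_mae`, and the two sample dishes are all hypothetical. Each flag decides whether a component's prediction or its ground truth is used, so exactly one source of error is active at a time.

```python
from statistics import mean

def calorie_mae(dishes, use_pred_weight, use_pred_density):
    """Mean absolute calorie error, with any disabled component held at ground truth."""
    errors = []
    for d in dishes:
        weight = d["pred_weight_g"] if use_pred_weight else d["true_weight_g"]
        density = d["pred_cal_per_100g"] if use_pred_density else d["true_cal_per_100g"]
        predicted = weight * density / 100.0
        actual = d["true_weight_g"] * d["true_cal_per_100g"] / 100.0
        errors.append(abs(predicted - actual))
    return mean(errors)

# Hypothetical records for illustration only (not evaluation data)
dishes = [
    {"true_weight_g": 200, "pred_weight_g": 220,
     "true_cal_per_100g": 100, "pred_cal_per_100g": 150},
    {"true_weight_g": 100, "pred_weight_g": 90,
     "true_cal_per_100g": 300, "pred_cal_per_100g": 300},
]

weight_only = calorie_mae(dishes, use_pred_weight=True, use_pred_density=False)
lookup_only = calorie_mae(dishes, use_pred_weight=False, use_pred_density=True)
combined = calorie_mae(dishes, use_pred_weight=True, use_pred_density=True)
print(weight_only, lookup_only, combined)
```

On these toy records the combined error exceeds either isolated error, which is the pattern an isolation test is designed to expose.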

The most striking finding: GTE-small's database matching (153.1 MAE) was worse than the full end-to-end pipeline (125.3 MAE). The matching step was actively destroying information from other components. It matched "egg whites" to "egg, white, dried" (382 cal/100g) instead of "egg, white, raw" (52 cal/100g) -- semantically similar, nutritionally catastrophic.
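
To make the cost of that mismatch concrete, here is a small worked calculation. The two densities are the ones quoted above; the 100 g portion size is a hypothetical value chosen for illustration.

```python
def calories(cal_per_100g, weight_g):
    """Calories for a portion, given caloric density per 100 g."""
    return cal_per_100g * weight_g / 100.0

RAW = 52     # "egg, white, raw", cal/100g (quoted in the article)
DRIED = 382  # "egg, white, dried", cal/100g (quoted in the article)

portion_g = 100  # hypothetical portion size
correct = calories(RAW, portion_g)
mismatched = calories(DRIED, portion_g)

error = mismatched - correct   # absolute calorie error for this portion
ratio = mismatched / correct   # overestimation factor, independent of portion size
print(error, round(ratio, 1))
```

Even with a perfectly predicted weight, the wrong database row inflates the estimate by roughly a factor of seven.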

Density-Weighted Splitting

R10b's most important contribution was enabling density-weighted calorie splitting for compound dishes. The problem: when a model identifies "pizza with vegetables" as a single item, the entire plate weight gets assigned to a single caloric density. For a 415-gram plate where pizza is only 55g (13% by weight), this produces catastrophic errors.

The solution: split compound labels into individual items and distribute weight inversely proportional to caloric density. High-calorie-density foods (pizza, cheese) get less weight; low-density foods (vegetables, salad) get more. This matches physical reality -- a plate of salad with a small slice of pizza has far more salad by weight.
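
A minimal sketch of the splitting rule, assuming each item's share of plate weight is proportional to the reciprocal of its caloric density. The function names and the two densities below are hypothetical placeholders, not values from the nutrition database; only the 415 g plate weight comes from the example above.

```python
def split_weight_by_inverse_density(total_weight_g, cal_per_100g):
    """Split a plate weight across items, inversely proportional to caloric density."""
    if any(d <= 0 for d in cal_per_100g.values()):
        raise ValueError("caloric densities must be positive")
    inverse = {name: 1.0 / d for name, d in cal_per_100g.items()}
    norm = sum(inverse.values())
    return {name: total_weight_g * inv / norm for name, inv in inverse.items()}

def total_calories(weights_g, cal_per_100g):
    """Sum per-item calories from per-item weights and densities."""
    return sum(weights_g[name] * cal_per_100g[name] / 100.0 for name in weights_g)

# Hypothetical densities (cal/100g), for illustration only
densities = {"pizza": 300.0, "vegetables": 100.0}

weights = split_weight_by_inverse_density(415.0, densities)
plate_calories = total_calories(weights, densities)
print(weights, plate_calories)
```

One property of this rule worth noting: because weight times density is constant across items, every item ends up contributing the same number of calories. It is a prior about how plates are composed, not a measurement.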

Proof: dish_1568061272

A compound dish labelled "Vegetable Pizza" with ground truth of 380 calories. The results speak for themselves:

Method	Predicted Cal	Error
Compound "Vegetable Pizza"	1,980	1,600
Split + equal weight	632	252
Split + density-weighted	337	43
Ground truth	380	0

Error reduction through splitting methods. From 1,600 calories of error to 43 -- a 97% improvement.

Inverse caloric density weighting put pizza at 16% of plate weight, close to the ground truth of 13%. The error dropped from 1,600 calories to 43, a 97% reduction from a single algorithmic change, with no model retraining required.
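
The error and reduction figures can be recomputed directly from the table. The snippet below uses only the predicted and ground-truth calories reported above; the variable names are illustrative.

```python
GROUND_TRUTH = 380  # calories, from the table

predictions = {
    'Compound "Vegetable Pizza"': 1980,
    "Split + equal weight": 632,
    "Split + density-weighted": 337,
}

# Absolute calorie error per method
errors = {method: abs(pred - GROUND_TRUTH) for method, pred in predictions.items()}

# Fractional error reduction relative to the compound-label baseline
baseline = errors['Compound "Vegetable Pizza"']
reduction = {method: 1 - err / baseline for method, err in errors.items()}

for method in predictions:
    print(f"{method}: error={errors[method]}, reduction={reduction[method]:.1%}")
```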

Food Segmentation Pipeline

Food segmentation mask generated by our pipeline. Individual food items are isolated for independent weight estimation by R10b.

Architecture

Backbone 1: CurveNet
Backbone 2: ResNet50
Extra Input: bbox_ratio
Training Data: 3,125 dishes

R10b combines two complementary feature extractors. CurveNet learns food shape and volume cues -- the 3D structure of a pile of rice, the curvature of a muffin, the flatness of a pizza slice. ResNet50 captures texture and food-type features -- the grain of bread, the gloss of a sauce, the color gradients of grilled meat.

A bounding box ratio input encodes the relative size of the food within the frame. Without this, the model has no scale reference -- a close-up of a grape and a wide shot of a watermelon could look identical in terms of shape and texture. The bbox_ratio breaks this ambiguity by encoding how much of the image the food occupies.
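
One plausible way to compute such a feature is the fraction of image area covered by the food's bounding box. The exact definition R10b uses is not specified here, so the formula, the function signature, and the pixel coordinates below are assumptions for illustration.

```python
def bbox_ratio(bbox, image_width, image_height):
    """Fraction of the image area covered by bbox = (x_min, y_min, x_max, y_max), in pixels.

    Assumed definition: clipped box area divided by image area.
    """
    x_min, y_min, x_max, y_max = bbox
    # Clip the box to the image bounds
    x_min, x_max = max(0, x_min), min(image_width, x_max)
    y_min, y_max = max(0, y_min), min(image_height, y_max)
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / (image_width * image_height)

# Hypothetical boxes on a 640x480 image
fills_frame = bbox_ratio((0, 0, 640, 480), 640, 480)
quarter_frame = bbox_ratio((160, 120, 480, 360), 640, 480)
print(fills_frame, quarter_frame)
```

Two crops that look identical after resizing would still carry different ratios, which is the scale cue the paragraph above describes.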

The model was trained on a combined dataset of 3,125 MFP3D and Nutrition5k dishes with ground-truth weights, evaluated on a held-out set of 760 dishes. It predicts total plate weight in grams. When combined with food classification and a nutrition database, this weight prediction enables per-item calorie estimation: identify the food, look up its caloric density (calories per 100g), multiply by the predicted weight.
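
The lookup-and-multiply step reduces to a few lines. This is a sketch under stated assumptions: `estimate_calories` and the dictionary layout are hypothetical, the single density is the raw egg white value quoted earlier in this article, and the 150 g weight is a made-up prediction.

```python
def estimate_calories(items, cal_per_100g):
    """Sum calories over items, where items is a list of (database label, weight in grams)."""
    total = 0.0
    for label, weight_g in items:
        if label not in cal_per_100g:
            # Fail loudly rather than silently falling back to a similar-looking entry
            raise KeyError(f"no nutrition entry for {label!r}")
        total += cal_per_100g[label] * weight_g / 100.0
    return total

nutrition_db = {"egg, white, raw": 52.0}  # cal/100g, value quoted in the article
dish = [("egg, white, raw", 150.0)]       # hypothetical predicted weight in grams

dish_calories = estimate_calories(dish, nutrition_db)
print(dish_calories)
```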

The model is compact at roughly 200MB, fast enough for real-time inference, and runs comfortably on-device. It processes the fit-to-frame cropped image with masking to isolate the food region from background clutter.
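
The preprocessing can be pictured with a toy example. The sketch below interprets "fit-to-frame" as cropping to the mask's bounding box, which is an assumption; resizing to the network's input size is omitted, and the function name and nested-list image format are illustrative only.

```python
def mask_and_crop(image, mask):
    """Zero out background pixels, then crop to the bounding box of the mask.

    image and mask are equal-sized 2D lists; mask entries are 0 or 1.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        raise ValueError("mask contains no food pixels")
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [
        [image[r][c] if mask[r][c] else 0 for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]

# Toy 4x4 grayscale image: 9 = background clutter, other values = food
image = [
    [9, 9, 9, 9],
    [9, 5, 6, 9],
    [9, 7, 8, 9],
    [9, 9, 9, 9],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

cropped = mask_and_crop(image, mask)
print(cropped)
```

Pixels inside the bounding box but outside the mask are zeroed, so background that happens to fall within the crop does not reach the model.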

What R10b Taught Us

R10b proved that weight estimation from a single image is a solvable problem. A 65.3 calorie MAE, measured with perfect database lookup, is precise enough to produce useful calorie estimates when combined with accurate food identification.

More importantly, R10b was the diagnostic tool that revealed where our pipeline was actually failing. Without component isolation testing, we would have blamed the VLM or the weight model for the pipeline's poor performance. Instead, we discovered that the database matching layer -- the simplest component, the one we assumed was "good enough" -- was the catastrophic bottleneck.

Ultimately, the Boba-2B model achieved better end-to-end results (87.7 calorie MAE) by learning to estimate calories directly, bypassing the pipeline entirely. But R10b was instrumental in proving which components worked and which were broken -- diagnostics that led directly to the agent swarm mapping fix and the final direct estimation approach.

"R10b didn't end up in the production pipeline. But without R10b, we never would have found the pipeline's real bottleneck, and Boba would not exist."

Doses AI Research · March 2026
