R10b: Predicting Food Weight from a Single Image
R10b is a lightweight model (roughly 200MB) that predicts food portion weight from a single image. With a perfect nutrition database lookup, its weight predictions yield a calorie MAE of 65.3 -- evidence that weight estimation, not classification, is the tractable problem in food nutrition AI.
The Insight
Most food AI research focuses on food classification -- identifying what's on the plate. But in our pipeline experiments, we discovered something counterintuitive: classification was not the bottleneck. Weight estimation was the tractable problem.
When we isolated each component of our 5-model pipeline, R10b with perfect database lookup achieved 65.3 calorie MAE. Meanwhile, database lookup with perfect weight estimation (but using GTE-small for matching) produced 153.1 MAE. The gap was clear: weight estimation was already good enough. The database matching was the real problem.
This insight fundamentally changed our approach. Instead of trying to build a better classifier, we focused on fixing the lookup. A 10-agent swarm rebuilt 21,000 food labels, dropping matching MAE from 153.1 to 58.0. And it led us to ask: if we can predict weight accurately, and we can look up nutrition accurately, why not skip the pipeline entirely and train an end-to-end model? That question produced Boba-2B.
Pipeline Component Isolation
To understand where the pipeline was failing, we tested each component in isolation by replacing every other component with perfect (ground truth) data. This diagnostic technique -- holding all variables constant except one -- revealed the true accuracy of each piece.
| Component | Cal MAE | What it proves |
|---|---|---|
| R10b weight only (perfect DB) | 65.3 | Weight estimation works |
| DB lookup only -- GTE-small (perfect weight) | 153.1 | GTE matching is terrible |
| DB lookup only -- our mapping (perfect weight) | 58.0 | Mapping fixes matching |
| R10b + our mapping combined | 92.9 | Pipeline ceiling |
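The isolation procedure above can be sketched in a few lines. This is a minimal illustration, not the project's evaluation code: the field names (`true_weight_g`, `pred_kcal_per_100g`, and so on) and the two toy dishes are assumptions made up for the example, and the real pipeline has more stages than the two shown here.

```python
from statistics import mean


def calories(weight_g, kcal_per_100g):
    """Calories for a portion, given its weight and caloric density."""
    return weight_g * kcal_per_100g / 100.0


def calorie_mae(dishes, weight_key, density_key):
    """Mean absolute calorie error when weight and density are read
    from the given fields. Passing a 'true_*' field for one of them
    replaces that component with ground truth, isolating the other."""
    errors = []
    for d in dishes:
        truth = calories(d["true_weight_g"], d["true_kcal_per_100g"])
        estimate = calories(d[weight_key], d[density_key])
        errors.append(abs(estimate - truth))
    return mean(errors)


# Toy data with made-up numbers (not from the R10b evaluation set).
dishes = [
    {"true_weight_g": 200.0, "true_kcal_per_100g": 150.0,
     "pred_weight_g": 220.0, "pred_kcal_per_100g": 170.0},
    {"true_weight_g": 100.0, "true_kcal_per_100g": 300.0,
     "pred_weight_g": 90.0, "pred_kcal_per_100g": 300.0},
]

# Weight model only: predicted weight, ground-truth density.
weight_only = calorie_mae(dishes, "pred_weight_g", "true_kcal_per_100g")
# Lookup only: ground-truth weight, predicted density.
lookup_only = calorie_mae(dishes, "true_weight_g", "pred_kcal_per_100g")
# Both components predicted.
combined = calorie_mae(dishes, "pred_weight_g", "pred_kcal_per_100g")
```

Swapping one field at a time is what makes the comparison fair: each score is computed on the same dishes with the same metric, so any difference comes from the one component left unreplaced.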
Density-Weighted Splitting
R10b's most important contribution was enabling density-weighted calorie splitting for compound dishes. The problem: when a model identifies "pizza with vegetables" as a single item, the entire plate weight gets assigned to a single caloric density. For a 415-gram plate where pizza is only 55g (13% by weight), this produces catastrophic errors.
The solution: split compound labels into individual items and distribute weight inversely proportional to caloric density. High-calorie-density foods (pizza, cheese) get less weight; low-density foods (vegetables, salad) get more. This matches physical reality -- a plate of salad with a small slice of pizza has far more salad by weight.
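One way to read "inversely proportional to caloric density" is to give each item a share of the plate weight proportional to 1/density. The sketch below assumes that reading; the function names and the density values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def split_weight_by_inverse_density(total_weight_g, kcal_per_100g):
    """Split a plate weight across items so that each item's share is
    proportional to 1 / caloric density (denser foods get less weight)."""
    for name, density in kcal_per_100g.items():
        if density <= 0:
            raise ValueError(f"caloric density must be positive: {name}")
    inverse = {name: 1.0 / density for name, density in kcal_per_100g.items()}
    norm = sum(inverse.values())
    return {name: total_weight_g * inv / norm for name, inv in inverse.items()}


def total_calories(weights_g, kcal_per_100g):
    """Sum calories over items from per-item weight and density."""
    return sum(weights_g[n] * kcal_per_100g[n] / 100.0 for n in weights_g)


# Illustrative densities in kcal per 100 g (assumed, not from the article).
densities = {"pizza": 250.0, "vegetables": 50.0}
plate_weight_g = 300.0

weighted = split_weight_by_inverse_density(plate_weight_g, densities)
equal = {name: plate_weight_g / len(densities) for name in densities}

weighted_cal = total_calories(weighted, densities)
equal_cal = total_calories(equal, densities)
```

With these toy numbers the pizza receives 50g and the vegetables 250g, and the density-weighted estimate comes out lower than the equal-weight one. A side effect of a pure 1/density split is that every item ends up contributing the same number of calories, so it works as a prior for plates where no per-item weight is available rather than as a measurement.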
Proof: dish_1568061272
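A compound dish labelled "Vegetable Pizza" with a ground truth of 380 calories. Treating the plate as a single item overestimates by 1,600 calories, splitting it with equal weights still leaves an error of 252, and density-weighted splitting brings the error down to 43: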
| Method | Predicted Cal | Error |
|---|---|---|
| Compound "Vegetable Pizza" | 1,980 | 1,600 |
| Split + equal weight | 632 | 252 |
| Split + density-weighted | 337 | 43 |
| Ground truth | 380 | 0 |
Food Segmentation Pipeline
Figure: Food segmentation mask generated by our pipeline. Individual food items are isolated for independent weight estimation by R10b.
Architecture
R10b combines two complementary feature extractors. CurveNet learns food shape and volume cues -- the 3D structure of a pile of rice, the curvature of a muffin, the flatness of a pizza slice. ResNet50 captures texture and food-type features -- the grain of bread, the gloss of a sauce, the color gradients of grilled meat.
A bounding box ratio input encodes the relative size of the food within the frame. Without this, the model has no scale reference -- a close-up of a grape and a wide shot of a watermelon could look identical in terms of shape and texture. The bbox_ratio breaks this ambiguity by encoding how much of the image the food occupies.
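As a rough sketch, the ratio can be computed as the fraction of image area covered by the food's bounding box. The exact definition used by R10b is not given here, so the area-fraction formula, the `(x_min, y_min, x_max, y_max)` pixel format, and the function name are all assumptions.

```python
def bbox_ratio(box, image_width, image_height):
    """Fraction of the image area covered by a bounding box.

    box is (x_min, y_min, x_max, y_max) in pixels; coordinates are
    clipped to the image so a box that spills past the edge cannot
    produce a ratio above 1.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    x_min, y_min, x_max, y_max = box
    x_min = max(0, min(x_min, image_width))
    x_max = max(0, min(x_max, image_width))
    y_min = max(0, min(y_min, image_height))
    y_max = max(0, min(y_max, image_height))
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / (image_width * image_height)


# A close-up fills most of the frame; a wide shot fills little of it.
close_up = bbox_ratio((10, 10, 630, 470), 640, 480)
wide_shot = bbox_ratio((280, 200, 360, 280), 640, 480)
```

Under this definition the ratio only measures how much of the frame the food fills; it carries no absolute scale of its own.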
The model was trained on a combined dataset of 3,125 MFP3D and Nutrition5k dishes with ground-truth weights, and evaluated on a held-out set of 760 dishes. It predicts total plate weight in grams. When combined with food classification and a nutrition database, this weight prediction enables per-item calorie estimation: identify the food, look up its caloric density (calories per 100g), then multiply by the predicted weight in grams and divide by 100.
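The calorie arithmetic is simple enough to show directly. The table below is a hypothetical stand-in for the nutrition database, with illustrative densities; in the real pipeline the label would come from the classifier and the weight from R10b.

```python
# Hypothetical nutrition table: label -> kcal per 100 g (illustrative values).
NUTRITION_DB = {
    "white rice": 130.0,
    "grilled chicken": 165.0,
}


def estimate_calories(label, predicted_weight_g, db=NUTRITION_DB):
    """Calories = caloric density (kcal per 100 g) * weight (g) / 100."""
    if label not in db:
        # A failed match is reported rather than silently guessed.
        raise KeyError(f"no nutrition entry for label: {label!r}")
    return db[label] * predicted_weight_g / 100.0


estimate = estimate_calories("white rice", 180.0)
```

Because the estimate is a product of two terms, an error in the matched density scales the result just as much as an error in the weight, which is why a bad database match can dominate the final calorie error.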
The model is compact at roughly 200MB, fast enough for real-time inference, and runs comfortably on-device. It processes the fit-to-frame cropped image with masking to isolate the food region from background clutter.
What R10b Taught Us
R10b showed that weight estimation from a single image is a solvable problem. With perfect database lookup, its weight predictions give a calorie MAE of 65.3, which is precise enough to produce useful calorie estimates when combined with accurate food identification.
More importantly, R10b was the diagnostic tool that revealed where our pipeline was actually failing. Without component isolation testing, we would have blamed the VLM or the weight model for the pipeline's poor performance. Instead, we discovered that the database matching layer -- the simplest component, the one we assumed was "good enough" -- was the catastrophic bottleneck.
Ultimately, the Boba-2B model achieved better end-to-end results (87.7 calorie MAE, against 92.9 for the combined pipeline) by learning to estimate calories directly, bypassing the pipeline entirely. But R10b was instrumental in showing which components worked and which were broken -- diagnostics that led directly to the agent swarm mapping fix and the final direct estimation approach.