Pipeline vs Direct: Why We Abandoned Our 5-Model Architecture
We built an elegant 5-model pipeline for food nutrition estimation: VLM, SegFormer, GTE-small, R10b, and a database lookup. It achieved a calorie MAE of 125.3. Then a single fine-tuned VLM beat it at 87.7. Here's what we learned about error compounding and when simplicity wins.
The pipeline
The intuition behind the pipeline was sound: decompose a hard problem (photo to calories) into easier subproblems. Each model handles one step. Each step is testable and improvable independently. Five models, each doing what it does best.
Step 1: VLM (0.8B LoRA). Identifies food items in the image. Trained on 298K food images across Food2K (173K), Food-101 (99K), MM-Food (22K), and Indian Food (4K) plus 21K synthetic text samples. Misclassifies roughly 15% of foods, often outputting generic labels like "mixed vegetable" instead of "stir-fried bok choy with garlic".
Step 2: SegFormer. Segments the plate into individual food regions. Each region gets a bounding box and pixel mask for weight estimation.
Step 3: GTE-small. Matches each identified food label to the nearest entry in a 30,000-item USDA nutrition database via text embedding similarity. Even when the VLM's label is correct, it maps 40% of labels to the wrong DB entry -- dried vs raw, fried vs grilled, concentrated vs fresh.
Step 4: R10b. Predicts total plate weight in grams from the image. A CurveNet + ResNet50 model trained on MFP3D and Nutrition5k data, adding roughly ±65 g of weight error.
Step 5: DB Lookup. Combines the matched nutrition data with the predicted weight to compute total calories. No error of its own, but faithfully amplifies every upstream mistake.
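A minimal sketch of what Step 5 computes, assuming the database stores energy as kcal per 100 g and that each matched item already has a weight in grams. The `MatchedItem` class, the `lookup_calories` function, and the food values are illustrative, not taken from our code. How the plate weight is split across items is not covered here, so the per-item weights are simply taken as given.

```python
from dataclasses import dataclass


@dataclass
class MatchedItem:
    """One food item after DB matching (hypothetical structure)."""
    label: str
    kcal_per_100g: float  # assumed unit for the DB's energy density
    weight_g: float       # weight attributed to this item


def lookup_calories(items: list[MatchedItem]) -> float:
    """Scale each item's energy density by its weight and sum the results."""
    return sum(item.kcal_per_100g * item.weight_g / 100.0 for item in items)


# Illustrative values only.
plate = [
    MatchedItem("rice", kcal_per_100g=130.0, weight_g=200.0),
    MatchedItem("chicken", kcal_per_100g=165.0, weight_g=100.0),
]
print(lookup_calories(plate))  # 425.0
```

The arithmetic is exact, which is the point: any error in `kcal_per_100g` or `weight_g` passes straight through to the total.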
The numbers
| Approach | Models | Cal MAE | Size |
|---|---|---|---|
| Full pipeline (Run E) | 5 | 125.3 | ~2GB total |
| Direct VLM (0.8B) | 1 | 112.3 | ~500MB |
| Direct VLM (2B) | 1 | 87.7 | 2B params |
A single 2B model trained on 4,051 Nutrition5k samples outperformed a 5-model pipeline trained on 298,000 images. The pipeline used roughly 74x more training data and produced worse results.
Error compounding
The fundamental problem with pipelines is that errors don't add -- they multiply. Each step in the pipeline introduces its own error, and downstream steps have no way to recover from upstream mistakes. A misclassified food gets matched to the wrong DB entry, which gets the wrong caloric density, which gets multiplied by an imperfect weight estimate.
VLM misclassification (15%)
The VLM, trained on classification data, learned to output generic labels. "Mixed vegetable" instead of "stir-fried bok choy with garlic". "Meat dish" instead of "grilled lamb chop". These generic labels are technically correct for classification but disastrous for nutrition lookup. It also consistently output exactly 2 items regardless of plate complexity.
GTE-small mismatch (40% of correct)
Even when the VLM classified correctly, GTE-small matched 40% of labels to the wrong database variant. "Egg whites" to "Egg, white, dried" (382 cal) instead of "Egg, white, raw" (52 cal). Isolated testing showed 153.1 MAE from this component alone.
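To make the egg-white example concrete, here is a small sketch. It assumes the two figures are kcal per 100 g, and the portion weights are made up for illustration. `mismatch_error_kcal` is a hypothetical helper, not part of the pipeline.

```python
RAW_KCAL_PER_100G = 52.0     # "Egg, white, raw"
DRIED_KCAL_PER_100G = 382.0  # "Egg, white, dried"


def mismatch_error_kcal(right_density: float, wrong_density: float,
                        weight_g: float) -> float:
    """Calorie error from using the wrong DB entry at a given weight.

    Densities are assumed to be kcal per 100 g.
    """
    return (wrong_density - right_density) * weight_g / 100.0


# The overestimate is a constant factor, independent of portion size.
print(round(DRIED_KCAL_PER_100G / RAW_KCAL_PER_100G, 1))  # 7.3

# Absolute error for a few illustrative portion weights.
for weight_g in (50.0, 100.0, 200.0):
    print(weight_g, mismatch_error_kcal(RAW_KCAL_PER_100G,
                                        DRIED_KCAL_PER_100G, weight_g))
# 50.0 165.0
# 100.0 330.0
# 200.0 660.0
```

A perfect weight estimate does not help here: the error scales with the portion, so larger plates produce larger misses.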
R10b weight error (±65 g)
Weight estimation from a 2D image is fundamentally limited. R10b achieved 65.3 MAE with perfect DB matches -- respectable, but that error gets multiplied by the caloric density error from the steps above.
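A short sketch of why a weight error and a density error multiply rather than add, assuming calories are computed as density times weight. The function name and the error values are illustrative.

```python
def combined_relative_error(density_error: float, weight_error: float) -> float:
    """Relative calorie error when calories = density * weight.

    Errors are signed fractions: 0.10 means a 10% overestimate,
    -0.10 means a 10% underestimate.
    """
    return (1.0 + density_error) * (1.0 + weight_error) - 1.0


# Two 10% overestimates give 21%, not 20%, because of the cross term.
print(round(combined_relative_error(0.10, 0.10), 2))  # 0.21

# A doubled density with a 20% weight overestimate gives +140%.
print(round(combined_relative_error(1.0, 0.20), 2))  # 1.4
```

The cross term is negligible when both errors are small, and dominant when the density is off by a large factor, as with a wrong DB variant.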
The compounding effect
A 10% error at each of 5 steps doesn't produce 50% total error. If each step is right 90% of the time and failures are independent, the whole chain is right only 0.9^5 = 59% of the time, so the error compounds to 41%. But the real problem was worse: the errors were correlated. Misclassified foods tended to match to high-calorie DB variants, and compound labels treated as single items got the full plate weight assigned to a high-calorie-density food. The top 10 worst errors were 1,300 to 1,970 calories each.
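The compounding arithmetic can be checked in a few lines. This sketch assumes each step fails independently, which is the optimistic case. It does not model the correlated errors described above.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step_accuracy ** steps


p = chain_success(0.9, 5)
print(round(p, 2))      # 0.59
print(round(1 - p, 2))  # 0.41
```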
The turning point
The direct estimation approach bypasses all five pipeline stages. A single VLM takes the photo and outputs structured JSON with per-item nutrition: food name, portion weight, calories, protein, carbs, and fat. No segmentation, no embedding search, no database lookup. The model learns the mapping from visual features to nutrition values end-to-end.
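A sketch of consuming that kind of structured output. The schema is an assumption made for illustration: the field names (`items`, `name`, `weight_g`, `calories`, `protein_g`, `carbs_g`, `fat_g`) and the sample values are hypothetical. The actual model's JSON layout may differ.

```python
import json

# Hypothetical field names for each per-item record.
REQUIRED_FIELDS = ("name", "weight_g", "calories", "protein_g", "carbs_g", "fat_g")


def total_calories(model_output: str) -> float:
    """Parse the model's JSON text and sum calories across items.

    Raises ValueError if an item lacks one of the expected fields.
    """
    data = json.loads(model_output)
    items = data["items"]
    for item in items:
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise ValueError(f"item missing fields: {missing}")
    return float(sum(item["calories"] for item in items))


# Illustrative output, not a real model response.
example = json.dumps({
    "items": [
        {"name": "grilled lamb chop", "weight_g": 150, "calories": 400,
         "protein_g": 35, "carbs_g": 0, "fat_g": 28},
        {"name": "stir-fried bok choy with garlic", "weight_g": 120,
         "calories": 60, "protein_g": 2, "carbs_g": 4, "fat_g": 4},
    ]
})
print(total_calories(example))  # 460.0
```

Validating the fields up front matters more here than in the pipeline, since free-form model output is the only interface.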
The 0.8B direct model hit 112.3 MAE on just 4,051 training samples. The merged sift1-0.8B got to 97.3. And the 2B full fine-tune achieved 87.7 MAE -- better than the pipeline in every dimension: accuracy, size, latency, and simplicity.
When pipelines win
This is not an argument that pipelines are always wrong. Pipelines have real advantages that direct models lack:
Interpretability. When the pipeline predicts 800 calories for a salad, you can inspect exactly where it went wrong. The VLM said "caesar salad", GTE-small matched to "Caesar salad, with chicken, fast food", R10b estimated 350g. You know the DB match was wrong. With a direct model, 800 calories for a salad is a black box. Was it the visual features? The portion estimate? The caloric density mapping? You cannot tell.
Debuggability. Our pipeline isolation test -- giving each component perfect inputs for everything except what it controls -- is what revealed that GTE-small was the bottleneck, not R10b. This diagnostic technique is impossible with an end-to-end model. The agent swarm fix that dropped DB lookup MAE from 153.1 to 58.0 only existed because the pipeline let us isolate the problem.
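The isolation test can be sketched as follows: for one component at a time, use its real prediction and substitute ground truth for everything else, then measure calorie MAE. This is a simplified two-component version with toy data. The sample fields, component names, and numbers are assumptions for illustration, not our evaluation harness.

```python
from statistics import mean

# Toy samples; every value here is made up for illustration.
SAMPLES = [
    {"true_kcal_per_100g": 52.0, "pred_kcal_per_100g": 382.0,
     "true_weight_g": 100.0, "pred_weight_g": 120.0},
    {"true_kcal_per_100g": 200.0, "pred_kcal_per_100g": 200.0,
     "true_weight_g": 300.0, "pred_weight_g": 250.0},
]


def calories(kcal_per_100g: float, weight_g: float) -> float:
    return kcal_per_100g * weight_g / 100.0


def isolated_mae(samples: list[dict], component: str) -> float:
    """Calorie MAE when only `component` uses its own prediction.

    component: "db_match" or "weight" (hypothetical names).
    """
    if component not in ("db_match", "weight"):
        raise ValueError(f"unknown component: {component}")
    errors = []
    for s in samples:
        truth = calories(s["true_kcal_per_100g"], s["true_weight_g"])
        density = (s["pred_kcal_per_100g"] if component == "db_match"
                   else s["true_kcal_per_100g"])
        weight = (s["pred_weight_g"] if component == "weight"
                  else s["true_weight_g"])
        errors.append(abs(calories(density, weight) - truth))
    return mean(errors)


print(round(isolated_mae(SAMPLES, "db_match"), 1))  # 165.0
print(round(isolated_mae(SAMPLES, "weight"), 1))    # 55.2
```

Comparing the per-component numbers side by side is what points at the bottleneck. In this toy data, as in our real runs, the DB match dominates.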
Independent improvement. Each pipeline component can be upgraded without retraining the whole system. Swap GTE-small for a better embedding model? Just re-run the matching step. Improve R10b with more training data? Everything downstream benefits automatically. Direct models require full retraining for any improvement.
But for shipping a product where MAE is the metric that matters, the direct model wins. The pipeline's debuggability was invaluable for discovering the solution, but the solution itself was a simpler architecture.
Lessons
- Errors compound multiplicatively through pipeline stages. Five 90%-accurate components don't give you 90% accuracy. If their failures are independent, they give you 59%. And the errors are often correlated -- misclassifications systematically match to high-calorie variants, creating catastrophic outliers.
- A single well-trained model often beats a pipeline of specialists. The 2B direct model achieved 87.7 MAE on 4K training samples. The 5-model pipeline achieved 125.3 on 298K. More components meant more failure modes, not more capability. Model capacity and end-to-end training beat decomposition.
- But pipelines are invaluable for debugging. Our pipeline isolation test is what revealed the GTE-small bottleneck. Without the pipeline, we would never have known that data quality -- not model architecture -- was the limiting factor. Build the pipeline to understand the problem. Then ship the simpler solution.
The pipeline was not wasted work. It taught us that GTE-small was the bottleneck, which led to the 10-agent swarm mapping fix. R10b proved weight estimation was solvable at 65.3 MAE. The merged sift1-0.8B showed that food knowledge transfer works. Each experiment narrowed the search space until the solution -- a bigger model, direct estimation, quantised to fit -- became obvious.