How a 10-Agent Swarm Fixed Our Biggest Bottleneck
Our food nutrition model was stuck at 153.1 MAE on database lookups. The culprit: GTE-small, a general-purpose embedding model, was silently matching "eggs" to "eggs powder" and "chicken breast" to "chicken broth". A 10-agent swarm rebuilt 21,000 food classification labels overnight, dropping the error by 62%.
The problem
Our nutrition pipeline used GTE-small, a general-purpose text embedding model, to match VLM-identified food labels to entries in a 30,000-item USDA nutrition database. The approach was straightforward: embed the food label, find the nearest neighbour in the database, return its nutritional values.
It was silently catastrophic. GTE-small understood semantic similarity but not nutritional context. "Egg whites" matched to "Egg, white, dried" (382 cal/100g) instead of "Egg, white, raw" (52 cal/100g). "Chicken breast" matched to "Chicken breast, breaded, fried" instead of "Chicken breast, roasted". Every match was semantically reasonable but nutritionally wrong -- the dried, fried, processed, or concentrated variant was often closer in embedding space.
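The baseline lookup described above amounts to a nearest-neighbour search by cosine similarity. The sketch below is a toy illustration only: the 3-dimensional vectors are hand-picked stand-ins, not real GTE-small embeddings, and `cosine` and `nearest` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, db_vecs):
    """Return the database entry name whose vector is most similar to the query."""
    return max(db_vecs, key=lambda name: cosine(query_vec, db_vecs[name]))

# Toy vectors (assumed for illustration, not model output).
db_vecs = {
    "Egg, white, dried": [0.9, 0.4, 0.1],
    "Egg, white, raw": [0.7, 0.1, 0.7],
}
query = [1.0, 0.3, 0.2]  # stand-in for the embedding of "Egg whites"

best = nearest(query, db_vecs)
```

With these toy vectors the dried entry wins on similarity, which is the failure mode in miniature: nothing in the search step knows that the two entries differ by hundreds of calories per 100g.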
When we isolated this component with perfect weight estimation, GTE-small's database matching alone produced 153.1 calorie MAE. This was worse than the entire pipeline end-to-end, meaning the matching step was actively destroying information from other components.
The discovery
We'd been optimising the wrong component for a week. We assumed R10b (the weight estimation model) was the bottleneck because weight estimation is an inherently hard problem -- estimating grams from a 2D photo. But when we ran component isolation tests, R10b scored 65.3 MAE with perfect DB matches. GTE-small scored 153.1 with perfect weight. The math was unambiguous.
The isolation test was simple but revealing: give each component perfect inputs for everything except what it controls, then measure how much error it alone introduces. By that measure, R10b's 65.3 MAE was respectable for estimating weight from a photo, while the embedding model, a component we'd assumed was "good enough," was the single biggest source of error.
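A minimal sketch of such an isolation test, assuming calories are computed as grams times kcal per 100g. The `Sample` fields and the `isolate` function are hypothetical names, and the two samples are illustrative rather than real evaluation data.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    true_grams: float           # ground-truth weight
    true_kcal_per_100g: float   # density of the correct database entry
    pred_grams: float           # weight model output
    pred_kcal_per_100g: float   # density of the entry the matcher picked

def kcal(grams, kcal_per_100g):
    return grams * kcal_per_100g / 100.0

def mae(predictions, truths):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)

def isolate(samples):
    """Score each component with perfect inputs for everything else."""
    truth = [kcal(s.true_grams, s.true_kcal_per_100g) for s in samples]
    return {
        # Predicted weight, perfect database match.
        "weight_only": mae([kcal(s.pred_grams, s.true_kcal_per_100g) for s in samples], truth),
        # Perfect weight, predicted database match.
        "lookup_only": mae([kcal(s.true_grams, s.pred_kcal_per_100g) for s in samples], truth),
        # Both predicted: the normal end-to-end pipeline.
        "end_to_end": mae([kcal(s.pred_grams, s.pred_kcal_per_100g) for s in samples], truth),
    }

# Illustrative samples: one lookup mismatch (raw vs dried egg white), one correct lookup.
samples = [Sample(100, 52, 110, 382), Sample(200, 165, 180, 165)]
report = isolate(samples)
```

Comparing `weight_only` against `lookup_only` shows which component contributes more error when everything else is held perfect.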
The fix: 10-agent swarm
We couldn't manually review 21,000 food labels against a 30,000-entry database. So we built a swarm -- 10 Claude agents running in parallel, each with access to the USDA database, food knowledge, and a clear mandate: for each food label, find the exact correct database entry. Not the semantically closest. The nutritionally correct one.
Each agent processed roughly 2,100 food labels. The labels fell into the following categories:
1,260 multi-item splits. Compound labels like "pizza with vegetables" were split into individual items. Each item was mapped to its own database entry, enabling density-weighted calorie calculation later.
19,940 single items. Each label was reviewed and mapped to the exact correct USDA entry, accounting for preparation method (raw vs cooked vs dried), form (whole vs sliced vs powdered), and cooking method (grilled vs fried vs steamed).
The final mapping contained 52,426 entries: the 21,000 originals plus lowercase and title-case variants for case-insensitive lookup. At inference, a compound-aware lookup always splits on "with" and "and" first, tries recombinations largest-first, and then checks the mapping, then a direct database match, and finally a GTE fallback with a strict 0.95 similarity threshold. This cascade achieved 91% mapping coverage.
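One way to generate the case variants is sketched below. This is an assumption about the approach, not the actual build script, and `add_case_variants` is a hypothetical name.

```python
def add_case_variants(mapping):
    """Expand a label -> entry mapping with lowercase and title-case keys.

    Variants that coincide with a key already present are stored once,
    so the result can hold fewer than three keys per original label.
    """
    expanded = {}
    for label, entry in mapping.items():
        for variant in (label, label.lower(), label.title()):
            # setdefault keeps the first entry seen for a given key.
            expanded.setdefault(variant, entry)
    return expanded

base = {
    "Egg whites": "Egg, white, raw",
    "chicken breast": "Chicken breast, roasted",
}
expanded = add_case_variants(base)
```

Here "Egg whites" yields three distinct keys, while "chicken breast" yields only two because its lowercase form is the original.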
The results
The before/after comparison showed the most dramatic improvement in the entire project.
| Metric | Before (GTE-small) | After (Our mapping) | Improvement |
|---|---|---|---|
| DB Lookup MAE | 153.1 | 58.0 | -62% |
| Combined Pipeline MAE | ~200 | 92.9 | -54% |
| R10b weight (reference) | 65.3 | 65.3 | unchanged |
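
The Improvement column is the relative change in MAE, which can be checked directly from the table values. In this short sketch `pct_change` is a hypothetical helper, and the "before" figure for the combined pipeline is taken as exactly 200 even though the table gives it only as approximately 200.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent (negative means lower error)."""
    return (after - before) / before * 100.0

db_lookup = round(pct_change(153.1, 58.0))   # DB Lookup MAE row
combined = round(pct_change(200.0, 92.9))    # Combined Pipeline MAE row, before taken as 200
```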
Smart lookup at inference
The mapping enabled a multi-step smart lookup at inference time. Instead of a single GTE-small embedding search, the system follows a cascade:
Step 1: Split compound labels. Always split on "with" and "and" first. "Chicken with rice and vegetables" becomes three separate lookups.
Step 2: Recombination search. Try recombinations largest-first. "Grilled chicken breast" tries "grilled chicken breast", then "chicken breast", then "chicken".
Step 3: Mapping lookup. Check the 52K-entry mapping table.
Step 4: Direct database match. If no mapping hit, try exact database match.
Step 5: GTE fallback. Only fall back to embedding search if all else fails, with a strict similarity threshold above 0.95 and remainder detection for partial matches.
The remaining 9% that fell through to GTE had good matches (0.90+ similarity), with only 7 labels identified as potentially risky.
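The five-step cascade above can be sketched as follows. Several details are assumptions: candidates are taken to be all contiguous word spans, longest first (the example in Step 2 shows only some of these); each candidate is checked against the mapping and then the database before moving to the next; mapping and database keys are assumed lowercase; and `embed_search` is a hypothetical callable standing in for the GTE-small search, returning an entry and a similarity score. Remainder detection is omitted.

```python
import re
from typing import Callable, Optional

SPLIT_RE = re.compile(r"\s+(?:with|and)\s+", re.IGNORECASE)

def split_compound(label: str) -> list[str]:
    """Step 1: split a compound label on the words 'with' and 'and'."""
    return [part.strip() for part in SPLIT_RE.split(label) if part.strip()]

def candidates(item: str) -> list[str]:
    """Step 2: contiguous word spans of the item, longest first."""
    words = item.lower().split()
    return [
        " ".join(words[start:start + size])
        for size in range(len(words), 0, -1)
        for start in range(len(words) - size + 1)
    ]

def lookup_item(
    item: str,
    mapping: dict[str, str],
    db_index: dict[str, str],
    embed_search: Optional[Callable[[str], tuple[str, float]]] = None,
    threshold: float = 0.95,
) -> tuple[Optional[str], str]:
    """Resolve one item; returns (entry or None, which step produced it)."""
    for cand in candidates(item):
        if cand in mapping:        # Step 3: curated mapping table
            return mapping[cand], "mapping"
        if cand in db_index:       # Step 4: exact database name
            return db_index[cand], "db"
    if embed_search is not None:   # Step 5: embedding fallback
        entry, score = embed_search(item)
        if score > threshold:
            return entry, "gte"
    return None, "miss"

def lookup_label(label, mapping, db_index, embed_search=None):
    """Run the full cascade on a possibly compound label."""
    return [
        lookup_item(item, mapping, db_index, embed_search)
        for item in split_compound(label)
    ]
```

For example, with a mapping that contains "chicken breast", the item "Grilled chicken breast" resolves through the mapping once the longer spans miss. Returning the step name alongside the entry makes it easy to measure how many labels fall through to the embedding fallback.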
The lesson
The takeaway for anyone building ML pipelines: isolate and benchmark each component independently before optimising. The bottleneck is almost never where you think it is. And when you find it, the fix might not be a better model -- it might be better data.