Engineering · Agent Systems · Data Quality · March 30, 2026 · 6 min read

How a 10-Agent Swarm Fixed Our Biggest Bottleneck

Our food nutrition model was stuck at 153.1 MAE on database lookups. The culprit: GTE-small, a general-purpose embedding model, was silently matching "eggs" to "eggs powder" and "chicken breast" to "chicken broth". A 10-agent swarm rebuilt 21,000 food classification labels overnight, dropping the error by 62%.

MAE before (GTE-small): 153.1
MAE after (our mapping): 58.0
Error reduction: -62%

The problem

Our nutrition pipeline used GTE-small, a general-purpose text embedding model, to match VLM-identified food labels to entries in a 30,000-item USDA nutrition database. The approach was straightforward: embed the food label, find the nearest neighbour in the database, return its nutritional values.
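As a rough illustration of the embed-and-nearest-neighbour approach described above, here is a minimal sketch. It is not the production code: the 3-dimensional vectors are made-up stand-ins for real GTE-small embeddings, and `cosine` and `nearest_entry` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_entry(query_vec, db_vectors):
    """Return (entry_name, similarity) for the closest database vector."""
    return max(
        ((name, cosine(query_vec, vec)) for name, vec in db_vectors.items()),
        key=lambda pair: pair[1],
    )

# Toy vectors (assumption, illustrative only): a real system would embed
# every database entry name with the embedding model.
DB_VECTORS = {
    "Egg, white, raw": [0.9, 0.1, 0.4],
    "Egg, white, dried": [0.9, 0.3, 0.1],
}

query = [0.9, 0.3, 0.2]  # pretend embedding of the label "egg whites"
match, score = nearest_entry(query, DB_VECTORS)
print(match, round(score, 3))
```

With these toy vectors the nearest neighbour is the dried variant: the lookup has no notion of which variant is nutritionally appropriate, only of which vector is closest.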

It was silently catastrophic. GTE-small understood semantic similarity but not nutritional context. "Egg whites" matched to "Egg, white, dried" (382 cal/100g) instead of "Egg, white, raw" (52 cal/100g). "Chicken breast" matched to "Chicken breast, breaded, fried" instead of "Chicken breast, roasted". Every match was semantically reasonable but nutritionally wrong -- the dried, fried, processed, or concentrated variant was often closer in embedding space.
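To make the size of a single wrong match concrete, the arithmetic can be sketched as follows. The calorie densities are the two egg-white figures quoted above; the helper names and the portion sizes are assumptions chosen for illustration.

```python
# kcal per 100 g, as quoted in the text above.
KCAL_PER_100G = {
    "Egg, white, raw": 52,
    "Egg, white, dried": 382,
}

def calories(grams, entry):
    """Calories for a portion of `grams`, given a database entry name."""
    return grams * KCAL_PER_100G[entry] / 100

def mismatch_error(grams, correct_entry, matched_entry):
    """Absolute calorie error caused purely by picking the wrong entry."""
    return abs(calories(grams, matched_entry) - calories(grams, correct_entry))

# Hypothetical 100 g portion with a perfect weight estimate.
error = mismatch_error(100, "Egg, white, raw", "Egg, white, dried")
print(error)  # 330.0
```

Because the error scales with portion size, a perfect weight estimate cannot compensate for a wrong entry: the larger the serving, the larger the miss.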

The silent failure: GTE-small matched semantically similar but nutritionally wrong database entries.

When we isolated this component with perfect weight estimation, GTE-small's database matching alone produced 153.1 calorie MAE. This was worse than the entire pipeline end-to-end, meaning the matching step was actively destroying information from other components.

The discovery

We'd been optimising the wrong component for a week. We assumed R10b (the weight estimation model) was the bottleneck because weight estimation is an inherently hard problem -- estimating grams from a 2D photo. But when we ran component isolation tests, R10b scored 65.3 MAE with perfect DB matches. GTE-small scored 153.1 with perfect weight. The math was unambiguous.

"GTE-small was the bottleneck, NOT R10b. It picked wrong food variants -- dried vs raw, meatless vs real. We'd been optimising the wrong component for a week."

The isolation test was simple but revealing: give each component perfect inputs for everything except what it controls, then measure how much error it alone introduces. By that measure, R10b's 65.3 MAE was respectable for estimating weight from a photo, while GTE-small's 153.1 MAE made the embedding model, a component we'd assumed was "good enough," the single biggest source of error.
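A component isolation test of this kind can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the actual test harness: the sample fields, the `isolate` function, and the toy numbers are all hypothetical.

```python
def mae(predicted, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def calories(grams, kcal_per_100g):
    return grams * kcal_per_100g / 100

def isolate(samples, component):
    """Calorie MAE when only `component` ('weight' or 'lookup') uses its own
    prediction and the other input is replaced with ground truth."""
    preds, truths = [], []
    for s in samples:
        grams = s["pred_grams"] if component == "weight" else s["true_grams"]
        density = (
            s["pred_kcal_per_100g"] if component == "lookup"
            else s["true_kcal_per_100g"]
        )
        preds.append(calories(grams, density))
        truths.append(calories(s["true_grams"], s["true_kcal_per_100g"]))
    return mae(preds, truths)

# Toy samples (made up for illustration, not project data).
SAMPLES = [
    {"true_grams": 100, "pred_grams": 120,
     "true_kcal_per_100g": 50, "pred_kcal_per_100g": 350},
    {"true_grams": 200, "pred_grams": 150,
     "true_kcal_per_100g": 200, "pred_kcal_per_100g": 200},
]

weight_only = isolate(SAMPLES, "weight")  # lookup is perfect
lookup_only = isolate(SAMPLES, "lookup")  # weight is perfect
print(weight_only, lookup_only)  # 55.0 150.0
```

Comparing the two numbers side by side is the whole point: whichever component produces the larger isolated error is the one worth fixing first.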

The fix: 10-agent swarm

We couldn't manually review 21,000 food labels against a 30,000-entry database. So we built a swarm -- 10 Claude agents running in parallel, each with access to the USDA database, food knowledge, and a clear mandate: for each food label, find the exact correct database entry. Not the semantically closest. The nutritionally correct one.

Agents: 10 parallel
Labels processed: 21,000
Final mapping: 52,426 entries
Runtime: ~8 hours

Each agent processed roughly 2,100 food labels. The labels fell into the following types:

1,260 multi-item splits. Compound labels like "pizza with vegetables" were split into individual items. Each item was mapped to its own database entry, enabling density-weighted calorie calculation later.

19,940 single items. Each label was reviewed and mapped to the exact correct USDA entry, accounting for preparation method (raw vs cooked vs dried), form (whole vs sliced vs powdered), and cooking method (grilled vs fried vs steamed).
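The even split of roughly 2,100 labels per agent can be sketched as a simple sharding step followed by a parallel map. This is a hedged sketch only: `review_chunk` is a placeholder for one agent's work (the actual agent prompts and tooling are not shown in this post), and `shard` and `run_swarm` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(labels, n_agents):
    """Split labels into n_agents near-equal contiguous chunks."""
    size, extra = divmod(len(labels), n_agents)
    chunks, start = [], 0
    for i in range(n_agents):
        end = start + size + (1 if i < extra else 0)
        chunks.append(labels[start:end])
        start = end
    return chunks

def review_chunk(chunk):
    """Placeholder for one agent: map each label to a list of database
    entries (a list, so compound labels can map to several items)."""
    return {label: [] for label in chunk}

def run_swarm(labels, n_agents=10):
    """Run one worker per shard and merge the partial mappings."""
    merged = {}
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        for partial in pool.map(review_chunk, shard(labels, n_agents)):
            merged.update(partial)
    return merged

labels = [f"label_{i}" for i in range(21000)]  # stand-in label names
chunks = shard(labels, 10)
mapping = run_swarm(labels)
print(len(chunks), len(chunks[0]), len(mapping))  # 10 2100 21000
```

Contiguous chunks keep each agent's output easy to audit and re-run in isolation if one shard needs a second pass.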

The final mapping contained 52,426 entries: the 21,000 original labels plus lowercase and title-case variants for case-insensitive lookup. At inference, a compound-aware lookup always splits on "with" and "and" first, tries recombinations largest-first, and then checks, in order, the mapping table, a direct database match, and finally a GTE fallback with a strict 0.95 similarity threshold. This cascade achieved 91% mapping coverage.
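Adding lowercase and title-case variants to a mapping might look like the sketch below. The function name and the sample entries are assumptions for illustration; the only behaviour taken from the text is that each original label gains lowercase and title-case keys.

```python
def expand_case_variants(mapping):
    """Return a new mapping with lowercase and title-case keys added.

    Original keys are inserted first so they always win on collisions;
    variants identical to an existing key do not create a new entry.
    """
    expanded = dict(mapping)
    for label, entries in mapping.items():
        for variant in (label.lower(), label.title()):
            expanded.setdefault(variant, entries)
    return expanded

# Toy mapping (illustrative entries only).
base = {
    "Egg whites": ["Egg, white, raw"],
    "chicken breast": ["Chicken breast, roasted"],
}
expanded = expand_case_variants(base)
print(sorted(expanded))
```

Here "Egg whites" yields three keys, while "chicken breast" yields only two because it is already lowercase. Deduplication of this kind is one plausible reason a variant-expanded table ends up smaller than three times the number of original labels.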

The results

The before/after comparison was the most dramatic improvement in the entire project.

Database lookup MAE: from 153.1 (GTE-small) to 58.0 (agent mapping). Combined pipeline from ~200 to 92.9.

| Metric | Before (GTE-small) | After (our mapping) | Improvement |
| --- | --- | --- | --- |
| DB lookup MAE | 153.1 | 58.0 | -62% |
| Combined pipeline MAE | ~200 | 92.9 | -54% |
| R10b weight (reference) | 65.3 | 65.3 | unchanged |
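The improvement figures follow directly from the before and after values. A quick check, assuming the approximate ~200 baseline is taken as exactly 200:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

db_lookup = percent_change(153.1, 58.0)  # rounds to -62
pipeline = percent_change(200.0, 92.9)   # rounds to -54 (baseline is approximate)

print(round(db_lookup), round(pipeline))
```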

Smart lookup at inference

The mapping enabled a multi-step smart lookup at inference time. Instead of a single GTE-small embedding search, the system follows a cascade:

Step 1: Split compound labels. Always split on "with" and "and" first. "Chicken with rice and vegetables" becomes three separate lookups.

Step 2: Recombination search. Try recombinations largest-first. "Grilled chicken breast" tries "grilled chicken breast", then "chicken breast", then "chicken".

Step 3: Mapping lookup. Check the 52K-entry mapping table.

Step 4: Direct database match. If no mapping hit, try exact database match.

Step 5: GTE fallback. Only fall back to embedding search if all else fails, with a strict similarity threshold above 0.95 and remainder detection for partial matches.

The remaining 9% that fell through to GTE had good matches (0.90+ similarity), with only 7 labels identified as potentially risky.
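The five-step cascade above can be sketched roughly as follows. This is an illustration under several assumptions, not the production lookup: the exact recombination rule is not specified here, so the sketch tries every contiguous word span from longest to shortest; `embed_search` is a hypothetical callable returning `(entry, similarity)`; `db_index` is assumed to map lowercased entry names to entries; and remainder detection for partial matches is omitted.

```python
import re

def split_compound(label):
    """Step 1: split on the standalone words 'with' and 'and'."""
    parts = re.split(r"\s+(?:with|and)\s+", label.strip().lower())
    return [p for p in parts if p]

def recombinations(item):
    """Step 2: contiguous word spans, longest first (assumed rule)."""
    words = item.split()
    n = len(words)
    return [
        " ".join(words[start:start + length])
        for length in range(n, 0, -1)
        for start in range(n - length + 1)
    ]

def lookup_item(item, mapping, db_index, embed_search, threshold=0.95):
    for candidate in recombinations(item):
        if candidate in mapping:               # Step 3: mapping table
            return mapping[candidate], "mapping"
        if candidate in db_index:              # Step 4: exact database match
            return db_index[candidate], "db"
    entry, similarity = embed_search(item)     # Step 5: embedding fallback
    if similarity > threshold:
        return entry, "gte"
    return None, "unmatched"

def lookup_label(label, mapping, db_index, embed_search):
    """Run the cascade for every item in a (possibly compound) label."""
    return [
        lookup_item(item, mapping, db_index, embed_search)
        for item in split_compound(label)
    ]

# Toy tables (illustrative names only).
MAPPING = {"chicken breast": "Chicken breast, roasted", "rice": "Rice, cooked"}
DB_INDEX = {"vegetables": "Vegetables, mixed"}
low_confidence = lambda text: ("Some entry", 0.90)  # stub embedding search

result = lookup_label(
    "Grilled chicken breast with rice and vegetables",
    MAPPING, DB_INDEX, low_confidence,
)
print(result)
```

In this toy run the three items resolve through the mapping table, the mapping table, and the direct database match respectively, so the embedding search is never consulted. Keeping the fallback last, behind a strict threshold, means a low-confidence embedding match returns nothing rather than a plausible but wrong entry.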

The lesson

Data quality > model quality. We spent weeks fine-tuning models, experimenting with architectures, and scaling training data from 4K to 298K samples. The single biggest accuracy improvement came from fixing the data layer -- building correct food-to-database mappings. No model retraining. No architecture changes. Just better data. The mapping alone reduced DB lookup MAE by 62%, from 153.1 to 58.0.

The GTE-small problem was invisible for weeks because the pipeline produced plausible outputs. "Reasonable" was hiding systematic errors: every egg dish was off by 300+ calories, every chicken dish was biased toward fried variants.

In a multi-step pipeline, the weakest component defines the ceiling. And the weakest component is often the one you assumed was working.

The takeaway for anyone building ML pipelines: isolate and benchmark each component independently before optimising. The bottleneck is almost never where you think it is. And when you find it, the fix might not be a better model -- it might be better data.

Doses AI Engineering · March 2026