Cross-Model Alignment Without Retraining: Matching Food Labels Across Independent Models
We present a method for aligning outputs from three independently trained food recognition models without any joint training or fine-tuning. Using text embeddings and the Hungarian algorithm, we achieve 85% exact-match alignment between a VLM's natural language food descriptions and a semantic segmentation model's fixed vocabulary -- enabling accurate per-item weight distribution in multi-food images.
The alignment problem
Estimating per-item calorie content from a photograph of a multi-item meal requires three capabilities: identifying each food item, estimating the weight of each item, and looking up nutritional density. Our system uses three independently trained models for these tasks -- and they speak completely different languages.
The Qwen3.5-0.8B VLM (fine-tuned, ~850MB) sees the full plate and outputs natural language: "grilled chicken breast with herbs." The SegFormer-b0 FoodSeg103 (3.7M parameters, ~14MB) produces per-pixel semantic segmentation with 104 fixed category labels (103 food classes plus background) -- it sees the same plate and outputs class ID 47 ("chicken duck"). The GTE-small embedding model (33M parameters, ~92MB total including its precomputed database embeddings) bridges the gap.
These models were trained on different datasets, with different objectives, using different label vocabularies. The VLM has never seen a SegFormer label. The SegFormer has never seen natural language. Yet we must determine that "grilled chicken breast with herbs" and "chicken duck" refer to the same spatial region on the plate.
The Hungarian algorithm approach
The alignment method is a four-step process. All computation happens in text embedding space -- no image features are shared between models, no joint training is required, and the algorithm is model-agnostic.
1. Encode all VLM food names using GTE-small to produce matrix V of shape (N_vlm x 384).
2. Encode all SegFormer region labels using GTE-small to produce matrix S of shape (N_seg x 384).
3. Compute the cosine similarity matrix C = V . S^T of shape (N_vlm x N_seg), with the rows of V and S L2-normalized so that each dot product is a cosine. Each cell represents how semantically similar a VLM name is to a SegFormer label.
4. Apply the Hungarian algorithm to the negated matrix -C to find the 1-to-1 assignment that maximizes total similarity. With 3-8 items per plate, the assignment problem is trivially small -- microseconds to solve via scipy.optimize.linear_sum_assignment.
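The four steps can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual code: the function names `l2_normalize` and `align` are hypothetical, the embeddings are assumed to have been produced by GTE-small elsewhere, and toy 3-dimensional vectors stand in for the 384-dimensional ones.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def align(V: np.ndarray, S: np.ndarray) -> list[tuple[int, int, float]]:
    """Return (vlm_index, seg_index, cosine_similarity) for each assigned pair.

    V: (N_vlm, d) embeddings of VLM food names.
    S: (N_seg, d) embeddings of SegFormer region labels.
    """
    C = l2_normalize(V) @ l2_normalize(S).T   # (N_vlm, N_seg) cosine similarities
    rows, cols = linear_sum_assignment(-C)    # minimizing -C maximizes total similarity
    return [(int(r), int(c), float(C[r, c])) for r, c in zip(rows, cols)]


# Toy vectors standing in for GTE-small embeddings (illustrative values only).
V = np.array([[1.0, 0.1, 0.0],    # e.g. "white rice"
              [0.0, 1.0, 0.2]])   # e.g. "fried egg"
S = np.array([[0.1, 0.9, 0.0],    # e.g. "egg"
              [0.9, 0.0, 0.1],    # e.g. "rice"
              [0.0, 0.1, 1.0]])   # an extra region that stays unmatched
pairs = align(V, S)
```

Because `linear_sum_assignment` accepts rectangular matrices, the same call covers plates where the two models disagree on the number of items; only min(N_vlm, N_seg) pairs are returned.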
The key insight is that semantic similarity between food names is sufficient for alignment even when exact labels differ. "White rice" and "rice" produce a cosine similarity of 0.933. "Meat curry" and "sauce" produce 0.807 -- because the SegFormer labeled the curry region as "sauce," which is semantically close enough for the 1-to-1 constraint to resolve correctly.
Validation results
We tested alignment on multi-item food photographs from MM-Food-100K. The system correctly matched all primary food items (those occupying more than 5% of plate pixels) in every test case. Of 20 test queries, 17 achieved exact match -- an 85% rate. The 3 mismatches involved minor garnishes and shared regions, each contributing less than 3% of total plate area.
| VLM Name | SegFormer Label | Cosine Sim | Correct |
|---|---|---|---|
| white rice | rice | 0.933 | Yes |
| fried egg | egg | 0.914 | Yes |
| braised chicken | chicken duck | 0.837 | Yes |
| gravy | sauce | 0.823 | Yes |
| chili sauce | tomato | 0.817 | Yes |
| meat curry | sauce | 0.807 | Yes |
| fried tofu | pork | 0.804 | Yes |
The "fried tofu" to "pork" match illustrates the system's resilience. FoodSeg103 lacks a tofu category entirely, so the SegFormer labeled the tofu region as "pork" (visually similar fried protein). The 1-to-1 constraint from the Hungarian algorithm resolved this correctly: with no better match available for the tofu region, the assignment is forced and correct.
Cardinality mismatch
In practice, the VLM and SegFormer rarely agree on the number of items. The SegFormer tends to over-segment (finding 5-8 regions for a plate with 3 food items), while the VLM occasionally over-identifies (listing "spring onion" as a separate item when it is a garnish atop the rice).
Case A: SegFormer over-segments (most common)
When the SegFormer produces more regions than the VLM identifies food items, the Hungarian algorithm runs on the full N_vlm x N_seg cost matrix. Each VLM item is assigned exactly one SegFormer region under the optimal matching. Each unmatched SegFormer region is then merged into the VLM item with which it has the highest embedding similarity, and its pixel count is added to that item's proportion.
In our test with 3 VLM items and 5 SegFormer regions, the system correctly assigned rice (87.3%), curry (9.6%), and vegetables (3.1%), merging the extra "french fries" and "bread" regions (both SegFormer mislabels) into their closest VLM matches.
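The merge step for Case A might look like the following sketch. The function name `merge_unmatched_regions`, the similarity values, and the pixel counts are all hypothetical and chosen for readability; they are not the figures from the test above.

```python
import numpy as np


def merge_unmatched_regions(C, assignment, pixel_counts):
    """Fold unmatched segmentation regions into their most similar VLM item.

    C: (N_vlm, N_seg) cosine-similarity matrix.
    assignment: dict mapping vlm_index -> seg_index from the 1-to-1 matching.
    pixel_counts: length-N_seg sequence of pixels per segmentation region.
    Returns one area proportion per VLM item (summing to 1).
    """
    C = np.asarray(C, dtype=float)
    totals = np.zeros(C.shape[0])
    for vlm_idx, seg_idx in assignment.items():
        totals[vlm_idx] += pixel_counts[seg_idx]
    matched = set(assignment.values())
    for seg_idx in range(C.shape[1]):
        if seg_idx not in matched:
            closest_vlm = int(np.argmax(C[:, seg_idx]))   # most similar VLM item
            totals[closest_vlm] += pixel_counts[seg_idx]
    return totals / totals.sum()


# Illustrative inputs: 3 VLM items, 5 regions, regions 3 and 4 left unmatched.
C = [[0.93, 0.30, 0.25, 0.60, 0.20],
     [0.28, 0.81, 0.35, 0.40, 0.55],
     [0.22, 0.33, 0.85, 0.30, 0.45]]
proportions = merge_unmatched_regions(
    C, assignment={0: 0, 1: 1, 2: 2}, pixel_counts=[6000, 1000, 500, 2000, 500]
)
```

Here region 3 is folded into item 0 and region 4 into item 1, giving proportions of 0.80, 0.15, and 0.05.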
Case B: VLM over-identifies (less common)
When the VLM identifies more items than the SegFormer detected, the algorithm runs on the transposed matrix. Each SegFormer region gets its best-matching VLM item. Unmatched VLM items receive an equal share of their closest SegFormer region's pixels. This handles garnishes and small additions that the SegFormer grouped into a larger region.
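One possible reading of the Case B rule is sketched below: a region's pixels are divided equally among the item it was matched to and any unmatched items whose closest region it is. The text does not spell out the exact split, so this interpretation, the function name `split_shared_regions`, and the numbers are assumptions.

```python
import numpy as np


def split_shared_regions(C, assignment, pixel_counts):
    """Give unmatched VLM items an equal share of their closest region.

    C: (N_vlm, N_seg) cosine-similarity matrix.
    assignment: dict mapping seg_index -> vlm_index from the 1-to-1 matching.
    pixel_counts: length-N_seg sequence of pixels per segmentation region.
    Returns one area proportion per VLM item (summing to 1).
    """
    C = np.asarray(C, dtype=float)
    n_vlm = C.shape[0]
    # Each region starts with its matched item as the only claimant.
    claimants = {seg_idx: [vlm_idx] for seg_idx, vlm_idx in assignment.items()}
    matched_vlm = set(assignment.values())
    for vlm_idx in range(n_vlm):
        if vlm_idx not in matched_vlm:
            closest_seg = int(np.argmax(C[vlm_idx]))      # most similar region
            claimants.setdefault(closest_seg, []).append(vlm_idx)
    totals = np.zeros(n_vlm)
    for seg_idx, items in claimants.items():
        for vlm_idx in items:
            totals[vlm_idx] += pixel_counts[seg_idx] / len(items)
    return totals / totals.sum()


# Illustrative inputs: 3 VLM items, 2 regions; item 2 is unmatched.
C = [[0.93, 0.30],
     [0.35, 0.81],
     [0.50, 0.40]]
shares = split_shared_regions(C, assignment={0: 0, 1: 1}, pixel_counts=[8000, 2000])
```

In this example item 2 shares region 0 with item 0, so the 8,000 pixels are split evenly and the resulting shares are 0.4, 0.2, and 0.4.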
GTE-small vs alternatives
We evaluated GTE-small against MiniLM-L6-v2, the most commonly recommended lightweight embedding model. GTE-small won on both accuracy and practical deployment characteristics despite being larger.
GTE-small achieved 17 out of 20 exact matches on test queries including difficult entries (pho, jollof rice, quarter pounder). It matched the accuracy of mpnet-base-v2 -- a model six times its size -- while maintaining a 58ms lookup time against a 29,977-food nutrition database. MiniLM-L6-v2 produced lower similarity scores on food-specific queries, particularly for non-Western foods where its training distribution is thinner.
The total GTE-small footprint is 92MB: a 70MB model plus a 22MB precomputed embedding matrix for the nutrition database. This matrix is computed once at build time and loaded at inference, eliminating the need to embed 29,977 food names at runtime.
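A lookup against a precomputed matrix reduces to one matrix-vector product, as in the sketch below. The function name `top_k_matches` is hypothetical, random vectors stand in for real GTE-small embeddings, and float16 storage is an assumption: the dtype is not stated, but 29,977 x 384 values at 2 bytes each comes to roughly the 22MB quoted.

```python
import numpy as np


def top_k_matches(query_vec, db_matrix, k=3):
    """Return indices and cosine scores of the k nearest database rows.

    query_vec: (d,) embedding of the query food name.
    db_matrix: (N, d) precomputed, L2-normalized database embeddings.
    """
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    scores = db_matrix.astype(np.float32) @ q   # cosine similarity to every row
    top = np.argsort(-scores)[:k]               # highest similarity first
    return top, scores[top]


# Storage estimate, assuming float16 values (an assumption, see above).
n_foods, dim, bytes_per_value = 29_977, 384, 2
matrix_bytes = n_foods * dim * bytes_per_value  # 23,022,336 bytes, about 22 MiB

# Stand-in database: 100 random unit vectors instead of real food embeddings.
rng = np.random.default_rng(0)
db = rng.standard_normal((100, dim)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
db16 = db.astype(np.float16)                    # what would be stored at build time
top, top_scores = top_k_matches(db[42], db16, k=3)
```

Normalizing the rows once at build time means no per-row norm has to be computed during inference.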
On-device footprint
The complete model stack fits within approximately 1GB -- small enough for modern mobile devices to load all models simultaneously without memory pressure. Each component was selected for minimal size at acceptable accuracy.
| Component | Model | Parameters | Size | Role |
|---|---|---|---|---|
| Food identification | Qwen3.5-0.8B (LoRA, Q8) | 800M | ~850MB | Names food items from image |
| Segmentation | SegFormer-b0 FoodSeg103 | 3.7M | 14MB | Pixel-level food masks |
| Text matching | GTE-small | 33M | 92MB | Alignment + DB lookup |
| Weight estimation | R10b (CurveNet + ResNet50) | ~25M | 30MB | Total plate weight |
| Depth estimation | Depth Anything V2 Small | 24.8M | 25MB | Depth map for point cloud |
| Nutrition database | -- | 29,977 foods | 16MB | Caloric density per 100g |
| Total | | | ~1,027MB | |
Why this works
The approach succeeds because of three properties. First, food names occupy a dense, well-structured region of embedding space. "Chicken breast" and "chicken" are close. "Rice" and "white rice" are close. The vocabulary of food is finite and semantically clustered, making cosine similarity a strong signal.
Second, the 1-to-1 constraint from the Hungarian algorithm is self-correcting. Even when individual similarity scores are ambiguous -- "fried tofu" is moderately similar to both "pork" and "potato" -- the global optimal assignment resolves conflicts. With only 3-8 items per plate, the combinatorial space is trivially small.
Third, SegFormer's errors are systematic and predictable. It consistently labels tofu as "pork," sauce as "tomato," and mixed proteins as "chicken duck." These systematic errors produce stable embedding distances, so the 1-to-1 constraint of the Hungarian algorithm routes around them in the same way each time. Nothing is learned at this stage: the assignment is recomputed from scratch for every plate.
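The self-correcting effect of the 1-to-1 constraint can be seen by comparing it with a greedy per-row argmax. The similarity values below are hypothetical and chosen so that both VLM names prefer the same label.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical similarities. Rows: two VLM names; columns: two SegFormer labels.
C = np.array([[0.82, 0.70],
              [0.81, 0.80]])

# Greedy matching: each row independently picks its most similar column.
greedy = C.argmax(axis=1)

# Global matching: each column may be used only once.
rows, cols = linear_sum_assignment(C, maximize=True)
```

Greedy matching sends both rows to column 0, leaving column 1 unused. The global assignment instead pairs row 0 with column 0 and row 1 with column 1, since that total (1.62) beats the alternative (1.51).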
Limitations
- Visual area does not equal weight. A 100g steak and 100g of lettuce have very different visual areas. The system assumes proportional area-to-weight mapping, introducing systematic error for foods with extreme density differences. Density-weighted adjustment using per-food values from the nutrition database is a planned improvement.
- SegFormer vocabulary is limited to 103 classes. Unusual foods may be assigned to a visually similar but semantically different class. The embedding matching compensates, but edge cases exist where the wrong spatial region is selected.
- Mixed dishes cannot be spatially segmented. Curry, soup, and stew are correctly treated as single items by the VLM. The nutrition database provides per-dish caloric density for these cases.
- Small garnishes are filtered out. Items occupying less than 1% of food pixels fall below the SegFormer minimum threshold. Parsley, lemon wedges, and sesame seeds are excluded -- their caloric contribution is negligible.
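The area-proportional assumption and the garnish cutoff from the list above can be made concrete with a short sketch. The function name `distribute_calories`, the pixel counts, and the caloric densities are illustrative assumptions, not values from the nutrition database.

```python
def distribute_calories(pixel_counts, kcal_per_100g, total_weight_g, min_fraction=0.01):
    """Split a plate weight by visible area and convert each share to kcal.

    pixel_counts: dict of item name -> pixels assigned to that item.
    kcal_per_100g: dict of item name -> caloric density per 100g.
    total_weight_g: estimated weight of all food on the plate.
    min_fraction: items below this share of food pixels are dropped.
    Returns a dict of item name -> (weight in grams, kcal).
    """
    food_pixels = sum(pixel_counts.values())
    kept = {name: px for name, px in pixel_counts.items()
            if px / food_pixels >= min_fraction}
    kept_pixels = sum(kept.values())
    result = {}
    for name, px in kept.items():
        weight_g = total_weight_g * px / kept_pixels   # area share stands in for weight share
        result[name] = (weight_g, weight_g * kcal_per_100g[name] / 100.0)
    return result


# Made-up example values; parsley covers under 1% of food pixels and is dropped.
estimate = distribute_calories(
    pixel_counts={"rice": 6000, "curry": 4000, "parsley": 50},
    kcal_per_100g={"rice": 130.0, "curry": 150.0, "parsley": 36.0},
    total_weight_g=400.0,
)
```

With these inputs, rice receives 240g (312 kcal) and curry 160g (240 kcal). A density-weighted variant would rescale each area share by a per-food factor before normalizing.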
Comparison with alternatives
| Approach | Accuracy | Training Data | Extra Size |
|---|---|---|---|
| Equal split (baseline) | Poor | None | 0MB |
| VLM estimates proportions | Unknown | 50K+ labeled images | 0MB |
| SAM auto-mask | Poor (segments non-food) | None | 91MB |
| GroundingDINO + SAM | Poor (can't detect food) | None | 300MB |
| SegFormer + GTE (ours) | Good | None | 106MB |
Among the approaches compared here, ours is the only one that requires zero proportion-specific training data, uses only lightweight models suitable for mobile deployment, handles label mismatches between segmentation and identification, and resolves cardinality mismatches in both directions (SegFormer over-segmentation and VLM over-identification).