March 29, 2026

Cross-Model Alignment Without Retraining: Matching Food Labels Across Independent Models

We present a method for aligning outputs from three independently trained food recognition models without any joint training or fine-tuning. Using text embeddings and the Hungarian algorithm, we achieve 85% exact-match alignment between a VLM's natural language food descriptions and a semantic segmentation model's fixed vocabulary -- enabling accurate per-item weight distribution in multi-food images.

Independent models: 3
Exact match rate: 85%
Lookup time: 58ms
Total on-device size: 1,027MB
Retraining required: 0

The alignment problem

Estimating per-item calorie content from a photograph of a multi-item meal requires three capabilities: identifying each food item, estimating the weight of each item, and looking up nutritional density. Our system uses three independently trained models for these tasks -- and they speak completely different languages.

The Qwen3.5-0.8B VLM (fine-tuned, ~850MB) sees the full plate and outputs natural language: "grilled chicken breast with herbs." The SegFormer-b0 FoodSeg103 (3.7M parameters, ~14MB) produces per-pixel semantic segmentation with 104 fixed category labels (103 food classes plus background) -- it sees the same plate and outputs class ID 47 ("chicken duck"). The GTE-small embedding model (33M parameters, ~92MB total including its precomputed lookup matrix) bridges the gap.

These models were trained on different datasets, with different objectives, using different label vocabularies. The VLM has never seen a SegFormer label. The SegFormer has never seen natural language. Yet we must determine that "grilled chicken breast with herbs" and "chicken duck" refer to the same spatial region on the plate.

Three independently trained models produce incompatible outputs that must be aligned without joint training.

The Hungarian algorithm approach

The alignment method is a four-step process. All computation happens in text embedding space -- no image features are shared between models, no joint training is required, and the algorithm is model-agnostic.

1. Encode all VLM food names using GTE-small to produce matrix V of shape (N_vlm x 384).
2. Encode all SegFormer region labels using GTE-small to produce matrix S of shape (N_seg x 384).
3. Compute the cosine similarity matrix C = V . S^T of shape (N_vlm x N_seg), with the rows of V and S L2-normalized so that the dot product equals cosine similarity. Each cell represents how semantically similar a VLM name is to a SegFormer label.
4. Apply the Hungarian algorithm to find the optimal 1-to-1 assignment that maximizes total similarity (equivalently, minimizes the negated similarity used as cost). With 3-8 items per plate, the assignment problem is trivially small and takes microseconds to solve via scipy.optimize.linear_sum_assignment.

The key insight is that semantic similarity between food names is sufficient for alignment even when exact labels differ. "White rice" and "rice" produce a cosine similarity of 0.933. "Meat curry" and "sauce" produce 0.807 -- because the SegFormer labeled the curry region as "sauce," which is semantically close enough for the 1-to-1 constraint to resolve correctly.
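The four steps can be sketched in a few lines of Python. This is a minimal sketch rather than the production implementation: the function name `align` is illustrative, the embeddings are assumed to have been computed beforehand (for example with GTE-small), and the toy vectors are 3-dimensional stand-ins for real 384-dimensional embeddings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align(V: np.ndarray, S: np.ndarray):
    """Match VLM names to SegFormer labels from their text embeddings.

    V: (N_vlm, d) embeddings of VLM food names.
    S: (N_seg, d) embeddings of SegFormer region labels.
    Returns the similarity matrix C and a list of (vlm_idx, seg_idx) pairs.
    """
    # L2-normalize rows so the dot product below is cosine similarity.
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    C = V @ S.T  # shape (N_vlm, N_seg)

    # linear_sum_assignment minimizes cost, so pass the negated similarity.
    rows, cols = linear_sum_assignment(-C)
    return C, list(zip(rows.tolist(), cols.tolist()))


# Toy vectors (hypothetical values, not real GTE-small output).
V = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
S = np.array([[0.1, 0.9, 0.1], [0.9, 0.0, 0.1], [0.0, 0.1, 1.0]])
C, pairs = align(V, S)
print(pairs)
```

Because the cost matrix may be rectangular, the same call covers plates where the two models disagree on the number of items; the side with more entries simply keeps some unassigned.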

Cosine similarity matrix between VLM names (rows) and SegFormer labels (columns). The Hungarian algorithm selects the optimal 1-to-1 assignment (highlighted cells).

Validation results

We tested alignment on multi-item food photographs from MM-Food-100K. The system correctly matched all primary food items (those occupying more than 5% of plate pixels) in every test case. Of 20 test queries, 17 achieved exact match -- an 85% rate. The 3 mismatches involved minor garnishes and shared regions, each contributing less than 3% of total plate area.

VLM Name | SegFormer Label | Cosine Sim | Correct
white rice | rice | 0.933 | Yes
fried egg | egg | 0.914 | Yes
braised chicken | chicken duck | 0.837 | Yes
gravy | sauce | 0.823 | Yes
chili sauce | tomato | 0.817 | Yes
meat curry | sauce | 0.807 | Yes
fried tofu | pork | 0.804 | Yes

The "fried tofu" to "pork" match illustrates the system's resilience. FoodSeg103 lacks a tofu category entirely, so the SegFormer labeled the tofu region as "pork" (visually similar fried protein). The 1-to-1 constraint from the Hungarian algorithm resolved this correctly: with no better match available for the tofu region, the assignment is forced and correct.
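A small example shows how the 1-to-1 constraint resolves a conflict that a greedy per-row match would get wrong. The sketch below treats "braised chicken" and "fried tofu" as if they appeared on the same plate, which is an assumption made for illustration. The diagonal similarities are the ones reported in the table above; the two off-diagonal values are hypothetical, chosen so that both names prefer the same label.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

vlm_names = ["braised chicken", "fried tofu"]
seg_labels = ["chicken duck", "pork"]

# Diagonal entries are the similarities reported in the table above;
# the off-diagonal entries (0.790, 0.810) are hypothetical values.
C = np.array([
    [0.837, 0.790],
    [0.810, 0.804],
])

# Greedy per-row argmax sends both VLM names to the same label.
greedy = C.argmax(axis=1).tolist()

# The 1-to-1 assignment maximizes total similarity instead.
rows, cols = linear_sum_assignment(C, maximize=True)
matched = {vlm_names[r]: seg_labels[c] for r, c in zip(rows, cols)}

print(greedy)
print(matched)
```

Under these assumed values, the greedy match assigns both names to "chicken duck", while the global assignment pairs "fried tofu" with "pork" because that pairing yields the higher total similarity.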

Cardinality mismatch

In practice, the VLM and SegFormer rarely agree on the number of items. The SegFormer tends to over-segment (finding 5-8 regions for a plate with 3 food items), while the VLM occasionally over-identifies (listing "spring onion" as a separate item when it is a garnish atop the rice).

Case A: SegFormer over-segments (most common)

When the SegFormer produces more regions than the VLM identifies food items, the Hungarian algorithm runs on the full N_vlm x N_seg cost matrix. Each VLM item gets its best-matching SegFormer region. Unmatched SegFormer regions are merged into the VLM item with highest embedding similarity, and their pixel counts are added to that item's proportion.

In our test with 3 VLM items and 5 SegFormer regions, the system correctly assigned rice (87.3%), curry (9.6%), and vegetables (3.1%), merging the extra "french fries" and "bread" regions (both SegFormer mislabels) into their closest VLM matches.
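The merge step for Case A might look like the following sketch. The function name, the similarity values, and the pixel counts are all hypothetical; only the procedure (assign 1-to-1, then fold each unmatched region into its most similar VLM item) follows the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def proportions_over_segmented(C: np.ndarray, seg_pixels: np.ndarray) -> np.ndarray:
    """Per-VLM-item pixel share when N_seg >= N_vlm.

    C: (N_vlm, N_seg) cosine similarity matrix.
    seg_pixels: (N_seg,) pixel count of each SegFormer region.
    """
    n_vlm, n_seg = C.shape
    rows, cols = linear_sum_assignment(C, maximize=True)

    # owner[j] = index of the VLM item that region j contributes to.
    owner = np.full(n_seg, -1)
    owner[cols] = rows
    # Unmatched regions are merged into the most similar VLM item.
    unmatched = owner < 0
    owner[unmatched] = C[:, unmatched].argmax(axis=0)

    totals = np.bincount(owner, weights=seg_pixels, minlength=n_vlm)
    return totals / totals.sum()


# Hypothetical plate: 3 VLM items (rows) and 5 SegFormer regions (columns).
C = np.array([
    [0.93, 0.40, 0.35, 0.50, 0.60],
    [0.45, 0.81, 0.40, 0.42, 0.38],
    [0.38, 0.41, 0.85, 0.44, 0.36],
])
seg_pixels = np.array([700, 100, 50, 100, 50])
shares = proportions_over_segmented(C, seg_pixels)
print(shares)
```

In this made-up example the two extra regions are both closest to the first item, so their pixels are added to its share before normalizing.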

Case B: VLM over-identifies (less common)

When the VLM identifies more items than the SegFormer detected, the algorithm runs on the transposed matrix. Each SegFormer region gets its best-matching VLM item. Unmatched VLM items receive an equal share of their closest SegFormer region's pixels. This handles garnishes and small additions that the SegFormer grouped into a larger region.
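Case B can be sketched the same way. The text does not spell out exactly how the "equal share" is computed, so the sketch assumes one reading: every VLM item attached to a region, matched or not, receives the same fraction of that region's pixels. The function name and all numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def proportions_over_identified(C: np.ndarray, seg_pixels: np.ndarray) -> np.ndarray:
    """Per-VLM-item pixel share when N_vlm >= N_seg.

    C: (N_vlm, N_seg) cosine similarity matrix.
    seg_pixels: (N_seg,) pixel count of each SegFormer region.
    """
    n_vlm, n_seg = C.shape
    # Run the assignment on the transposed matrix: one VLM item per region.
    seg_idx, vlm_idx = linear_sum_assignment(C.T, maximize=True)

    # region_of[i] = region that VLM item i draws its pixels from.
    region_of = np.full(n_vlm, -1)
    region_of[vlm_idx] = seg_idx
    # Unmatched VLM items attach to their most similar region.
    unmatched = region_of < 0
    region_of[unmatched] = C[unmatched].argmax(axis=1)

    # Assumed reading of "equal share": split each region's pixels
    # equally among all items attached to it.
    items_per_region = np.bincount(region_of, minlength=n_seg)
    totals = seg_pixels[region_of] / items_per_region[region_of]
    return totals / totals.sum()


# Hypothetical plate: white rice, meat curry, spring onion vs. rice, sauce.
C = np.array([
    [0.93, 0.40],
    [0.45, 0.81],
    [0.55, 0.35],
])
seg_pixels = np.array([800, 200])
shares = proportions_over_identified(C, seg_pixels)
print(shares)
```

Here the unmatched "spring onion" attaches to the rice region and splits its pixels with "white rice". An equal split likely overstates a garnish, so a production system might weight the split differently.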

Cardinality mismatch resolution. Left: SegFormer over-segments (5 regions, 3 foods). Right: VLM over-identifies (5 items, 5 regions). Both resolve correctly.

GTE-small vs alternatives

We evaluated GTE-small against MiniLM-L6-v2, the most commonly recommended lightweight embedding model. GTE-small won on both accuracy and practical deployment characteristics despite being larger.

GTE-small vs MiniLM-L6-v2 on food label matching. GTE-small matched mpnet-base accuracy at one-sixth the size.

GTE-small achieved 17 out of 20 exact matches on test queries including difficult entries (pho, jollof rice, quarter pounder). It matched the accuracy of mpnet-base-v2 -- a model six times its size -- while maintaining a 58ms lookup time against a 29,977-food nutrition database. MiniLM-L6-v2 produced lower similarity scores on food-specific queries, particularly for non-Western foods where its training distribution is thinner.

The total GTE-small footprint is 92MB: a 70MB model plus a 22MB precomputed embedding matrix for the nutrition database. This matrix is computed once at build time and loaded at inference, eliminating the need to embed 29,977 food names at runtime.
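A lookup against a precomputed matrix reduces to one matrix-vector product followed by a sort. The sketch below is illustrative only: the `lookup` function, the three-entry toy database, and the float16 storage format are assumptions, not details stated in the text. The float16 assumption is at least consistent with the reported size, since 29,977 x 384 values at 2 bytes each come to roughly 22MB.

```python
import numpy as np


def lookup(query_vec: np.ndarray, db_matrix: np.ndarray, names: list[str], k: int = 1):
    """Return the k database names most similar to the query embedding.

    db_matrix: (N_foods, d) precomputed, L2-normalized embeddings.
    query_vec: (d,) embedding of the query food name.
    """
    q = query_vec / np.linalg.norm(query_vec)
    # Cast to float32 so the product does not accumulate in half precision.
    scores = db_matrix.astype(np.float32) @ q.astype(np.float32)
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]


# Toy database with hypothetical 3-dimensional embeddings.
names = ["rice", "egg", "tofu"]
db = np.eye(3, dtype=np.float16)
best = lookup(np.array([0.1, 0.9, 0.2]), db, names)
print(best[0][0])

# Storage estimate under the float16 assumption (2 bytes per value).
matrix_bytes = 29_977 * 384 * 2
print(matrix_bytes / 2**20)  # size in MiB
```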

On-device footprint

The complete model stack fits within approximately 1GB -- small enough for modern mobile devices to load all models simultaneously without memory pressure. Each component was selected for minimal size at acceptable accuracy.

Component | Model | Parameters | Size | Role
Food identification | Qwen3.5-0.8B (LoRA, Q8) | 800M | ~850MB | Names food items from image
Segmentation | SegFormer-b0 FoodSeg103 | 3.7M | 14MB | Pixel-level food masks
Text matching | GTE-small | 33M | 92MB | Alignment + DB lookup
Weight estimation | R10b (CurveNet + ResNet50) | ~25M | 30MB | Total plate weight
Depth estimation | Depth Anything V2 Small | 24.8M | 25MB | Depth map for point cloud
Nutrition database | -- | 29,977 foods | 16MB | Caloric density per 100g
Total | | | ~1,027MB |
On-device model footprint breakdown. The VLM dominates at 850MB; supporting models add only 177MB total.

Why this works

The approach succeeds because of three properties. First, food names occupy a dense, well-structured region of embedding space. "Chicken breast" and "chicken" are close. "Rice" and "white rice" are close. The vocabulary of food is finite and semantically clustered, making cosine similarity a strong signal.

Second, the 1-to-1 constraint from the Hungarian algorithm is self-correcting. Even when individual similarity scores are ambiguous -- "fried tofu" is moderately similar to both "pork" and "potato" -- the global optimal assignment resolves conflicts. With only 3-8 items per plate, the combinatorial space is trivially small.

Third, SegFormer's errors are systematic and predictable. It consistently labels tofu as "pork," sauce as "tomato," and mixed proteins as "chicken duck." These systematic errors produce stable embedding distances, so the 1-to-1 constraint of the Hungarian algorithm routes around them; the algorithm itself learns nothing and is simply re-solved for each plate.

Rather than training one monolithic model that does everything poorly, we compose specialists that each excel at their task. The alignment layer is the glue -- a lightweight embedding bridge that lets independently trained models collaborate without ever sharing a gradient.

Limitations

Known constraints of the alignment approach:
  • Visual area does not equal weight. A 100g steak and 100g of lettuce have very different visual areas. The system assumes proportional area-to-weight mapping, introducing systematic error for foods with extreme density differences. Density-weighted adjustment using per-food values from the nutrition database is a planned improvement.
  • SegFormer vocabulary is limited to 103 classes. Unusual foods may be assigned to a visually similar but semantically different class. The embedding matching compensates, but edge cases exist where the wrong spatial region is selected.
  • Mixed dishes cannot be spatially segmented. Curry, soup, and stew are correctly treated as single items by the VLM. The nutrition database provides per-dish caloric density for these cases.
  • Small garnishes are filtered out. Items occupying less than 1% of food pixels fall below the SegFormer minimum threshold. Parsley, lemon wedges, and sesame seeds are excluded -- their caloric contribution is negligible.
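The first limitation can be made concrete with a short calculation. The sketch below applies the proportional area-to-weight mapping described above and then converts weight to calories. The function name is illustrative, and every number (area shares, total weight, caloric densities) is a hypothetical placeholder rather than a measured value.

```python
def per_item_calories(area_share, total_weight_g, kcal_per_100g):
    """Split total plate weight by pixel share, then convert to calories.

    area_share: dict of item name -> fraction of food pixels (sums to 1).
    total_weight_g: estimated total plate weight in grams.
    kcal_per_100g: dict of item name -> caloric density.
    """
    result = {}
    for name, share in area_share.items():
        weight_g = total_weight_g * share  # proportional area-to-weight mapping
        result[name] = weight_g * kcal_per_100g[name] / 100.0
    return result


# Hypothetical plate: 100g of steak and 100g of lettuce, where the lettuce
# covers 80% of the food pixels. Densities are placeholder values.
density = {"steak": 250.0, "lettuce": 15.0}
estimate = per_item_calories({"steak": 0.2, "lettuce": 0.8}, 200.0, density)
actual = per_item_calories({"steak": 0.5, "lettuce": 0.5}, 200.0, density)
print(sum(estimate.values()), sum(actual.values()))
```

With these placeholder numbers the area-based split attributes only 40g to the steak, so the plate total is underestimated by more than half. A density-weighted adjustment would aim to close that gap.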

Comparison with alternatives

Approach | Accuracy | Training Data | Extra Size
Equal split (baseline) | Poor | None | 0MB
VLM estimates proportions | Unknown | 50K+ labeled images | 0MB
SAM auto-mask | Poor (segments non-food) | None | 91MB
GroundingDINO + SAM | Poor (can't detect food) | None | 300MB
SegFormer + GTE (ours) | Good | None | 106MB

Among the approaches compared, ours is the only one that requires zero proportion-specific training data, uses only lightweight models suitable for mobile deployment, correctly handles label mismatches between segmentation and identification, and resolves cardinality mismatches in both directions (SegFormer over-segmentation and VLM over-identification).

Doses AI Research · March 2026