March 29, 2026

Cross-Model Alignment Without Retraining: Matching Food Labels Across Independent Models

We present a method for aligning outputs from three independently trained food recognition models without any joint training or fine-tuning. Using text embeddings and the Hungarian algorithm, we achieve 85% exact-match alignment between a VLM's natural language food descriptions and a semantic segmentation model's fixed vocabulary -- enabling accurate per-item weight distribution in multi-food images.

Independent models: 3

Exact match rate: 85%

Lookup time: 58ms

Total on-device: 1,027 MB

Retraining required: none

The alignment problem

Estimating per-item calorie content from a photograph of a multi-item meal requires three capabilities: identifying each food item, estimating the weight of each item, and looking up nutritional density. Our system uses three independently trained models for these tasks -- and they speak completely different languages.

The Qwen3.5-0.8B VLM (fine-tuned, ~850MB) sees the full plate and outputs natural language: "grilled chicken breast with herbs." The SegFormer-b0 model trained on FoodSeg103 (3.7M parameters, ~14MB) produces per-pixel semantic segmentation over 104 fixed category labels (103 food classes plus background): it sees the same plate and outputs class ID 47 ("chicken duck"). The GTE-small embedding model (33M parameters, ~92MB including its precomputed nutrition-database embeddings) bridges the gap by mapping both kinds of label into one text embedding space.

These models were trained on different datasets, with different objectives, using different label vocabularies. The VLM has never seen a SegFormer label. The SegFormer has never seen natural language. Yet we must determine that "grilled chicken breast with herbs" and "chicken duck" refer to the same spatial region on the plate.

Three independently trained models produce incompatible outputs that must be aligned without joint training.

The Hungarian algorithm approach

The alignment method is a four-step process. All computation happens in text embedding space -- no image features are shared between models, no joint training is required, and the algorithm is model-agnostic.

1. Encode all VLM food names using GTE-small to produce matrix V of shape (N_vlm x 384).

2. Encode all SegFormer region labels using GTE-small to produce matrix S of shape (N_seg x 384).

3. Compute the cosine similarity matrix C = V . S^T of shape (N_vlm x N_seg), with the rows of V and S L2-normalized so that the dot product equals cosine similarity. Each cell represents how semantically similar a VLM name is to a SegFormer label.

4. Apply the Hungarian algorithm to find the optimal 1-to-1 assignment that maximizes total similarity (equivalently, minimizes the cost matrix -C). With 3-8 items per plate, the assignment problem is trivially small and solves in microseconds via scipy.optimize.linear_sum_assignment.
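
The four steps can be sketched as follows. This is a minimal illustration, not the production pipeline: the 4-dimensional vectors are hand-made stand-ins for real 384-dimensional GTE-small embeddings, and the function name `align_labels` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_labels(V, S):
    """Return (C, pairs): cosine similarity matrix and 1-to-1 (vlm_idx, seg_idx) pairs."""
    # L2-normalise rows so the dot product equals cosine similarity.
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    C = Vn @ Sn.T  # shape (N_vlm, N_seg)
    # The solver minimises total cost, so pass the negated similarity.
    rows, cols = linear_sum_assignment(-C)
    return C, list(zip(rows.tolist(), cols.tolist()))


vlm_names = ["white rice", "fried egg", "braised chicken"]
seg_labels = ["egg", "chicken duck", "rice"]

# Toy stand-in embeddings (assumption): one row per name.
V = np.array([[1.0, 0.1, 0.0, 0.0],
              [0.0, 1.0, 0.2, 0.0],
              [0.1, 0.0, 1.0, 0.3]])
S = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.2],
              [1.0, 0.0, 0.0, 0.0]])

C, pairs = align_labels(V, S)
matches = {vlm_names[i]: seg_labels[j] for i, j in pairs}
print(matches)
```

When the matrix is rectangular, `linear_sum_assignment` returns min(N_vlm, N_seg) pairs, so some rows or columns stay unmatched and need separate handling.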

The key insight is that semantic similarity between food names is sufficient for alignment even when exact labels differ. "White rice" and "rice" produce a cosine similarity of 0.933. "Meat curry" and "sauce" produce 0.807 -- because the SegFormer labeled the curry region as "sauce," which is semantically close enough for the 1-to-1 constraint to resolve correctly.

Cosine similarity matrix between VLM names (rows) and SegFormer labels (columns). The Hungarian algorithm selects the optimal 1-to-1 assignment (highlighted cells).

Validation results

We tested alignment on multi-item food photographs from MM-Food-100K. The system correctly matched all primary food items (those occupying more than 5% of plate pixels) in every test case. Of 20 test queries, 17 achieved exact match -- an 85% rate. The 3 mismatches involved minor garnishes and shared regions, each contributing less than 3% of total plate area.

VLM Name	SegFormer Label	Cosine Sim	Correct
white rice	rice	0.933	Yes
fried egg	egg	0.914	Yes
braised chicken	chicken duck	0.837	Yes
gravy	sauce	0.823	Yes
chili sauce	tomato	0.817	Yes
meat curry	sauce	0.807	Yes
fried tofu	pork	0.804	Yes

The "fried tofu" to "pork" match illustrates the system's resilience. FoodSeg103 lacks a tofu category entirely, so the SegFormer labeled the tofu region as "pork" (visually similar fried protein). The 1-to-1 constraint from the Hungarian algorithm resolved this correctly: with no better match available for the tofu region, the assignment is forced and correct.

Cardinality mismatch

In practice, the VLM and SegFormer rarely agree on the number of items. The SegFormer tends to over-segment (finding 5-8 regions for a plate with 3 food items), while the VLM occasionally over-identifies (listing "spring onion" as a separate item when it is a garnish atop the rice).

Case A: SegFormer over-segments (most common)

When the SegFormer produces more regions than the VLM identifies food items, the Hungarian algorithm runs on the full N_vlm x N_seg cost matrix. Each VLM item gets its best-matching SegFormer region. Unmatched SegFormer regions are merged into the VLM item with highest embedding similarity, and their pixel counts are added to that item's proportion.

In our test with 3 VLM items and 5 SegFormer regions, the system correctly assigned rice (87.3%), curry (9.6%), and vegetables (3.1%), merging the extra "french fries" and "bread" regions (both SegFormer mislabels) into their closest VLM matches.
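
The merge rule for Case A can be sketched as below. The similarity values and pixel counts are made up for illustration (they are not the figures from the test above), and `merge_unmatched_regions` is a hypothetical helper that takes the 1-to-1 assignment as an input.

```python
def merge_unmatched_regions(sim, matched, pixels):
    """Fold unmatched SegFormer regions into their most similar VLM item.

    sim[i][j]: similarity of VLM item i to SegFormer region j.
    matched:   {vlm_idx: seg_idx} from the 1-to-1 assignment.
    pixels:    pixel count per SegFormer region.
    Returns each VLM item's share of the total food pixels.
    """
    n_vlm = len(sim)
    totals = [0] * n_vlm
    for i, j in matched.items():
        totals[i] += pixels[j]
    matched_regions = set(matched.values())
    for j, count in enumerate(pixels):
        if j in matched_regions:
            continue
        # Leftover region: credit it to the VLM item with highest similarity.
        best = max(range(n_vlm), key=lambda i: sim[i][j])
        totals[best] += count
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Rows: rice, curry, vegetables. Columns: rice, sauce, vegetable, french fries, bread.
sim = [[0.93, 0.70, 0.72, 0.78, 0.80],
       [0.75, 0.81, 0.74, 0.76, 0.73],
       [0.72, 0.71, 0.90, 0.79, 0.74]]
matched = {0: 0, 1: 1, 2: 2}
pixels = [6000, 2000, 1000, 600, 400]

proportions = merge_unmatched_regions(sim, matched, pixels)
print(proportions)
```

In this toy input the "french fries" region is merged into vegetables and the "bread" region into rice, because those are the highest values in their respective columns.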

Case B: VLM over-identifies (less common)

When the VLM identifies more items than the SegFormer detected, the algorithm runs on the transposed matrix. Each SegFormer region gets its best-matching VLM item. Unmatched VLM items receive an equal share of their closest SegFormer region's pixels. This handles garnishes and small additions that the SegFormer grouped into a larger region.
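
One possible reading of the Case B rule is sketched below. It assumes that "an equal share" means the region's pixels are divided evenly among every VLM item attached to it (the matched item plus any unmatched items whose closest region it is); the source does not spell out the split. The numbers and the name `split_shared_regions` are hypothetical.

```python
def split_shared_regions(sim, matched, pixels):
    """Attribute region pixels to VLM items when there are more items than regions.

    sim[i][j]: similarity of VLM item i to SegFormer region j.
    matched:   {vlm_idx: seg_idx} from the 1-to-1 assignment.
    pixels:    pixel count per SegFormer region.
    Returns the pixel count attributed to each VLM item.
    """
    n_vlm, n_seg = len(sim), len(pixels)
    # Each region starts with its matched VLM item as the only claimant.
    claimants = {j: [] for j in range(n_seg)}
    for i, j in matched.items():
        claimants[j].append(i)
    # Each unmatched VLM item joins its closest region.
    for i in range(n_vlm):
        if i not in matched:
            closest = max(range(n_seg), key=lambda j: sim[i][j])
            claimants[closest].append(i)
    shares = [0.0] * n_vlm
    for j, items in claimants.items():
        for i in items:
            # Assumed rule: split the region evenly among its claimants.
            shares[i] += pixels[j] / len(items)
    return shares


# Rows: rice, spring onion, curry. Columns: rice region, sauce region.
sim = [[0.93, 0.70],
       [0.78, 0.70],
       [0.72, 0.81]]
matched = {0: 0, 2: 1}
pixels = [8000, 2000]

shares = split_shared_regions(sim, matched, pixels)
print(shares)  # [4000.0, 4000.0, 2000.0]
```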

Cardinality mismatch resolution. Left: SegFormer over-segments (5 regions, 3 foods). Right: VLM over-identifies (5 items, 5 regions). Both resolve correctly.

GTE-small vs alternatives

We evaluated GTE-small against MiniLM-L6-v2, the most commonly recommended lightweight embedding model. GTE-small won on both accuracy and practical deployment characteristics despite being larger.

GTE-small vs MiniLM-L6-v2 on food label matching. GTE-small matched mpnet-base accuracy at one-sixth the size.

GTE-small achieved 17 out of 20 exact matches on test queries including difficult entries (pho, jollof rice, quarter pounder). It matched the accuracy of mpnet-base-v2 -- a model six times its size -- while maintaining a 58ms lookup time against a 29,977-food nutrition database. MiniLM-L6-v2 produced lower similarity scores on food-specific queries, particularly for non-Western foods where its training distribution is thinner.
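
A nearest-neighbour lookup against a precomputed matrix can be sketched as below. The 3-dimensional vectors are toy stand-ins for GTE-small embeddings, the three-row matrix stands in for the 29,977-row database, and `lookup` is a hypothetical name.

```python
import numpy as np


def lookup(query_vec, db_matrix, names, k=2):
    """Return the k database names most similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    # Rows of db_matrix are assumed to be unit-length already,
    # so one matrix-vector product gives all cosine similarities.
    scores = db_matrix @ q
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]


names = ["pho", "jollof rice", "quarter pounder"]
db = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
query = np.array([0.1, 0.9, 0.2])  # toy embedding of a "jollof rice" query

results = lookup(query, db, names)
print(results)
```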

The total GTE-small footprint is 92MB: a 70MB model plus a 22MB precomputed embedding matrix for the nutrition database. This matrix is computed once at build time and loaded at inference, eliminating the need to embed 29,977 food names at runtime.
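
As a rough check on the matrix size, the arithmetic below compares 16-bit and 32-bit storage. The storage precision is not stated in this article; the 22MB figure is consistent with 16-bit values, which is an inference rather than a confirmed detail.

```python
def matrix_bytes(n_rows, n_dims, bytes_per_value):
    """Raw size of a dense embedding matrix, ignoring file headers."""
    return n_rows * n_dims * bytes_per_value


N_FOODS, DIMS = 29_977, 384

fp16_bytes = matrix_bytes(N_FOODS, DIMS, 2)  # assumed 16-bit storage
fp32_bytes = matrix_bytes(N_FOODS, DIMS, 4)  # 32-bit, for comparison

fp16_mib = fp16_bytes / 2**20
fp32_mib = fp32_bytes / 2**20
print(f"float16: {fp16_mib:.1f} MiB, float32: {fp32_mib:.1f} MiB")
# float16: 22.0 MiB, float32: 43.9 MiB
```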

On-device footprint

The complete model stack fits within approximately 1GB -- small enough for modern mobile devices to load all models simultaneously without memory pressure. Each component was selected for minimal size at acceptable accuracy.

Component	Model	Parameters	Size	Role
Food identification	Qwen3.5-0.8B (LoRA, Q8)	800M	~850MB	Names food items from image
Segmentation	SegFormer-b0 FoodSeg103	3.7M	14MB	Pixel-level food masks
Text matching	GTE-small	33M	92MB	Alignment + DB lookup
Weight estimation	R10b (CurveNet + ResNet50)	~25M	30MB	Total plate weight
Depth estimation	Depth Anything V2 Small	24.8M	25MB	Depth map for point cloud
Nutrition database	--	29,977 foods	16MB	Caloric density per 100g
Total			~1,027MB

On-device model footprint breakdown. The VLM dominates at 850MB; supporting models add only 177MB total.
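
The totals in the table and caption can be reproduced directly from the per-component sizes:

```python
# Sizes in MB, copied from the footprint table.
sizes_mb = {
    "Qwen3.5-0.8B VLM": 850,
    "SegFormer-b0": 14,
    "GTE-small + embedding matrix": 92,
    "R10b weight estimator": 30,
    "Depth Anything V2 Small": 25,
    "Nutrition database": 16,
}

total_mb = sum(sizes_mb.values())
supporting_mb = total_mb - sizes_mb["Qwen3.5-0.8B VLM"]
print(total_mb, supporting_mb)  # 1027 177
```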

Why this works

The approach succeeds because of three properties. First, food names occupy a dense, well-structured region of embedding space. "Chicken breast" and "chicken" are close. "Rice" and "white rice" are close. The vocabulary of food is finite and semantically clustered, making cosine similarity a strong signal.

Second, the 1-to-1 constraint from the Hungarian algorithm is self-correcting. Even when individual similarity scores are ambiguous -- "fried tofu" is moderately similar to both "pork" and "potato" -- the global optimal assignment resolves conflicts. With only 3-8 items per plate, the combinatorial space is trivially small.

Third, SegFormer's errors are systematic and predictable. It consistently labels tofu as "pork," sauce as "tomato," and mixed proteins as "chicken duck." These systematic errors produce stable embedding distances, so the 1-to-1 constraint of the Hungarian algorithm routes around them without any learning step.
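
The self-correcting effect of the 1-to-1 constraint (the second property) shows up in a small comparison against per-row greedy matching. The similarity values are invented for illustration. Because plates have so few items, this sketch uses exhaustive search as a stand-in for the Hungarian algorithm: it reaches the same optimum, but it is not the solver named earlier.

```python
from itertools import permutations


def greedy_match(sim):
    """Each row independently picks its highest-scoring column."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in sim]


def optimal_match(sim):
    """Exhaustive 1-to-1 assignment maximising total similarity (rows <= cols)."""
    n_rows, n_cols = len(sim), len(sim[0])
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(sim[i][j] for i, j in enumerate(cols)),
    )
    return list(best)


# Rows: "fried tofu", "potato wedges". Columns: "pork", "potato".
sim = [[0.80, 0.81],
       [0.70, 0.92]]

greedy = greedy_match(sim)    # both rows pick "potato": a collision
optimal = optimal_match(sim)  # tofu -> pork, wedges -> potato
print(greedy, optimal)  # [1, 1] [0, 1]
```

Greedy matching sends both items to "potato" because 0.81 narrowly beats 0.80 for the tofu row. The global assignment gives up that 0.01 to gain 0.22 on the other row.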

Rather than training one monolithic model that does everything poorly, we compose specialists that each excel at their task. The alignment layer is the glue -- a lightweight embedding bridge that lets independently trained models collaborate without ever sharing a gradient.

Limitations

Known constraints of the alignment approach:

Visual area does not equal weight. A 100g steak and 100g of lettuce have very different visual areas. The system assumes proportional area-to-weight mapping, introducing systematic error for foods with extreme density differences. Density-weighted adjustment using per-food values from the nutrition database is a planned improvement.
SegFormer vocabulary is limited to 103 classes. Unusual foods may be assigned to a visually similar but semantically different class. The embedding matching compensates, but edge cases exist where the wrong spatial region is selected.
Mixed dishes cannot be spatially segmented. Curry, soup, and stew are correctly treated as single items by the VLM. The nutrition database provides per-dish caloric density for these cases.
Small garnishes are filtered out. Items occupying less than 1% of food pixels fall below the SegFormer minimum threshold. Parsley, lemon wedges, and sesame seeds are excluded -- their caloric contribution is negligible.
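
The proportional area-to-weight assumption named in the first limitation amounts to the calculation below. The area shares are those reported for the Case A test; the 400 g plate weight and the kcal-per-100g values are placeholders, not figures from the system or its nutrition database.

```python
def distribute_weight(total_grams, area_shares):
    """Split total plate weight in proportion to each item's pixel share."""
    total_share = sum(area_shares.values())
    return {
        name: total_grams * share / total_share
        for name, share in area_shares.items()
    }


def total_kcal(weights_g, kcal_per_100g):
    """Sum calories from per-item weights and per-100g caloric density."""
    return sum(weights_g[name] * kcal_per_100g[name] / 100 for name in weights_g)


area_shares = {"rice": 87.3, "curry": 9.6, "vegetables": 3.1}
weights = distribute_weight(400.0, area_shares)  # 400 g is hypothetical

# Placeholder densities for illustration only.
kcal_per_100g = {"rice": 130.0, "curry": 150.0, "vegetables": 50.0}
plate_kcal = total_kcal(weights, kcal_per_100g)
print(weights, plate_kcal)
```

A density-weighted variant would rescale each share by a per-food factor before normalising, which is the planned improvement mentioned above.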

Comparison with alternatives

Approach	Accuracy	Training Data	Extra Size
Equal split (baseline)	Poor	None	0MB
VLM estimates proportions	Unknown	50K+ labeled images	0MB
SAM auto-mask	Poor (segments non-food)	None	91MB
GroundingDINO + SAM	Poor (can't detect food)	None	300MB
SegFormer + GTE (ours)	Good	None	106MB

Among the approaches compared, ours is the only one that requires zero proportion-specific training data, uses only lightweight models suitable for mobile deployment, handles label mismatches between segmentation and identification, and resolves cardinality mismatches in both directions (SegFormer over-segmenting or VLM over-identifying).

Doses AI Research · March 2026