Flagship Model Wisteria-122B-A10B
May 2026  ·  10 min read

Frontier-scale language models that run on the customer's own hardware.

Live capture: Wisteria-122B-A10B running on a MacBook M3 with 18 GB RAM through trellis.cpp — our custom on-device inference engine. Greedy decoding, fully offline.

Doses AI is starting with 100-billion-parameter mixture-of-experts language models at 1.58-bit ternary precision - small enough to run on a MacBook, accurate enough to matter for hospitals, banks, government agencies, and other organisations whose data has to stay on their own systems.

Wisteria-122B-A10B is our first model. 122 billion total parameters, 10 billion active per token, 24 GB on disk. 8 tokens / sec on an 18 GB MacBook M3. 10,758 tokens / sec on a single H100. 5 billion tokens of continual pretraining - 5% of the planned 100 billion.

To our knowledge, as of May 2026, no public technical report has shown stable mixture-of-experts recovery at this scale under ternary compression. Prior public ternary work - Microsoft's BitNet b1.58, TII's Falcon-Edge family - tops out around ten billion parameters on dense architectures. The sections below walk through how the recovery worked, what the routing looks like, how fast inference runs, where the capability sits today, and what's left.

1.58 bits / weight  ·  122B total params  ·  24 GB on disk  ·  8 tok/s on a MacBook M3

Ternary recovery

Standard language models store their weights as 16-bit floating-point numbers. Wisteria stores each weight as one of three values - {-1, 0, +1} - using 1.58 bits per weight. This is the same representation Microsoft used in their BitNet b1.58 paper and TII used in their Falcon-Edge family. To our knowledge, as of May 2026, no public technical report has applied this representation to a mixture-of-experts architecture above ten billion parameters. Wisteria runs it across 122 billion.
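As an illustration of the representation, here is a minimal sketch of absmean ternary quantisation, the scheme described in the BitNet b1.58 paper. This article does not say which quantiser Wisteria uses, so the function name `ternary_quantise`, the per-tensor scale, and the toy weight matrix are all assumptions made for the example.

```python
import numpy as np

def ternary_quantise(w: np.ndarray, eps: float = 1e-8):
    """Absmean ternary quantisation (hypothetical helper, not Wisteria's code).

    Returns (q, scale) with q in {-1, 0, +1} so that w is approximated by q * scale.
    """
    # One scale for the whole tensor: the mean absolute weight.
    scale = float(np.mean(np.abs(w))) + eps
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    q = np.clip(np.rint(w / scale), -1, 1).astype(np.int8)
    return q, scale

# Toy 16-bit-style weights (synthetic, for illustration only).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)

q, scale = ternary_quantise(w)
w_hat = q.astype(np.float32) * scale          # dequantised approximation
mse = float(np.mean((w - w_hat) ** 2))        # error that continued training must recover

print(sorted(np.unique(q).tolist()), scale, mse)
```

The quantisation error measured by `mse` is the kind of damage that continued training has to repair after the weights are snapped to three values.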

The model is not trained from scratch in ternary. It is recovered: we start from a 16-bit mixture-of-experts parent, quantise its weights to ternary, and continue training. Across the first 5 billion tokens of continual pretraining - 5% of the planned 100 billion - cross-entropy loss fell from 12.64 to 1.87. The resulting 24-gigabyte file is small enough to run on the customer's own hardware, with no data leaving their network.

Steady-state recovery loss · log-log. Per-step measurements (faint), rolling mean (bold), and a power-law projection carried through the learning-rate cooldown to 100 billion tokens (dashed). Projected loss at 100B: ~1.66.
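The 1.58-bit and 24 GB figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes every one of the 122 billion parameters is stored at the information-theoretic ternary rate with no overhead for scales, embeddings, or file metadata, and uses decimal gigabytes; the article does not break the file down this way.

```python
import math

TOTAL_PARAMS = 122e9                  # total parameters, from the article

# One ternary value carries log2(3) bits of information.
bits_per_weight = math.log2(3)

# Assumption: all parameters are ternary and packed at exactly this rate.
size_gb = TOTAL_PARAMS * bits_per_weight / 8 / 1e9

# Same parameter count at 16 bits per weight, for comparison.
fp16_gb = TOTAL_PARAMS * 16 / 8 / 1e9

print(f"{bits_per_weight:.3f} bits/weight")
print(f"ternary: {size_gb:.1f} GB, 16-bit: {fp16_gb:.0f} GB")
```

Under these assumptions the ternary size comes out at roughly 24 GB, about a tenth of the 16-bit footprint, which matches the on-disk figure quoted above.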

MoE routing

A mixture-of-experts model routes each token through a small subset of specialised sub-networks. Wisteria has 256 of these experts; each token is routed to a few of them. Under heavy compression, some experts can go silent - they stop being selected, never get updated, and effectively disappear from the model. This is "expert collapse," and it is one of the standard failure modes when compressing MoE architectures.

At training step 10, twenty of Wisteria's 256 experts were collapsed and effective routing utilisation was 186 / 256. Utilisation is entropy-weighted rather than a count of live experts, which is why it sits below 236. All experts were back online by step 70. By the end of the 5B-token continual pretrain, all 256 experts were active, zero were collapsed, routing entropy was 0.997, and load coefficient of variation was 0.20 - effectively uniform balance. To our knowledge, as of May 2026, no public technical report has demonstrated stable MoE recovery at this scale under ternary compression. The recipe - how training was scheduled, how experts were sharded across hardware, how routing was stabilised - is what makes it practical for a customer to retrain the model on their own data, on their own machines, without that data leaving the network.

Effective routing utilisation (entropy-weighted, of 256, left axis) and discrete collapsed experts (right axis, dashed). Log-scale training step, measured across the full 5-billion-token continual pretrain. Endpoint: all 256 experts active, 0 collapsed, routing entropy 0.997, load coefficient of variation 0.20 — effectively uniform load balance.
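The routing metrics above can be computed from per-expert token counts. The sketch below uses common definitions that are assumptions here, since the article does not define them: normalised Shannon entropy, effective experts as the exponential of that entropy, and coefficient of variation as standard deviation over mean. The function name `routing_stats` and the toy counts are hypothetical.

```python
import numpy as np

def routing_stats(counts) -> dict:
    """Summarise expert load from per-expert token counts (assumed definitions)."""
    counts = np.asarray(counts, dtype=np.float64)
    n = counts.size
    p = counts / counts.sum()            # fraction of tokens sent to each expert
    nz = p[p > 0]                        # skip dead experts: 0 * log(0) is taken as 0
    entropy = -np.sum(nz * np.log(nz))   # Shannon entropy in nats
    return {
        "collapsed": int(np.sum(counts == 0)),
        "normalised_entropy": float(entropy / np.log(n)),  # 1.0 means uniform
        "effective_experts": float(np.exp(entropy)),       # entropy-weighted count
        "load_cv": float(counts.std() / counts.mean()),    # 0.0 means uniform
    }

# Perfectly balanced routing over 256 experts (synthetic counts).
uniform = routing_stats(np.full(256, 1000))

# Same load, but 20 experts receive no tokens at all (synthetic counts).
skewed_counts = np.full(256, 1000)
skewed_counts[:20] = 0
skewed = routing_stats(skewed_counts)

print(uniform)
print(skewed)
```

With these definitions, 20 dead experts and perfectly even load across the rest gives 236 effective experts. A reading lower than that would be consistent with the surviving experts also being unevenly loaded.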

Inference performance

Ternary models are usually served by decompressing the weights back to floating-point at inference time. We took a different path and wrote the inference stack ternary-first: GPU and CPU kernels that operate on the {-1, 0, +1} weights directly, without ever materialising a floating-point copy. Two backends share that stack - one for NVIDIA datacenter GPUs (CUDA), one for Apple Silicon (Metal). We call the engine trellis.cpp.
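To see why ternary weights need no floating-point copy, note that a matrix-vector product with weights in {-1, 0, +1} reduces to adding and subtracting activations. The NumPy sketch below shows only that arithmetic; it is not trellis.cpp, whose kernels and weight packing are not described here, and the names `ternary_matvec`, `q`, and `scale` are hypothetical.

```python
import numpy as np

def ternary_matvec(q: np.ndarray, scale: float, x: np.ndarray) -> np.ndarray:
    """Compute scale * (q @ x) for q in {-1, 0, +1} without multiplying by weights."""
    plus = (q == 1)     # positions where the activation is added
    minus = (q == -1)   # positions where the activation is subtracted
    # Zeros contribute nothing, so each output is a sum minus a sum.
    acc = np.where(plus, x, 0.0).sum(axis=1) - np.where(minus, x, 0.0).sum(axis=1)
    # The single per-tensor scale is applied once per output element.
    return scale * acc

# Toy ternary matrix and activation vector (synthetic, for illustration only).
rng = np.random.default_rng(0)
q = rng.integers(-1, 2, size=(4, 16)).astype(np.int8)
x = rng.normal(size=16).astype(np.float32)

y = ternary_matvec(q, 0.5, x)

# Reference: dequantise to float and use an ordinary matmul.
ref = 0.5 * (q.astype(np.float32) @ x)
print(np.allclose(y, ref, atol=1e-5))
```

A production kernel would also keep the weights packed in memory and decode them in registers; this sketch leaves packing out entirely.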

End-to-end decode throughput on Wisteria-122B across five rounds of kernel work on a single H100 — 777 → 10,758 sequence tokens / sec (13.8×).
MacBook M3 decode of Wisteria-122B at 8 tokens / sec, fully offline. 24 GB on disk, 18 GB RAM, experts streamed from SSD on demand.

The same engine runs Wisteria on a single H100 for datacenter inference and on an 18 GB MacBook M3 fully offline, with no network connection. The video at the top of the page is a live capture of the MacBook backend.

Capability

Wisteria has had 5 billion tokens of continual pretraining (5% of the planned 100 billion) and one supervised fine-tuning pass. The numbers below come from 50-question diagnostic subsets of four standard benchmarks, which we run as a fast capability signal during training. The 95% confidence interval at n = 50 is approximately ±14 points, so these are directional readings, not defensible parity claims.
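The ±14-point figure follows from the standard normal-approximation interval for a proportion, evaluated at the worst case of 50% accuracy. The sketch below assumes that is the interval being quoted; the helper name `ci_half_width` is hypothetical.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% half-width for an accuracy p on n questions, in points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

# Worst case: accuracy of 50% on a 50-question subset.
worst_case = ci_half_width(0.5, 50)

# The interval narrows slightly for scores further from 50%.
at_80 = ci_half_width(0.80, 50)

print(f"worst case: +/-{worst_case:.1f} points, at 80%: +/-{at_80:.1f} points")
```

The worst case comes out just under 14 points, so two scores on a 50-question subset need to differ by well over ten points before the difference means much.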

After 5B-token continual pretraining alone, Wisteria scores 80 / 78 / 46 / 44 on ARC-Easy / BoolQ / MMLU-Redux / OpenBookQA. After one SFT pass those move to 82 / 78 / 52 / 52, against GPT-OSS-120B's 82 / 78 / 62 / 84 on the same subsets. Wisteria trails by 10 points on MMLU-Redux and 32 on OpenBookQA; at n = 50, only the OpenBookQA gap clearly exceeds the ±14-point interval. We attribute both gaps to the remaining 95 billion tokens of pretraining and to the wider-corpus data-mixture run that hasn't happened yet: the knowledge tests depend on the breadth of training data more than the amount.

Wisteria-122B after 5B-token continual pretrain (faint) and one SFT pass (solid) against GPT-OSS-120B (dashed outline). n = 50 per benchmark; 95% CI is approximately ±14 percentage points.

What's next

What we're raising for

Doses AI is recovering 100-billion-parameter language models into ternary form so customers can run them on their own hardware. Wisteria-122B-A10B is our first model.

We're raising a seed round to complete Wisteria's pretraining (95 billion tokens remaining), run full-suite benchmarks, and test whether the recipe scales to one trillion parameters.

Doses AI was founded in London and received a $100,000 Llama Impact Grant from Meta in 2025.
