Frontier-scale language models that run on the customer's own hardware.

Live capture: Wisteria-122B-A10B running on a MacBook M3 with 18 GB RAM through trellis.cpp — our custom on-device inference engine. Greedy decoding, fully offline.

Doses AI is starting with 100-billion-parameter mixture-of-experts language models at 1.58-bit ternary precision - small enough to run on a MacBook, accurate enough to matter for hospitals, banks, government agencies, and other organisations whose data has to stay on their own systems.

Wisteria-122B-A10B is our first model. 122 billion total parameters, 10 billion active per token, 24 GB on disk. 8 tokens / sec on an 18 GB MacBook M3. 10,758 tokens / sec on a single H100. 5 billion tokens of continual pretraining - 5% of the planned 100 billion.

To our knowledge, as of May 2026, no public technical report has shown stable mixture-of-experts recovery at this scale under ternary compression. Prior public ternary work - Microsoft's BitNet b1.58, TII's Falcon-Edge family - tops out around ten billion parameters on dense architectures. The sections below walk through how the recovery worked, what the routing looks like, how inference performs, where the capability sits today, and what's left.

1.58 bits / weight · 122B total params · 24 GB on disk · 8 tok/s on MacBook M3

Ternary recovery

Standard language models store their weights as 16-bit floating-point numbers. Wisteria stores each weight as one of three values - {-1, 0, +1} - which takes about 1.58 bits per weight (log2 3 ≈ 1.585, the information content of a three-way choice). This is the same representation Microsoft used in their BitNet b1.58 paper and TII used in their Falcon-Edge family. To our knowledge, as of May 2026, no public technical report has applied this representation to a mixture-of-experts architecture above ten billion parameters. Wisteria applies it across 122 billion parameters.
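The 24 GB on-disk figure can be checked from the bit width alone. The sketch below is a back-of-envelope calculation, assuming decimal gigabytes and treating every parameter as ternary; any tensors kept at higher precision (embeddings, norms, scales) are not described on this page and are ignored here.

```python
import math


def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes for n_params weights at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 122e9  # total parameter count quoted for Wisteria-122B-A10B

# A three-valued weight carries log2(3) bits of information.
bits_ternary = math.log2(3)

ternary_gb = model_size_gb(TOTAL_PARAMS, 1.58)
fp16_gb = model_size_gb(TOTAL_PARAMS, 16)

print(f"log2(3) = {bits_ternary:.3f} bits")
print(f"ternary = {ternary_gb:.1f} GB")
print(f"16-bit  = {fp16_gb:.1f} GB")
```

Under those assumptions the ternary file is roughly a tenth the size of the same parameter count stored at 16 bits.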

The model is not trained from scratch in ternary. It is recovered: we start from a 16-bit mixture-of-experts parent, quantise its weights to ternary, and continue training. Across the first 5 billion tokens of continual pretraining - 5% of the planned 100 billion - cross-entropy loss fell from 12.64 to 1.87. The resulting 24-gigabyte file is small enough to run on the customer's own hardware, with no data leaving their network.
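The quantisation step can be illustrated with the absmean scheme from the BitNet b1.58 paper: divide by the mean absolute weight, round to the nearest integer, and clip to [-1, +1]. This is a minimal sketch of that published scheme; whether Wisteria's recipe uses exactly this quantiser, or a per-tensor rather than per-group scale, is an assumption and is not stated here. The function names are illustrative.

```python
def ternarise(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Map weights to {-1, 0, +1} with one shared scale (absmean quantisation)."""
    # Scale is the mean absolute value; eps guards against an all-zero tensor.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to nearest integer after scaling, then clip to the ternary range.
    quantised = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantised]


example_weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.02]
example_q, example_scale = ternarise(example_weights)
print(example_q, round(example_scale, 4))
```

Small weights land on 0 and large ones saturate at ±1, which is why the quantised model needs further training to recover.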

Steady-state recovery loss · log-log. Per-step measurements (faint), rolling mean (bold), and a power-law projection carried through the learning-rate cooldown to 100 billion tokens (dashed). Projected loss at 100B: ~1.66.
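A power-law projection of this kind is a straight-line fit in log-log space. The sketch below shows the mechanics on synthetic points generated from a known power law; the data, the coefficients, and the function names are illustrative assumptions, not Wisteria's measurements, and the fit ignores learning-rate cooldown effects.

```python
import math


def fit_power_law(tokens: list[float], losses: list[float]) -> tuple[float, float]:
    """Least-squares fit of loss = a * tokens**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def project_loss(a: float, b: float, tokens: float) -> float:
    """Extrapolate the fitted curve to a larger token budget."""
    return a * tokens ** (-b)


# Synthetic curve with known parameters, used only to exercise the fit.
synthetic_tokens = [1e9, 2e9, 3e9, 4e9, 5e9]
synthetic_losses = [3.0 * t ** (-0.02) for t in synthetic_tokens]

a_fit, b_fit = fit_power_law(synthetic_tokens, synthetic_losses)
projected = project_loss(a_fit, b_fit, 100e9)
print(f"a = {a_fit:.3f}, b = {b_fit:.4f}")
```

The projection is only as good as the assumption that the exponent stays constant beyond the measured range.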

MoE routing

A mixture-of-experts model routes each token through a small subset of specialised sub-networks. Wisteria has 256 of these experts; each token is routed to a few of them. Under heavy compression, some experts can go silent - they stop being selected, never get updated, and effectively disappear from the model. This is "expert collapse," and it is one of the standard failure modes when compressing MoE architectures.
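Top-k routing, the common way to pick "a few" experts per token, can be sketched as follows. The expert count of 8 and k = 2 are illustrative values chosen for readability; the number of experts Wisteria activates per token is not stated here, and the function name is hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Select the k highest-scoring experts and softmax-normalise their gate weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# One token's router scores over 8 hypothetical experts.
example_logits = [0.0, 1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 0.2]
routed = top_k_route(example_logits, k=2)
print(routed)
```

An expert whose logit never reaches the top k for any token receives no gradient through its weights, which is how collapse becomes self-reinforcing.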

At training step 10, twenty of Wisteria's 256 experts were collapsed and effective routing utilisation was 186 / 256. All experts were back online by step 70. By the end of the 5B-token continual pretrain, all 256 experts were active, zero were collapsed, routing entropy was 0.997, and load coefficient of variation was 0.20 - effectively uniform balance. To our knowledge, as of May 2026, no public technical report has demonstrated stable MoE recovery at this scale under ternary compression. The recipe - how training was scheduled, how experts were sharded across hardware, how routing was stabilised - is what makes it practical for a customer to retrain the model on their own data, on their own machines, without that data leaving the network.
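Metrics of this kind can be computed from per-expert token counts. The definitions below are assumptions, since the exact formulas are not given here: an expert counts as collapsed when it receives zero tokens, routing entropy is normalised by ln(N), effective utilisation is exp(entropy), and the coefficient of variation uses the population standard deviation.

```python
import math


def routing_stats(token_counts: list[int]) -> dict:
    """Load-balance metrics from the number of tokens routed to each expert."""
    n = len(token_counts)
    total = sum(token_counts)
    probs = [c / total for c in token_counts]
    # Shannon entropy of the load distribution; empty experts contribute zero.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    mean = total / n
    variance = sum((c - mean) ** 2 for c in token_counts) / n
    return {
        "collapsed": sum(1 for c in token_counts if c == 0),
        "normalised_entropy": entropy / math.log(n),
        "effective_experts": math.exp(entropy),
        "load_cv": math.sqrt(variance) / mean,
    }


# Two illustrative load patterns over 256 experts (not measured data).
uniform = routing_stats([10] * 256)
partly_collapsed = routing_stats([10] * 236 + [0] * 20)
print(uniform)
print(partly_collapsed)
```

With perfectly even load the normalised entropy is 1 and the coefficient of variation is 0; twenty silent experts with the rest evenly loaded gives an effective count of 236.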

Effective routing utilisation (entropy-weighted, of 256, left axis) and discrete collapsed experts (right axis, dashed). Log-scale training step, measured across the full 5-billion-token continual pretrain. Endpoint: all 256 experts active, 0 collapsed, routing entropy 0.997, load coefficient of variation 0.20 — effectively uniform load balance.

Inference performance

Ternary models are usually served by decompressing the weights back to floating-point at inference time. We took a different path and wrote the inference stack ternary-first: GPU and CPU kernels that operate on the {-1, 0, +1} weights directly, without ever materialising a floating-point copy. Two backends share that stack - one for NVIDIA datacenter GPUs (CUDA), one for Apple Silicon (Metal). We call the engine trellis.cpp.
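The arithmetic behind a ternary-first kernel is that a weight of +1 adds the input, -1 subtracts it, and 0 skips it, so no weight multiplication is needed. The reference sketch below shows only that arithmetic in plain Python; it is not the trellis.cpp kernel, whose weight packing and CUDA/Metal implementation are not described here, and the single output scale is an assumption.

```python
def ternary_matvec(w_rows: list[list[int]], x: list[float], scale: float = 1.0) -> list[float]:
    """Compute scale * (W @ x) for W with entries in {-1, 0, +1} using adds and subtracts."""
    out = []
    for row in w_rows:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing and is skipped.
        # One multiplication per output element applies the shared scale.
        out.append(acc * scale)
    return out


example_w = [[1, 0, -1], [-1, -1, 1]]
example_x = [2.0, 3.0, 5.0]
example_y = ternary_matvec(example_w, example_x, scale=0.5)
print(example_y)
```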

End-to-end decode throughput on Wisteria-122B across five rounds of kernel work on a single H100 — 777 → 10,758 sequence tokens / sec (13.8×).

MacBook M3 decode of Wisteria-122B at 8 tokens / sec, fully offline. 24 GB on disk, 18 GB RAM, experts streamed from SSD on demand.
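Streaming experts on demand lets a 24 GB model run in 18 GB of RAM because only some experts need to be resident at once. One plausible way to manage that is a least-recently-used cache in front of the loader. The sketch below is an assumption about the general technique, not a description of trellis.cpp; the class name, the eviction policy, and the loader callback are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keep at most `capacity` experts in memory, evicting the least recently used."""

    def __init__(self, capacity: int, load_expert: Callable[[int], bytes]):
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical loader, e.g. a read from SSD
        self._resident = OrderedDict()
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_expert(expert_id)
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)  # drop the least recently used expert
        return weights


# Stand-in loader that records which experts were fetched from "disk".
loads = []


def fake_loader(expert_id: int) -> bytes:
    loads.append(expert_id)
    return bytes([expert_id])


cache = ExpertCache(capacity=2, load_expert=fake_loader)
for expert_id in [1, 2, 1, 3, 2]:
    cache.get(expert_id)
print(cache.misses, loads)
```

In this toy trace the repeated request for expert 1 is served from memory, while expert 2 is evicted and has to be reloaded.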

The same engine runs Wisteria on a single H100 for datacenter inference and on an 18 GB MacBook M3 fully offline, with no network connection. The video at the top of the page is a live capture of the MacBook backend.

Capability

Wisteria has had 5 billion tokens of continual pretraining (5% of the planned 100 billion) and one supervised fine-tuning pass. The numbers below come from 50-question diagnostic subsets of four standard benchmarks, which we run as a fast capability signal during training. The 95% confidence interval at n = 50 is approximately ±14 points, so these are directional readings, not defensible parity claims.
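The ±14-point figure follows from the normal approximation to a binomial proportion at its widest point, p = 0.5. A minimal check, with an illustrative function name:

```python
import math


def ci_half_width(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation 95% half-width, in percentage points, for accuracy p on n questions."""
    return 100 * z * math.sqrt(p * (1 - p) / n)


margin_50 = ci_half_width(50)
margin_50_at_80 = ci_half_width(50, p=0.8)
print(f"n = 50, p = 0.5: +/- {margin_50:.1f} points")
print(f"n = 50, p = 0.8: +/- {margin_50_at_80:.1f} points")
```

The worst case at p = 0.5 is just under 14 points; at an observed accuracy of 80% the interval narrows to about 11 points.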

After 5B-token continual pretraining alone, Wisteria scores 80 / 78 / 46 / 44 on ARC-Easy / BoolQ / MMLU-Redux / OpenBookQA. After one SFT pass those move to 82 / 78 / 52 / 52, against GPT-OSS-120B's 82 / 78 / 62 / 84 on the same subsets. ARC-Easy and BoolQ match on these subsets. The 32-point OpenBookQA gap is well outside sampling noise at n = 50; the 10-point MMLU-Redux gap sits within it but points the same way. We read both as reflecting the remaining 95 billion tokens of pretraining and the data-mixture work that hasn't run yet: the knowledge tests depend on the breadth of training data more than the amount.

Wisteria-122B after 5B-token continual pretrain (faint) and one SFT pass (solid) against GPT-OSS-120B (dashed outline). n = 50 per benchmark; 95% CI is approximately ±14 percentage points.

What's next

Full-suite benchmarks. The numbers in the section above are 50-question diagnostic subsets, used as a fast capability signal during training. Full-suite re-runs on ARC-Easy (5,197 questions), BoolQ (3,270), OpenBookQA (500), and MMLU-Redux (~5,800) are the next deliverable and will replace these diagnostic numbers.
Recovery delta to the 16-bit parent. The natural reference for Wisteria is the FP16 mixture-of-experts parent the recipe was applied to. Once Wisteria has finished its 100B-token pretraining, the delta against the parent's published scores tells you the actual capability cost of ternary compression — until then, any gap mixes compression cost with undertraining cost.
Scaling-law fit, then a 1T-parameter run. The log-log fit on the recovery loss curve in the Ternary recovery section, carried through the learning-rate cooldown, projects cross-entropy loss of ~1.66 at 100 billion tokens, assuming the slope holds from 5B to the full pretrain budget. Whether that same slope carries across a roughly 8× model-size jump to one trillion parameters is the empirical question; the scan that answers it runs after Wisteria's 100B pretrain completes. The trillion-parameter claim is a hypothesis backed by a fit, not a proven result.
Deployment unit. The trellis.cpp engine ships as a stand-alone runtime (CUDA + Metal). The deployment unit is the customer running Wisteria on their own servers: we ship the model artefact, the runtime, and the recovery recipe, and the customer's data never leaves their network.
Post-training. The published checkpoint has had one supervised fine-tuning pass on a small instruction-following dataset. Full instruction tuning, preference optimisation, and customer-domain fine-tunes are scheduled once the pretrain completes.

What we're raising for

Doses AI is recovering 100-billion-parameter language models into ternary form so customers can run them on their own hardware. Wisteria-122B-A10B is our first model.

We're raising a seed round to complete Wisteria's pretraining (95 billion tokens remaining), run full-suite benchmarks, and test whether the recipe scales to one trillion parameters.

Doses AI was founded in London and received a $100,000 Llama Impact Grant from Meta in 2025.


Doses AI · 2026 · privacy