Bonsai 1-bit: An 8B LLM that fits in about 1 GB

PrismML just released something that caught my attention: an 8B-parameter LLM that's only 1.15 GB. They managed to squeeze 8 billion parameters into a file small enough to fit on an iPhone. I spent some time looking into how it works and tried running it locally.

Table of Contents

What does a "1-bit" LLM mean
Does it actually work?
It's fast
Smaller variants
How to run it
Why it stands out
A few caveats
Wrapping up

What does a "1-bit" LLM mean

If you've run local models before, you're probably familiar with quantization. A standard 8B model in 16-bit precision takes up roughly 16 GB. Quantize it to 4-bit and you're down to about 4 GB.

These are post-training techniques that approximate the original weights to save space, at the cost of some accuracy.
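The size figures above are just parameter count times bits per weight. Here's a minimal sketch of that arithmetic, assuming exactly 8 billion parameters and 1 GB = 10^9 bytes (both are my simplifications, and the helper names are made up for illustration):

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(file_gb: float, n_params: float) -> float:
    """Bits per parameter implied by a given file size."""
    return file_gb * 1e9 * 8 / n_params


N_PARAMS = 8e9  # assumption: exactly 8 billion parameters

for bits in (16, 4, 1):
    print(f"{bits:>2}-bit: {weights_size_gb(N_PARAMS, bits):.2f} GB")

# The released file is 1.15 GB, so the implied figure is a bit above 1 bit/weight.
print(f"implied: {effective_bits_per_weight(1.15, N_PARAMS):.2f} bits per weight")
```

A pure 1-bit encoding of 8 billion weights would be 1.00 GB, so the 1.15 GB file works out to roughly 1.15 bits per parameter under these assumptions. The source doesn't break down where the extra 0.15 GB goes.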

Bonsai does something different. Every weight in the entire network is either -1 or +1. That's it. One bit per weight. And unlike typical quantization, this isn't applied after training. The model is trained natively at 1-bit precision. Embeddings, attention layers, and the language model head are all 1-bit, end to end.

Relatedly, you might remember Microsoft's BitNet research, which used 1.58-bit ternary weights (-1, 0, +1). Having that zero value lets the network effectively "turn off" connections. Bonsai drops the zero entirely: every weight must be -1 or +1, with no off switch.
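To make "one bit per weight" concrete, here's a small sketch that packs -1/+1 weights into bytes and shows where the 1.58 figure for ternary weights comes from. The bit layout (+1 stored as 1, -1 stored as 0, least significant bit first) is purely my own choice for illustration, not PrismML's actual file format:

```python
import math


def pack_signs(weights):
    """Pack -1/+1 weights into bytes, one bit each (+1 -> 1, -1 -> 0)."""
    if any(w not in (-1, 1) for w in weights):
        raise ValueError("binary weights must be -1 or +1")
    packed = bytearray((len(weights) + 7) // 8)
    for i, w in enumerate(weights):
        if w == 1:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_signs(packed, n):
    """Recover the first n weights from the packed bytes."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]


weights = [1, -1, -1, 1, 1, 1, -1, 1, -1, 1]
packed = pack_signs(weights)
assert unpack_signs(packed, len(weights)) == weights
print(f"{len(weights)} weights -> {len(packed)} bytes")

# Information per weight: two states need log2(2) bits, three need log2(3).
print(f"binary:  {math.log2(2):.2f} bits")
print(f"ternary: {math.log2(3):.2f} bits")
```

The second half is the whole story behind "1.58-bit": a weight with three possible values carries log2(3) bits of information, while a weight with two carries exactly one.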

Does it actually work?

This is the part I was most curious about. PrismML's own benchmarks show Bonsai 8B competitive with other 8B-class models, but self-reported benchmarks only tell you so much.

Here's what the fine folks at r/LocalLLaMA are saying:

General chat and Q&A: works well, surprisingly coherent for a 1 GB model
Email drafting, simple math, story writing: handles these fine
Factual questions: hallucinates on some (not unusual for small models)
Complex coding or structured JSON output: struggles, as you'd expect at this size

To put the size in perspective: Bonsai 8B at 1.15 GB delivers roughly similar benchmark scores to Ministral 3B and Qwen3 4B, models that are 6-8 GB in 16-bit. The raw accuracy isn't groundbreaking, but the ratio of capability to size is.

It's fast

Because the model is tiny, it flies on consumer hardware. According to PrismML's own benchmarks:

Device	Tokens/sec
RTX 4090	368 tok/s
M4 Pro Mac	131 tok/s
iPhone 17 Pro Max	44 tok/s
iPhone 17 Pro	~40 tok/s

For reference, a standard 16-bit 8B model needs about 16 GB for its weights alone, so it can't fit on any current iPhone. Bonsai reportedly runs on one at 40 to 44 tokens per second.

Smaller variants

PrismML also released two smaller models in the same family: 1-bit Bonsai 4B and 1-bit Bonsai 1.7B. Both use the same 1-bit approach and are available on Hugging Face.

How to run it

Here's the catch: you can't just drop this into LM Studio or stock llama.cpp today. Mainstream inference engines don't support 1-bit weights yet, so it's kind of a battle to try it out.

What works right now:

Mac / iPhone / iPad: PrismML's fork of MLX with 1-bit kernel support
NVIDIA GPUs: A fork of llama.cpp with CUDA support
iOS: The Locally AI app supports it (I tried it and it's working well on an iPhone 17 Pro)
Browser: You can try it directly via Hugging Face

What doesn't work yet:

I tried loading the MLX variant in LM Studio and got:

Failed to load model:

ValueError: [quantize] The requested number of bits 1
is not supported. The supported bits are 2, 3, 4, 5, 6 and 8.

You can download it, but it won't load yet (there's an open issue on GitHub). This will likely be resolved once the MLX upstream changes land. Ollama doesn't officially support it yet either.

Why it stands out

Bonsai isn't going to replace your cloud API for serious workloads. It's an 8B model. But that's not really the point.

What's interesting is the proof of concept. If you can get useful intelligence out of a 1 GB model running at 40 tok/s on a phone, that opens up a lot of doors: offline assistants, on-device agents, privacy-sensitive applications, edge devices, robotics.

There's also a hardware angle. LLM inference is heavily bottlenecked by memory bandwidth. The GPU can spend more time waiting for weights to arrive from memory than it does actually crunching numbers. Moving 1.15 GB of weights is much faster than shuttling 16 GB around. On top of that, with every weight at -1 or +1, the matrix math in linear layers reduces to additions and subtractions instead of multiplications, so the compute itself is cheaper too.

Combine that with custom silicon built for 1-bit math, and things could go much further.
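Both hardware arguments are easy to sketch. The first function below computes a dot product against -1/+1 weights using only adds and subtracts. The second gives a rough ceiling on decode speed, assuming every weight is read from memory once per generated token. The 100 GB/s bandwidth is a made-up round number, not a spec for any real device, and real 1-bit kernels may do extra work (such as applying scale factors) that this ignores:

```python
def dot_pm1(x, signs):
    """Dot product of activations x with -1/+1 weights, with no multiplications."""
    acc = 0.0
    for xi, s in zip(x, signs):
        if s == 1:
            acc += xi
        else:
            acc -= xi
    return acc


def max_tokens_per_sec(bandwidth_gb_s, weights_gb):
    """Bandwidth-bound ceiling: each token requires reading all weights once."""
    return bandwidth_gb_s / weights_gb


x = [0.5, -1.25, 2.0, 0.75]
signs = [1, -1, -1, 1]
# Same result as the ordinary multiply-and-sum dot product.
assert dot_pm1(x, signs) == sum(xi * s for xi, s in zip(x, signs))

BANDWIDTH_GB_S = 100.0  # hypothetical device, for illustration only
for name, size_gb in (("16-bit 8B", 16.0), ("Bonsai 8B", 1.15)):
    ceiling = max_tokens_per_sec(BANDWIDTH_GB_S, size_gb)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

Whatever bandwidth you plug in, the ratio between the two ceilings stays the same: 16 / 1.15, or roughly 14x in Bonsai's favor.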

A few caveats

Before you get too excited:

The training method is proprietary: PrismML's whitepaper describes the results but not the full compression pipeline. You can use the models freely (Apache 2.0), but you can't easily reproduce how they were made. That's a fair tradeoff for free weights, but worth knowing.
Benchmarks are self-reported: Community testing is encouraging but still early. Don't expect this to replace larger models for complex tasks.
1.15 GB is just the weights: It's not entirely clear whether the 1-bit architecture only applies to the model's static weights or also the KV cache (the memory the model uses to track your conversation), which may still run in higher precision.
Ecosystem support is early: You'll need forked tools for now. Once upstream PRs to MLX and llama.cpp land, this friction should go away.
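On the KV cache caveat: here's a back-of-the-envelope sketch of how big a higher-precision cache could get, using the standard formula for a transformer's KV cache. The layer count, KV head count, and head dimension below are HYPOTHETICAL values I picked for illustration. They are not Bonsai's published configuration, and whether Bonsai's cache is actually 16-bit is exactly the open question:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values (the factor of 2) for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# HYPOTHETICAL config, for illustration only; not Bonsai's real architecture.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (1024, 8192, 32768):
    gb = kv_cache_bytes(seq_len=seq_len, **cfg) / 1e9
    print(f"{seq_len:>6} tokens: {gb:.2f} GB of 16-bit KV cache")
```

With these made-up numbers, an 8,192-token conversation would need about 1.07 GB of cache, which is in the same ballpark as the weights themselves. So total memory use at long context could be noticeably higher than 1.15 GB.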

Wrapping up

Bonsai isn't the most capable model you can run locally, but IMO it might be one of the most interesting.

What I'm most curious about is what happens when this technique scales up. A 70B model at this compression ratio would be around 10 GB, well within reach of a laptop. If the quality holds, that's a big deal.
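The 10 GB figure is a straight linear extrapolation from Bonsai 8B's file size. A quick sketch, assuming file size scales in proportion to parameter count (the function name is mine, and nothing here says a 70B version exists):

```python
def scaled_size_gb(ref_size_gb, ref_params_b, target_params_b):
    """Scale a reference file size linearly with parameter count (in billions)."""
    return ref_size_gb * target_params_b / ref_params_b


size_1bit = scaled_size_gb(1.15, 8, 70)  # Bonsai-style compression
size_16bit = 70e9 * 16 / 8 / 1e9         # plain 16-bit weights

print(f"70B at Bonsai's ratio: ~{size_1bit:.1f} GB")
print(f"70B at 16-bit:         ~{size_16bit:.0f} GB")
```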
