A client asked me for a simple thing.
Not ChatGPT.
Not an agent.
Not a multimodal assistant that can explain invoices, generate React components, and write poetry in three languages.
Just a small classifier embedded into a website.
The job sounded boring in the best possible way:
take some text, classify it, return a result, keep it fast.
So I started looking at the usual solutions.
And then I had one of those moments where you stop reading documentation, lean back, and ask:
Are we seriously doing this?
Because the answer I kept running into looked like this:
- download a huge runtime
- download a huge model
- initialize a big ML stack
- then classify one small piece of text
In one setup, the download was approaching 250 MB per user.
For a simple classifier.
On a website.
From a server.
Every time.
No. Sorry. That is insane.
The problem
The web has a strange habit now.
You ask for one small AI feature, and the answer is often:
bring the entire construction company.
But sometimes I do not need a construction company.
I need one person on the construction site.
One task.
One tool.
One result.
This is especially true for simple classification, embeddings, semantic search, routing, filtering, ranking, and other small local decisions.
Not every AI problem needs an LLM.
Not every website needs a full inference engine.
Not every user should pay a 250 MB download tax because we were too lazy to think smaller.
So I started digging
I wanted something simple:
- runs in the browser
- does not require a server for inference
- small enough to actually ship
- works with transformer-style models
- can tokenize text
- can run BERT-like forward inference
- can produce embeddings or classification input
- does not bring ONNX Runtime, Candle, ndarray, or half the internet with it
At first I thought:
“Surely someone already made the tiny version.”
There are great tools out there.
Transformers.js is powerful.
ONNX Runtime Web is powerful.
Candle is powerful.
But that was exactly the problem.
They are powerful because they are general.
I did not need general.
I needed narrow.
I needed small.
So I built one.
Introducing wasmicro
wasmicro is my attempt at a tiny transformer inference runtime for the web.
Current WASM bundle size:
~94 KB
Not 94 MB.
94 KB.
The project is here:
https://github.com/Xzdes/wasmicro
Live demo:
https://xzdes.github.io/wasmicro/
It is not perfect.
It is not finished.
It is still being tested.
But it already does the thing I needed: run a small transformer-style pipeline in the browser without shipping a giant runtime.
What it does today
Right now wasmicro supports:
- tiny owned tensors
- safetensors loading
- WordPiece tokenizer
- BERT encoder forward pass
- mean pooling for embeddings
- WASM bindings
- SIMD128 matmul path
- i8/u8/q4 quantized weight types
- converter tool for HuggingFace models
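To make one item from that list concrete, here is a minimal sketch of mean pooling: averaging the encoder's per-token outputs into a single sentence embedding while skipping padding tokens. This is an illustration in plain Rust, not wasmicro's actual code. The function name `mean_pool`, the row-major `[seq_len x dim]` layout, and the `u8` attention mask are assumptions made for the example.

```rust
/// Mean-pools token embeddings into one sentence embedding.
///
/// `hidden` is assumed to be a row-major [seq_len x dim] buffer of encoder outputs.
/// `mask` has one entry per token: 1 for a real token, 0 for padding.
/// Returns `None` if the shapes do not match or every token is masked out.
pub fn mean_pool(hidden: &[f32], mask: &[u8], dim: usize) -> Option<Vec<f32>> {
    if dim == 0 || hidden.len() != mask.len() * dim {
        return None;
    }

    let mut pooled = vec![0.0f32; dim];
    let mut kept = 0usize;

    // Sum the rows that belong to real (unmasked) tokens.
    for (row, &m) in hidden.chunks_exact(dim).zip(mask) {
        if m == 0 {
            continue;
        }
        kept += 1;
        for (acc, &v) in pooled.iter_mut().zip(row) {
            *acc += v;
        }
    }

    if kept == 0 {
        return None;
    }

    // Divide by the number of real tokens, not by seq_len.
    let inv = 1.0 / kept as f32;
    for v in &mut pooled {
        *v *= inv;
    }
    Some(pooled)
}
```

Dividing by the count of unmasked tokens rather than the padded sequence length keeps the embedding independent of how much padding a batch happened to need.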
The design rule is simple:
if it does not make the WASM bundle smaller, faster, or more useful for a transformer architecture, it probably does not belong.
No training.
No autograd.
No optimizer.
No general tensor framework.
No “module zoo”.
Just forward inference.
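Forward-only inference is also what makes quantized weights easy to use: there are no gradients to track, so weights can be stored in fewer bits and scaled back on the fly. The sketch below shows one common scheme, symmetric per-tensor i8 quantization. Whether wasmicro's i8 type works exactly this way is an assumption; the names `QuantizedI8`, `quantize_i8`, `dequantize_i8`, and `dot_i8` are hypothetical.

```rust
/// Symmetric per-tensor i8 quantization: w is approximated by scale * q,
/// with q stored as an i8 in [-127, 127].
pub struct QuantizedI8 {
    pub scale: f32,
    pub values: Vec<i8>,
}

/// Quantizes f32 weights to i8 using one scale for the whole tensor.
pub fn quantize_i8(weights: &[f32]) -> QuantizedI8 {
    let max_abs = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    // An all-zero tensor gets scale 1.0 so the division below is well defined.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let values = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    QuantizedI8 { scale, values }
}

/// Recovers approximate f32 weights from the quantized form.
pub fn dequantize_i8(q: &QuantizedI8) -> Vec<f32> {
    q.values.iter().map(|&v| v as f32 * q.scale).collect()
}

/// Dot product of f32 activations with a quantized weight row.
/// The scale is applied once at the end instead of once per element.
pub fn dot_i8(activations: &[f32], q: &QuantizedI8) -> f32 {
    let acc: f32 = activations
        .iter()
        .zip(&q.values)
        .map(|(&a, &v)| a * v as f32)
        .sum();
    acc * q.scale
}
```

Each weight takes one byte instead of four, at the cost of a rounding error of at most half a quantization step per weight.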
Why not just use a big runtime?
Because sometimes the runtime is bigger than the problem.
If I am building a serious AI application, yes, I will use serious AI infrastructure.
If I need WebGPU, many architectures, image models, audio models, generation, pipelines, fallback backends, and broad model support, then I should use the big tools.
But if I need a small classifier embedded into a website?
I do not want the entire AI ecosystem.
I want the smallest useful thing.
The difference is like hiring:
- a whole construction company
- or one worker with the correct tool
The web keeps giving me the company.
I needed the worker.
The current numbers
The WASM bundle is currently around:
94 KB after wasm-opt -Oz
The hard size ceiling I set for myself is:
250 KB
The default library dependency set is intentionally tiny.
The project avoids pulling in things like:
- ndarray
- candle
- rayon
- serde_json
- chrono
- getrandom
The converter CLI can use heavier dependencies, because it runs on a desktop machine and never ships to the browser.
The browser runtime stays small.
Is it faster?
That depends on what “fast” means.
If you mean maximum throughput on GPU against a fully optimized WebGPU runtime, probably not.
That is not the fight.
The fight I care about is:
- cold start
- first useful result
- small runtime
- simple embedding/classification tasks
- CPU/WASM path
- no huge framework download
I want to compete on:
- time to load
- time to first embedding
- runtime size
- simple integration
Not on “who supports 200 model architectures”.
That is a different game.
What is missing
A lot.
This is still early.
The project still needs:
- better public benchmarks
- easier model download story
- more optimized attention path
- less allocation during forward
- better q4 loading for BERT
- cleaner zero-config examples
- more browser measurements
- more real-world classification demos
Also, the live demo currently expects you to provide model files.
That is not ideal.
But it is already enough to prove the point:
You do not always need a massive runtime to do a small AI job.
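As an illustration of how small that job can be: once you have a sentence embedding, a classifier can be a cosine-similarity lookup against one reference embedding per label. This is a sketch under assumptions, not the classifier from the client project. The function names `cosine` and `classify` are hypothetical, and the reference embeddings would have to be computed ahead of time from labeled examples.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns the label whose reference embedding is most similar to `query`,
/// together with its similarity score. Returns `None` if `labels` is empty.
pub fn classify<'a>(
    query: &[f32],
    labels: &[(&'a str, Vec<f32>)],
) -> Option<(&'a str, f32)> {
    labels
        .iter()
        .map(|(name, reference)| (*name, cosine(query, reference)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

With two hypothetical labels, say `"spam"` with reference `[1.0, 0.0]` and `"ham"` with reference `[0.0, 1.0]`, a query embedding of `[0.9, 0.1]` is classified as `"spam"`. The returned score can also serve as a confidence threshold for rejecting inputs that match no label well.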
The real lesson
This started as a simple client task.
“Add a classifier to a website.”
Then I looked at the common path and saw the cost.
And I could not accept that the answer to a small feature was:
ship hundreds of megabytes and hope nobody notices.
Users notice.
Browsers notice.
Mobile connections notice.
Cold start notices.
So I built the smaller thing.
Not because it is perfect.
Because the alternative felt wrong.
Final thought
AI tooling is amazing right now.
But we are also getting lazy.
We reach for the biggest tool because it is convenient.
Sometimes that is correct.
Sometimes it is absurd.
If the job needs a crane, use a crane.
If the job needs one person with a hammer, do not send a construction company.
wasmicro is my attempt at the hammer.
Small.
Narrow.
Still rough.
But already useful.
GitHub:
https://github.com/Xzdes/wasmicro
Demo: