I Was Asked to Add a Simple Classifier to a Website. Then I Saw the 250 MB Download.

A client asked me for a simple thing.

Not ChatGPT.

Not an agent.

Not a multimodal assistant that can explain invoices, generate React components, and write poetry in three languages.

Just a small classifier embedded into a website.

The job sounded boring in the best possible way:

take some text, classify it, return a result, keep it fast.

So I started looking at the usual solutions.

And then I had one of those moments where you stop reading documentation, lean back, and ask:

Are we seriously doing this?

Because the answer I kept running into looked like this:

download a huge runtime

download a huge model

initialize a big ML stack

then classify one small piece of text

In one setup, the total download was getting close to 250 MB per user.

For a simple classifier.

On a website.

From a server.

Every time.

No. Sorry. That is insane.

The problem

The web has a strange habit now.

You ask for one small AI feature, and the answer is often:

bring the entire construction company.

But sometimes I do not need a construction company.

I need one person on the construction site.

One task.

One tool.

One result.

This is especially true for simple classification, embeddings, semantic search, routing, filtering, ranking, and small local decisions.

Not every AI problem needs an LLM.

Not every website needs a full inference engine.

Not every user should pay a 250 MB download tax because we were too lazy to think smaller.

So I started digging

I wanted something simple:

runs in the browser
does not require a server for inference
small enough to actually ship
works with transformer-style models
can tokenize text
can run BERT-like forward inference
can produce embeddings or classification input
does not bring ONNX Runtime, Candle, ndarray, or half the internet with it
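
For context on the "can tokenize text" item: BERT-style models usually split words with WordPiece, a greedy longest-match-first lookup against a fixed vocabulary. Here is a minimal sketch of that idea in plain Rust. It is not wasmicro's actual code. The function name `wordpiece`, the `HashSet` vocabulary, and the assumption that the input is one already-lowercased word are all mine, for illustration only.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece split of a single word.
/// Continuation pieces carry the conventional "##" prefix.
/// Returns ["[UNK]"] if any position in the word cannot be matched.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;

        // Try the longest remaining substring first, then shrink it.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate = format!("##{}", candidate);
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }

        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // Nothing in the vocabulary covers this position.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}
```

With a toy vocabulary of "un", "##aff", and "##able", the word "unaffable" splits into those three pieces. A real tokenizer also handles whitespace, punctuation, and special tokens, which this sketch leaves out.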

At first I thought:

“Surely someone already made the tiny version.”

There are great tools out there.

Transformers.js is powerful.

ONNX Runtime Web is powerful.

Candle is powerful.

But that was exactly the problem.

They are powerful because they are general.

I did not need general.

I needed narrow.

I needed small.

So I built one.

Introducing wasmicro

wasmicro is my attempt at a tiny transformer inference runtime for the web.

Current WASM bundle size:

~94 KB

Not 94 MB.

94 KB.

The project is here:

https://github.com/Xzdes/wasmicro

Live demo:

https://xzdes.github.io/wasmicro/

It is not perfect.

It is not finished.

It is still being tested.

But it already does the thing I needed: run a small transformer-style pipeline in the browser without shipping a giant runtime.

What it does today

Right now wasmicro supports:

tiny owned tensors
safetensors loading
WordPiece tokenizer
BERT encoder forward pass
mean pooling for embeddings
WASM bindings
SIMD128 matmul path
i8/u8/q4 quantized weight types
converter tool for Hugging Face models
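
To show how small one of these pieces can be, here is a sketch of mean pooling: average the encoder's per-token vectors, skipping padding, to get one sentence embedding. This is a generic version written for this post, not a copy of wasmicro's implementation. The function name, the row-major layout, and the `u8` attention mask are assumptions.

```rust
/// Mean-pool token vectors into a single embedding.
/// `hidden` is row-major [seq_len x dim]; `mask[i]` is 1 for a real token, 0 for padding.
fn mean_pool(hidden: &[f32], mask: &[u8], dim: usize) -> Vec<f32> {
    assert!(dim > 0, "dim must be non-zero");
    assert_eq!(hidden.len(), mask.len() * dim, "hidden must be seq_len * dim long");

    let mut sum = vec![0.0f32; dim];
    let mut kept = 0usize;

    for (row, &m) in hidden.chunks_exact(dim).zip(mask) {
        if m == 0 {
            continue; // padding does not contribute to the average
        }
        for (acc, &x) in sum.iter_mut().zip(row) {
            *acc += x;
        }
        kept += 1;
    }

    // Guard against dividing by zero when every position is padding.
    let denom = kept.max(1) as f32;
    sum.iter().map(|s| s / denom).collect()
}
```

No tensor framework is needed for this step: a flat slice and a loop are enough.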

The design rule is simple:

if it does not make the WASM bundle smaller, faster, or more useful for a transformer architecture, it probably does not belong.

No training.

No autograd.

No optimizer.

No general tensor framework.

No “module zoo”.

Just forward inference.

Why not just use a big runtime?

Because sometimes the runtime is bigger than the problem.

If I am building a serious AI application, yes, I will use serious AI infrastructure.

If I need WebGPU, many architectures, image models, audio models, generation, pipelines, fallback backends, and broad model support, then I should use the big tools.

But if I need a small classifier embedded into a website?

I do not want the entire AI ecosystem.

I want the smallest useful thing.

The difference is like hiring:

a whole construction company
or one worker with the correct tool

The web keeps giving me the company.

I needed the worker.

The current numbers

The WASM bundle is currently around:

94 KB after wasm-opt -Oz

The hard size ceiling I set for myself is:

250 KB
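
A ceiling like this only holds if something enforces it. Below is a sketch of a check that could run in CI against the optimized `.wasm` file. The names `MAX_WASM_BYTES`, `within_budget`, and `check_bundle` are hypothetical, and reading "250 KB" as 250 * 1024 bytes is my assumption.

```rust
use std::fs;
use std::path::Path;

/// Assumed ceiling: 250 KB, read here as 250 * 1024 bytes.
const MAX_WASM_BYTES: u64 = 250 * 1024;

/// True if a bundle of `size_bytes` fits under the ceiling.
fn within_budget(size_bytes: u64) -> bool {
    size_bytes <= MAX_WASM_BYTES
}

/// Return the file size if it is within budget, otherwise an error message.
fn check_bundle(path: &Path) -> Result<u64, String> {
    let size = fs::metadata(path)
        .map_err(|e| format!("cannot read {}: {}", path.display(), e))?
        .len();

    if within_budget(size) {
        Ok(size)
    } else {
        Err(format!(
            "{} is {} bytes, over the {} byte ceiling",
            path.display(),
            size,
            MAX_WASM_BYTES
        ))
    }
}
```

Failing the build on `Err` turns the ceiling from a good intention into a rule.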

The default library dependency set is intentionally tiny.

The project avoids pulling in things like:

ndarray
candle
rayon
serde_json
chrono
getrandom

The converter CLI can use heavier dependencies, because it runs on a desktop machine and never ships to the browser.

The browser runtime stays small.

Is it faster?

That depends on what “fast” means.

If you mean maximum throughput on a GPU against a fully optimized WebGPU runtime, probably not.

That is not the fight.

The fight I care about is:

cold start
first useful result
small runtime
simple embedding/classification tasks
CPU/WASM path
no huge framework download

I want to compete on:

time to load
time to first embedding
runtime size
simple integration

Not on “who supports 200 model architectures”.

That is a different game.

What is missing

A lot.

This is still early.

The project still needs:

better public benchmarks
an easier model download story
a more optimized attention path
less allocation during the forward pass
better q4 loading for BERT
cleaner zero-config examples
more browser measurements
more real-world classification demos
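
For readers wondering what the q4 item involves: 4-bit weights are commonly stored two per byte, with a scale per block to map them back to floats. The sketch below shows that general idea with a symmetric scheme and one scale per block. It is an illustration under my own assumptions (function names, nibble order, scale choice) and does not describe wasmicro's actual on-disk format.

```rust
/// Quantize one block of weights to signed 4-bit values in [-8, 7],
/// packed two per byte (low nibble first), with a single f32 scale.
fn quantize_q4(block: &[f32]) -> (f32, Vec<u8>) {
    let max_abs = block.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };

    let to_nibble = |w: f32| -> u8 {
        let q = (w / scale).round().clamp(-8.0, 7.0) as i8;
        (q as u8) & 0x0F // keep the low 4 bits of the two's-complement value
    };

    let packed = block
        .chunks(2)
        .map(|pair| {
            let lo = to_nibble(pair[0]);
            // An odd-length block pads the last high nibble with zero.
            let hi = if pair.len() == 2 { to_nibble(pair[1]) } else { 0 };
            lo | (hi << 4)
        })
        .collect();

    (scale, packed)
}

/// Unpack `len` weights from bytes produced by `quantize_q4`.
fn dequantize_q4(scale: f32, packed: &[u8], len: usize) -> Vec<f32> {
    let from_nibble = |n: u8| -> f32 {
        // Sign-extend: move the nibble to the top of the byte, then shift back arithmetically.
        let q = ((n << 4) as i8) >> 4;
        q as f32 * scale
    };

    packed
        .iter()
        .flat_map(|&b| [from_nibble(b & 0x0F), from_nibble(b >> 4)])
        .take(len)
        .collect()
}
```

Half a byte per weight is where the size win comes from. The cost is rounding error, which is why the loading path has to be handled carefully.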

Also, the live demo currently expects you to provide model files.

That is not ideal.

But it is already enough to prove the point:

You do not always need a massive runtime to do a small AI job.
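
To make the "small AI job" concrete: once you have an embedding for the input text, a classifier can be as simple as picking the label whose reference embedding is closest by cosine similarity. This is one common approach, sketched under my own assumptions. The `classify` function and the idea of one prototype vector per label are illustrative and not part of wasmicro's API.

```rust
use std::cmp::Ordering;

/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the label whose prototype embedding is most similar to `query`,
/// or None if there are no prototypes.
fn classify<'a>(query: &[f32], prototypes: &[(&'a str, Vec<f32>)]) -> Option<&'a str> {
    prototypes
        .iter()
        .map(|(label, proto)| (*label, cosine(query, proto)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(Ordering::Equal))
        .map(|(label, _)| label)
}
```

The prototypes could be averaged embeddings of a few example texts per label, computed once and shipped as a tiny file. Whether that is accurate enough depends on the task, so treat it as a starting point.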

The real lesson

This started as a simple client task.

“Add a classifier to a website.”

Then I looked at the common path and saw the cost.

And I could not accept that the answer to a small feature was:

ship hundreds of megabytes and hope nobody notices.

Users notice.

Browsers notice.

Mobile connections notice.

Cold start notices.

So I built the smaller thing.

Not because it is perfect.

Because the alternative felt wrong.

Final thought

AI tooling is amazing right now.

But we are also getting lazy.

We reach for the biggest tool because it is convenient.

Sometimes that is correct.

Sometimes it is absurd.

If the job needs a crane, use a crane.

If the job needs one person with a hammer, do not send a construction company.

wasmicro is my attempt at the hammer.

Small.

Narrow.

Still rough.

But already useful.

GitHub:

https://github.com/Xzdes/wasmicro

Demo:

https://xzdes.github.io/wasmicro/