A simple React hook for running local LLMs via WebGPU


Running AI inference natively in the browser is the holy grail for reducing API costs and keeping enterprise data private. But if you’ve actually tried to build it, you know the reality is a massive headache.

You have to manually configure WebLLM or Transformers.js, set up dedicated Web Workers so your main React thread doesn't freeze, handle browser caching for massive model files, and write custom state management just to track the loading progress. It is hours of complex, low-level boilerplate before you can even generate a single token.

I got tired of configuring the same WebGPU architecture over and over, so I wrapped the entire engine into a single, drop-in React hook: react-brai.

Initialize the engine. When multiple tabs are active, the hook automatically handles leader/follower negotiation between them.

```jsx
import { useEffect } from 'react';
import { useLocalAI } from 'react-brai';

export default function Chat() {
  const { loadModel, chat, isReady, tps } = useLocalAI();

  useEffect(() => {
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }, []);

  return <div>Speed: {tps} T/s</div>;
}
```

Once the model is loaded, just call chat like this:

```jsx
const response = await chat([
  { role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
  { role: "user", content: "I love this library!" }
]);
const data = JSON.parse(response);
```
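One practical caveat: calling `JSON.parse` directly on raw model output is brittle, because small models sometimes wrap JSON in markdown fences or add commentary around it. Here is a minimal defensive sketch; `parseModelJson` is my own illustrative helper, not part of react-brai's API.

```javascript
// Hypothetical helper (not part of react-brai): small models often wrap
// JSON in markdown fences or stray prose, so extract the first JSON
// object before parsing instead of calling JSON.parse on raw output.
function parseModelJson(raw) {
  // Strip markdown code fences like ```json ... ```
  const unfenced = raw.replace(/```(?:json)?/g, '').trim();
  // Grab the first {...} span in case the model added commentary.
  const match = unfenced.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('No JSON object found in model output');
  return JSON.parse(match[0]);
}

// Tolerates a fenced, chatty response:
const data = parseModelJson('Sure! ```json\n{ "sentiment": "pos" }\n```');
```

Tightening the system prompt helps, but a guard like this keeps one malformed response from throwing in production.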

It abstracts away the web worker delegation, the model caching, and the memory constraints. You just call the hook, pick a quantized SLM (like the Llama 3.2 1B build above), and start generating text or extracting JSON.

The Browser Cache
Let me be brutally honest: this is not for lightweight, general-purpose landing pages. react-brai requires the user to download a ~1.5GB to 3GB model into their local browser cache on first load.

But for high-profile, niche use cases, that initial heavy download is an incredibly cheap price to pay.
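To put a number on that first-load cost, the arithmetic is just size over bandwidth. The helper below is purely illustrative, not part of the library:

```javascript
// Back-of-envelope first-load cost:
// seconds = (size in bytes * 8 bits) / (link speed in bits per second).
// Illustrative helper, not part of react-brai.
function estimateDownloadSeconds(gigabytes, mbps) {
  return (gigabytes * 1e9 * 8) / (mbps * 1e6);
}

// A 3GB model on a 100 Mbps connection:
console.log(estimateDownloadSeconds(3, 100)); // 240 seconds, i.e. ~4 minutes
```

A one-time four-minute wait is a non-starter for a landing page, but trivial for an app someone opens every day.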

Where this actually makes sense

  • Heavy B2B Dashboards: The user logs in daily. They eat the download cost once, and forever after, their inference is instant and offline.
  • Enterprise Data Privacy: When strict rules prevent you from sending customer data to OpenAI, local WebGPU inference is your only secure option.
  • Automated JSON Extraction: Constantly formatting and extracting JSON from large datasets without burning through API tokens.
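Since not every browser ships WebGPU yet, it's worth gating the multi-gigabyte download behind a feature check before calling loadModel. The sketch below uses the standard `navigator.gpu` entry point; `canRunLocalModels` is my own naming, not a react-brai export.

```javascript
// Hypothetical gate (not part of react-brai): check for the standard
// WebGPU entry point, navigator.gpu, before kicking off a multi-GB
// model download. Fall back to a server-side API when it's missing.
function canRunLocalModels() {
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}

// e.g. if (canRunLocalModels()) loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
```

A stricter check would also await `navigator.gpu.requestAdapter()` to confirm an adapter is actually available, since the property can exist on hardware that cannot provide one.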

Try it out
I’ve published the package on NPM and set up a live playground. I’d love for fellow React devs to test the implementation and let me know how the memory management holds up on your hardware.

NPM: https://www.npmjs.com/package/react-brai

Live WebGPU Playground: https://react-brai.vercel.app

Source: dev.to
