I shipped a free Whisper transcription web app, then a ChatGPT GPT to feed it

Last month I had a 47-minute Korean podcast I wanted English subtitles for. I opened TurboScribe, hit my "3 free files per day" wall, and stared at the $20/month upsell. I needed this once. Maybe twice a quarter. Paying $240 a year so I could occasionally convert audio into text felt like buying a treadmill to use at Christmas.

So I closed the tab and did the thing every developer does when faced with a SaaS paywall: I asked whether I really needed the SaaS at all. The answer this time was no, and that turned into whisperweb.dev — a free Whisper transcription web app, a TurboScribe alternative that runs the model in your browser instead of in someone else's data center — and then a ChatGPT GPT layered on top of it. This is the story of why both exist and how they work together.

The problem with paid transcription tools

Otter.ai gives you 300 minutes a month, but only if you sign up, and only if your meetings are in English. Rev wants $0.25 per minute for their AI tier and you're handing them your audio. TurboScribe is the most generous of the bunch, but the free tier caps you at 30 minutes per file and 3 files a day, and the $20/month tier feels designed for people who transcribe full-time, not for someone who has one Korean podcast.

There's also a quieter problem nobody talks about: every one of these services uploads your audio to their servers. For a public podcast that's fine. For a recording of a job interview, a doctor's appointment, an internal meeting, or anything with a name in it, you're now trusting a third party with content you'd never email anyone.

The model these tools all use under the hood is Whisper, OpenAI's speech-to-text model that they open-sourced in late 2022. The weights are public. The inference code is public. There's no proprietary moat between you and the transcription. The only reason these companies can charge $20/month is that running Whisper has historically required a GPU server somewhere.

That stopped being true about eighteen months ago, when WebGPU shipped in Chrome.

The build: Whisper in the browser

Here's the technical pitch in one paragraph. Modern browsers (Chrome, Edge, and recent Safari) expose the GPU through the WebGPU API. Hugging Face's transformers.js library runs ONNX-converted Whisper checkpoints on top of WebGPU when available, falling back to WebAssembly with SIMD when it isn't. The model weights — somewhere between 75 MB for tiny and 1.5 GB for large-v3 — get downloaded once, cached in IndexedDB, and never need to be fetched again. Inference happens locally. The audio file never leaves the tab.

The high-level flow looks roughly like this:

import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-large-v3-turbo',
  { device: 'webgpu', dtype: 'fp16' }
);

const result = await transcriber(audioBuffer, {
  // omit `language` and Whisper auto-detects among its 100+ languages
  // (the library rejects unrecognized codes, so don't pass 'auto')
  task: 'transcribe',        // or 'translate' to go straight to English
  return_timestamps: true,   // we need these for SRT/VTT export
  chunk_length_s: 30,        // process long audio in 30-second windows
});

That's the actual core. The other 95% of the codebase is the boring stuff: a chunked file reader so we can handle 200 MB uploads without blowing out the browser's heap, a worker thread so the UI doesn't freeze during inference, an SRT/VTT/DOCX/PDF exporter, an IndexedDB-backed dashboard so transcripts persist across sessions, and a UI translated into 13 languages.
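
The worker piece is worth a sketch, because it's the part most browser-ML tutorials skip. This is a minimal illustration of the pattern, not whisperweb.dev's actual code; the file name, message shape, and renderTranscript helper are all mine:

// whisper-worker.js: inference runs off the main thread so the UI never freezes
import { pipeline } from '@huggingface/transformers';

let transcriber = null;

self.onmessage = async (event) => {
  // Create the pipeline lazily on the first message; after the initial
  // download, the weights load from the browser cache
  transcriber ??= await pipeline(
    'automatic-speech-recognition',
    'onnx-community/whisper-large-v3-turbo',
    { device: 'webgpu', dtype: 'fp16' }
  );
  const result = await transcriber(event.data.audio, {
    return_timestamps: true,
    chunk_length_s: 30,
  });
  self.postMessage(result);
};

// main.js: hand the decoded audio to the worker, render whatever comes back
const worker = new Worker(new URL('./whisper-worker.js', import.meta.url), { type: 'module' });
worker.onmessage = (event) => renderTranscript(event.data); // renderTranscript is a stand-in for your UI code
worker.postMessage({ audio: audioBuffer });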

The model selection took me longer than the architecture. I tried whisper-tiny (fast but mangles anything with an accent), whisper-base (better but still misses technical vocabulary), whisper-small (good balance), and finally whisper-large-v3-turbo, which is what I shipped as the default. On an M2 MacBook Air, large-v3-turbo transcribes a 30-minute audio file in about 90 seconds via WebGPU. On a four-year-old Windows laptop without WebGPU, the same file via WebAssembly takes around 5 minutes. Slower, but still free, still local, and still no signup.

What the product actually does

Here's the concrete shape of the free tier:

  • Upload an audio or video file up to 200 MB or 20 minutes long
  • 100+ source languages, auto-detected if you don't pick one
  • Choose between transcribe (keep original language) or translate (output English)
  • Export as Word (DOCX), PDF, plain text, SRT subtitles, or VTT subtitles
  • No account, no email, no credit card

The Unlimited tier exists because some workloads genuinely don't fit in the browser. If you have a 4-hour board meeting recording or a 3 GB raw camera file, your laptop fan is going to take off and the inference might still time out. So Unlimited ($20/month, or $10/month billed yearly) routes those jobs to a GPU on the server side — up to 10 hours and 5 GB per file, batch upload of 50 files at a time, and cross-device sync via the dashboard. It's the same Whisper model, just running on someone else's machine. I priced it the same as TurboScribe so the comparison is honest.

A rough comparison for the people skimming:

                            Whisper Web Free    Whisper Web Unlimited   TurboScribe Free   TurboScribe Pro
Price                       $0                  $10–20/mo               $0                 $20/mo
Signup required             No                  Yes                     Yes                Yes
Per-file limit              200 MB / 20 min     5 GB / 10 hr            30 min             10 hr
Files per day               Unlimited           Unlimited               3                  Unlimited
Audio leaves device (free)  No                  n/a                     Yes                n/a
SRT / VTT export            Yes                 Yes                     Pro only           Yes

The thing I'm proudest of is the "Audio leaves device" row. On the free tier, your audio never touches a server. That's not a marketing claim, it's an architectural fact: there's no upload endpoint to send it to.

Why I also published a ChatGPT GPT

Building the app was the easy part. Getting people to find it is the hard part, and the discovery surface for utility tools has shifted over the last year. People who used to Google "free transcription app" are increasingly asking ChatGPT instead. So I wrote a GPT and submitted it to the GPT Store: Whisper Web – Free AI Speech-to-Text & Translation.

It just got published in the Productivity category. A few things to be clear about, because the GPT format invites a lot of misconceptions:

The GPT does not transcribe audio inline. ChatGPT's GPT framework can't run Whisper, can't take a file upload and return an SRT, and pretending otherwise would be misleading. What the GPT can do is answer questions ("what's the best way to transcribe a Spanish-language interview to English?", "do I need to install anything?", "can I transcribe a YouTube video?") with web search enabled, then point users to whisperweb.dev when they actually need to run a job.

In other words, the GPT is a discovery + Q&A layer. ChatGPT users find it in the GPT Store, ask it questions, and end up at the web app where the actual work happens. It's also verified as built by whisperweb.dev (OpenAI checks domain ownership for GPTs), so the funnel is honest about who's behind it.

If you're a developer reading this and thinking about whether a GPT is worth publishing for your tool: the answer depends entirely on whether your tool has a question-shaped surface area. "How do I transcribe audio to text?" is a question. "Resize this image" is not. Whisper Web sits clearly in the first bucket, which is why the GPT angle works.

Three things people are actually using it for

Podcast transcripts for show notes. I dropped a 38-minute episode of a tech podcast in. Got the English transcript in about 2 minutes (large-v3-turbo, M2). Pasted into the show notes, lightly edited, done. Total cost: $0. Equivalent on Rev: $9.50.

Foreign-language interviews to English. This is where Whisper genuinely shines and where I think it beats the paid tools. I had a 30-minute Japanese podcast a friend recommended. I selected "translate" instead of "transcribe", uploaded it, and got an English transcript directly — Whisper does the speech-to-English step in one pass rather than transcribing-then-translating. Quality was good enough to read for comprehension. Not publishable, but I understood what they were saying.
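
In code, that's nothing more than a different task flag on the same pipeline call shown earlier; a sketch, reusing that transcriber:

// 'translate' makes Whisper emit English directly from the source-language
// audio in a single decoding pass, instead of transcribing then translating
const english = await transcriber(audioBuffer, {
  task: 'translate',
  return_timestamps: true,
});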

SRT subtitles for videos. Drop in a video file, get back a .srt with timestamps, drag it into Premiere or YouTube Studio. The free SRT export is the feature that gets the most "wait, this is free?" reactions, because most competitors paywall it.
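
For what it's worth, SRT export is mostly string formatting over the timestamped chunks the pipeline returns: with return_timestamps: true, each chunk has the shape { timestamp: [start, end], text }. A rough sketch of a converter, assuming that shape (this isn't the app's actual exporter):

// Convert seconds (e.g. 4.2) to an SRT timestamp ("00:00:04,200")
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor(ms / 60000) % 60).padStart(2, '0');
  const s = String(Math.floor(ms / 1000) % 60).padStart(2, '0');
  return `${h}:${m}:${s},${String(ms % 1000).padStart(3, '0')}`;
}

// Number each chunk and join them in SubRip's cue format
function chunksToSrt(chunks) {
  return chunks
    .map((chunk, i) => {
      const [start, end] = chunk.timestamp;
      // the last chunk's end timestamp can come back null; reuse start as a fallback
      return `${i + 1}\n${toSrtTime(start)} --> ${toSrtTime(end ?? start)}\n${chunk.text.trim()}\n`;
    })
    .join('\n');
}

const srt = chunksToSrt(result.chunks);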

What it doesn't do well

Honest section, because there are real gaps.

No speaker diarization. If you have a 4-person meeting recording and you want "Speaker 1: ... Speaker 2: ...", Whisper Web won't give you that. Whisper itself doesn't do diarization; you need a separate model (pyannote) for that, and I haven't shipped it. Otter and Rev do this and do it well. If you transcribe meetings for a living, those tools are still worth the money.

No live captions. This is a batch tool. You upload a file, you wait, you get a transcript. There's no real-time mode, no streaming-from-mic mode. WebGPU latency makes streaming technically possible, but I haven't built the UI for it yet.

Accuracy degrades on heavy accents and overlapping speech. Whisper is the best open-source ASR by a wide margin, but it's not magic. A clearly-recorded podcast in standard American English transcribes near-perfectly. A noisy phone call in Glaswegian English with two people talking over each other does not. This is true of every tool on the market — they're all running Whisper or something similar — but I want to be straight about it.

WebGPU isn't everywhere yet. On Chrome, Edge, and Arc you're set. On Safari, WebGPU shipped in 18.4 (early 2025) but is still being rolled out. Firefox is behind. The WebAssembly fallback works on every browser, but it's roughly 3–4x slower. If you have a 2017 ThinkPad and Firefox, this app will be usable but not snappy.
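
If you're building something similar, the detection logic is only a few lines. My sketch, not the app's code; the q8 dtype on the WASM path is a judgment call to keep the slower path's memory footprint down:

// Prefer WebGPU when the browser exposes a usable adapter; otherwise fall back to WASM
async function createTranscriber() {
  const adapter = navigator.gpu ? await navigator.gpu.requestAdapter() : null;
  return pipeline(
    'automatic-speech-recognition',
    'onnx-community/whisper-large-v3-turbo',
    adapter
      ? { device: 'webgpu', dtype: 'fp16' }
      : { device: 'wasm', dtype: 'q8' } // quantized weights lighten the slower path
  );
}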

Try it, break it, tell me what's wrong

Here's the deal. The web app is at https://whisperweb.dev — drop a file in, no signup, see what comes out. The ChatGPT GPT is at https://chatgpt.com/g/g-69edda3418a08191999b4de9464bb6ec-whisper-web-free-ai-speech-to-text-translation if you'd rather poke at it from inside ChatGPT. Both are free.

What I actually want from this post: bug reports, accuracy comparisons against your current tool, and feature requests. Speaker diarization is the most-asked-for missing piece and I'm working on it. If there's something else, tell me — there's a contact link in the footer of whisperweb.dev, or just leave a comment on this post.

The bigger lesson I took from building this isn't really about Whisper or WebGPU. It's that a non-trivial chunk of the SaaS economy is built on the fact that until recently, ML models needed servers. That's gradually ceasing to be true. Browsers are quietly turning into inference runtimes, and every tool that's a thin wrapper around an open-weights model is going to feel that. I built one for transcription. I'm betting there are a hundred more waiting to be built for the categories nobody's noticed yet.

If you build one of them, ping me. I'd love to read about it.
