TTS With Piper & WASM!

Let me be real with you. You're probably not going to read this article. Simply because reading long-form content on the internet in 2026 is a lost art. People scroll. People skim. People screenshot the title, post it on socials, and form a whole opinion based on vibes. The attention economy won, and reading lost.

I'm kinda judging but also I'm a "hey Google, read me this article" guy. I'll be making dinner, doing the dishes, walking around, and I just want someone - or something - to read stuff to me. I consume 90% of my long-form content through my ears. If your article doesn't have audio, there's a good chance I'll save it, forget about it, and discover it six months later in a browser tab graveyard alongside 200 other pages I "meant to get to."

So when I looked at my own blog and realized I was asking people to do the one thing they demonstrably do not do anymore - read - I figured I should probably do something about it.

The Dream

Here's the thing. I've had this idea for years. Not a vague "it'd be cool to have TTS" idea - I mean a specific, fully imagined custom audio player that I've been designing in my head every time I used some garbage text-to-speech implementation on someone else's site. Every clunky play button, every full-page reload that kills your audio, every robotic voice that mispronounces basic words - I was sitting there thinking "if I ever build this, here's exactly how I'd do it."

And then I did it. And honestly? This was the most fun side project I've had in a while. Mini dev bucket list item: ✅ crossed out.

The requirements were clear:

Runs entirely on the client side - no server, no API calls, no third-party services listening to what my readers are doing
Actually sounds better than the accessibility nonsense you get from the browser's built-in speech API
Lightweight enough to not murder people's phones
Has a proper streaming player with chunk navigation, not just a single play button that dumps 20 minutes of audio at you
Works with my existing htmx-powered SPA navigation without breaking
Doesn't do anything until you actually click play

Why Not Just Use the Browser's Built-In TTS?

Every browser ships with a speechSynthesis API. It's part of the Web Speech API standard. You'd think this would be the obvious answer - zero dependencies, zero downloads, works everywhere. And technically, it does work everywhere. The problem is that it sounds terrible everywhere differently.

The quality across browsers and platforms is inconsistent, the voice selection is unpredictable, and most importantly - it just doesn't sound like something you'd choose to listen to for 10 minutes. It sounds like something you'd tolerate if you had no other option. In 2026 with all the progress there is in this area, I think these browser APIs are begging for an update...

I kept it as a fallback for low-memory devices (more on that later), but it was never going to be the main engine.

Enter Piper: Neural TTS in WebAssembly

I wanted to learn about WebAssembly anyway, so this was the perfect excuse. The answer turned out to be Piper - a fast, local neural text-to-speech engine originally built for the Home Assistant ecosystem. It uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) and runs inference through ONNX Runtime.

Someone ported it to the browser. piper-tts-web compiles the whole pipeline - phonemizer, ONNX inference, everything - into WebAssembly that runs directly in the browser. No server. No API key. No "please sign up for our free tier that's free until it isn't." Just WASM and a ~60MB voice model downloaded from HuggingFace and cached by the browser.

I evaluated other options. Kokoro was promising, and actually more expressive, but it only worked reliably at fp32 precision, which meant a ~300MB model download. At quantized precisions (q4, q8, fp16) it just... didn't work. Piper's en_US-hfc_female-medium voice clocks in at ~60MB, sounds decent enough, and is much better than the built-in speechSynthesis API.

The Architecture (It's Simpler Than You Think)

Here's the whole flow:

User clicks play
  > Check device memory (< 2GB? >> fall back to browser TTS)
  > Lazy-load Piper engine (~90MB WASM + phonemizer)
  > Extract article text from DOM
  > Split into chunks
  > Generate audio chunk-by-chunk (streaming)
  > Play with lookahead buffer (generate 3 ahead)
  > Evict old chunks to save memory
  > Auto-unload engine after 20s idle

Text Extraction

The player grabs the article content from the .prose[data-article-tts] container and walks through its children. It skips code blocks, images, buttons, form elements, and horizontal rules - nobody wants to hear console.log("hello world") read aloud. Lists get each item turned into a sentence. Tables get read row by row with cells joined by commas. Everything gets punctuation normalization so the voice engine pauses naturally.
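Here's a rough sketch of that extraction pass in TypeScript. It's not the player's actual code: the NodeLike type, the tag list, and the function names are all assumptions made up for illustration. It works on a minimal structural type instead of the real DOM so it can run anywhere.

```typescript
// Hypothetical structural type for this sketch; a real DOM Element should also satisfy it.
interface NodeLike {
  tagName: string;
  textContent: string | null;
  children: ArrayLike<NodeLike>;
}

// Top-level elements that are never read aloud (an assumed tag list, not the real one).
const SKIPPED_TAGS = new Set(["PRE", "IMG", "BUTTON", "FORM", "INPUT", "SELECT", "TEXTAREA", "HR"]);

// Collapse whitespace and make sure the text ends with punctuation so the voice pauses.
function normalizeForSpeech(text: string | null): string {
  const collapsed = (text ?? "").replace(/\s+/g, " ").trim();
  if (collapsed === "") return "";
  return /[.!?:;]$/.test(collapsed) ? collapsed : collapsed + ".";
}

// Depth-first search for descendants with a given tag (used to find table rows).
function findByTag(node: NodeLike, tag: string): NodeLike[] {
  const found: NodeLike[] = [];
  for (const child of Array.from(node.children)) {
    if (child.tagName.toUpperCase() === tag) found.push(child);
    else found.push(...findByTag(child, tag));
  }
  return found;
}

// Walk the container's children and return one speakable string per block.
function extractSpeakable(root: NodeLike): string[] {
  const out: string[] = [];
  const add = (text: string | null): void => {
    const sentence = normalizeForSpeech(text);
    if (sentence !== "") out.push(sentence);
  };
  for (const el of Array.from(root.children)) {
    const tag = el.tagName.toUpperCase();
    if (SKIPPED_TAGS.has(tag)) continue;
    if (tag === "UL" || tag === "OL") {
      // Each list item becomes its own sentence.
      for (const item of Array.from(el.children)) add(item.textContent);
    } else if (tag === "TABLE") {
      // Tables are read row by row, cells joined by commas.
      for (const row of findByTag(el, "TR")) {
        const cells = Array.from(row.children).map((c) => (c.textContent ?? "").trim());
        add(cells.filter((c) => c !== "").join(", "));
      }
    } else {
      add(el.textContent);
    }
  }
  return out;
}
```

In a browser you would presumably pass it the element matched by .prose[data-article-tts]. Note that this sketch only filters top-level children, so anything nested inside a paragraph still gets read.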

Small adjacent elements get merged into chunks until they hit a minimum length of 40 characters. This prevents the engine from generating hundreds of tiny audio blips for short paragraphs.
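The merging step can be sketched as a small pure function. The 40-character minimum comes from the description above; the function name and the way a too-short leftover gets glued onto the last chunk are assumptions, not necessarily what the real player does.

```typescript
// Minimum chunk length in characters (the value mentioned in the article).
const MIN_CHUNK_CHARS = 40;

// Greedily merge adjacent text pieces until each chunk reaches the minimum length.
function mergeShortChunks(pieces: string[], minLength: number = MIN_CHUNK_CHARS): string[] {
  const chunks: string[] = [];
  let buffer = "";
  for (const piece of pieces) {
    buffer = buffer === "" ? piece : `${buffer} ${piece}`;
    if (buffer.length >= minLength) {
      chunks.push(buffer);
      buffer = "";
    }
  }
  if (buffer !== "") {
    // Assumption: a too-short remainder is appended to the previous chunk,
    // or emitted on its own when it is the only text there is.
    const last = chunks.pop();
    chunks.push(last === undefined ? buffer : `${last} ${buffer}`);
  }
  return chunks;
}
```

So a heading followed by a one-line paragraph ends up as a single chunk instead of two tiny audio blips.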

Streaming Generation

This is where it gets interesting. The ONNX inference runs inside a Web Worker so it doesn't block the main thread - you can keep scrolling, clicking, and reading (hah) while audio generates in the background.

The engine generates one chunk at a time (the ONNX web worker doesn't support concurrent generate() calls - newer calls steal the callback from in-flight ones, which I learned the hard way). But it uses a lookahead strategy: while you're listening to chunk N, it's already generating chunks N+1 through N+3. By the time a chunk finishes playing, the next one is almost always ready.
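A sketch of those two ideas, one-at-a-time generation plus lookahead, might look like this. Both helper names are hypothetical, and generate here is whatever async function turns text into audio; nothing below is the actual piper-tts-web API.

```typescript
// How many chunks to prepare beyond the one currently playing (value from the article).
const LOOKAHEAD = 3;

// Decide which chunk indices still need generating, in playback order.
function planGeneration(
  current: number,
  total: number,
  ready: ReadonlySet<number>,
  lookahead: number = LOOKAHEAD,
): number[] {
  const plan: number[] = [];
  const last = Math.min(total - 1, current + lookahead);
  for (let i = current; i <= last; i++) {
    if (!ready.has(i)) plan.push(i);
  }
  return plan;
}

// Wrap an async generate() so calls run strictly one after another,
// which avoids overlapping calls stealing each other's callbacks.
function serialize<T>(generate: (text: string) => Promise<T>): (text: string) => Promise<T> {
  let tail: Promise<unknown> = Promise.resolve();
  return (text: string): Promise<T> => {
    const run = tail.then(() => generate(text));
    // Keep the queue alive even if one generation fails.
    tail = run.catch(() => undefined);
    return run;
  };
}
```

The planner is pure, so it can be re-run on every chunk change or skip; the serialized wrapper just makes sure the queue it produces never overlaps inside the worker.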

Each generated chunk is an uncompressed WAV blob. These are large - easily 2MB+ each. On a long article with 30+ chunks, you'd be holding 60MB+ of audio blobs in memory if you kept them all. So there's a sliding window eviction policy: only keep 1 chunk behind and 3 ahead. Everything else gets revoked. If the user skips backward to an evicted chunk, it regenerates on-demand (the engine is already warm, so it's fast).
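The eviction window could be sketched like this. The cache shape (a Map from chunk index to blob URL) and the function name are assumptions; the revoke callback is injected so the logic stays testable outside a browser.

```typescript
// Drop every cached chunk outside [current - behind, current + ahead].
// Returns the evicted indices so the caller can mark them for regeneration.
function evictOutsideWindow(
  urls: Map<number, string>,
  current: number,
  revoke: (url: string) => void,
  behind: number = 1,
  ahead: number = 3,
): number[] {
  const evicted: number[] = [];
  urls.forEach((url, index) => {
    if (index < current - behind || index > current + ahead) {
      revoke(url); // release the blob's memory
      evicted.push(index);
    }
  });
  for (const index of evicted) urls.delete(index);
  return evicted;
}
```

In a browser the callback would be something like (u) => URL.revokeObjectURL(u). With 1 behind, the current chunk, and 3 ahead, at most 5 chunks are alive at once, so at roughly 2MB each that's around 10MB instead of 60MB+.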

The Low-Memory Fallback

Not everyone has a device that can casually load 90MB of WASM and a 60MB voice model. The navigator.deviceMemory API tells you (roughly) how much RAM the device has - it returns rounded values like 0.25, 0.5, 1, 2, 4, 8 GB.

If the device reports less than 2GB, the player seamlessly switches to the browser's built-in speechSynthesis API. Same UI, same chunk bars, same controls - just a different engine under the hood. It's not as pretty-sounding, but it's something, and it means every device gets TTS regardless of hardware.

Firefox and Safari don't support navigator.deviceMemory (it returns undefined), so those get the full Piper experience - the assumption being that if you're on a desktop browser, you probably have enough RAM.
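The decision itself boils down to a few lines. This is a sketch under the rules described above; the Engine type and function name are made up for illustration.

```typescript
type Engine = "piper" | "browser";

// Below this many GB of reported memory, fall back to speechSynthesis (threshold from the article).
const MIN_PIPER_MEMORY_GB = 2;

// deviceMemory is undefined where the API is unsupported (e.g. Firefox, Safari).
function chooseEngine(deviceMemory: number | undefined): Engine {
  if (deviceMemory === undefined) return "piper"; // unknown: assume enough RAM
  return deviceMemory < MIN_PIPER_MEMORY_GB ? "browser" : "piper";
}
```

In the page you would feed it (navigator as Navigator & { deviceMemory?: number }).deviceMemory, since the property may be missing from the standard TypeScript DOM typings.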

A Custom Player

This is the part I've been imagining for years. Here's what I wanted and what I built:

Idle state: A clean, minimal bar. Play button, "Listen to this article" label, Ctrl+Space shortcut hint. Nothing loads until you click. No 90MB payload downloaded on every page visit - that would be insane.

Active state: The player transforms into a full control bar:

Play/pause button (with a spinner during generation)
Status text - "Generating first section...", "Playing...", "Paused"
Prev/next skip buttons for chunk navigation
A chunk bar carousel - this is the bit I'm most proud of. Three horizontal bars visible at a time, representing previous/current/next chunks. They scroll smoothly as audio plays, the current chunk is highlighted, played chunks fade, and upcoming chunks are darker. It gives you a visual sense of progress without a traditional progress bar
Chunk counter (e.g., "5/23")
Close button to kill the player entirely
Media Session API integration - lock screen controls work on mobile. Play, pause, previous, next - all mapped to the hardware media keys
Keyboard shortcut: Ctrl+Space toggles play/pause globally
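For the Media Session part, the wiring can be sketched as below. The setActionHandler method and the four action names are the standard Media Session API; the PlayerControls interface and bindMediaSession are hypothetical names for this example, and the session is passed in rather than read from navigator so the snippet runs outside a browser.

```typescript
type TransportAction = "play" | "pause" | "previoustrack" | "nexttrack";

// Structural subset of the browser's MediaSession used here.
interface MediaSessionLike {
  setActionHandler(action: TransportAction, handler: (() => void) | null): void;
}

// Hypothetical player surface; the real player surely has more on it.
interface PlayerControls {
  play(): void;
  pause(): void;
  prevChunk(): void;
  nextChunk(): void;
}

// Map lock screen / hardware media keys onto chunk navigation.
function bindMediaSession(session: MediaSessionLike, player: PlayerControls): void {
  session.setActionHandler("play", () => player.play());
  session.setActionHandler("pause", () => player.pause());
  session.setActionHandler("previoustrack", () => player.prevChunk());
  session.setActionHandler("nexttrack", () => player.nextChunk());
}
```

In the browser the call would be along the lines of bindMediaSession(navigator.mediaSession, player), guarded by a check that "mediaSession" in navigator is true.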

Playback at 0.9x was a deliberate choice: it sounds slightly more natural. Full-speed AI voices feel like an auction.

After 20 seconds of being paused, the engine auto-unloads to free up memory. If you come back and hit play, it'll reload - but that's better than holding 150MB+ of WASM and model data in memory while you've moved on to doomscrolling.
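The idle unload is basically a resettable timer. A minimal sketch, assuming the player calls onPause() and onPlay() at the right moments; the class name and the unload callback are hypothetical.

```typescript
// Unloads the engine after a period of being paused; resuming cancels the countdown.
class IdleUnloader {
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly unload: () => void;
  private readonly delayMs: number;

  // delayMs defaults to the 20 seconds mentioned in the article.
  constructor(unload: () => void, delayMs: number = 20000) {
    this.unload = unload;
    this.delayMs = delayMs;
  }

  // Start (or restart) the countdown when playback pauses.
  onPause(): void {
    this.cancel();
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.unload();
    }, this.delayMs);
  }

  // Playback resumed: keep the engine loaded.
  onPlay(): void {
    this.cancel();
  }

  private cancel(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
  }
}
```

The unload callback is where the worker would be terminated and the cached blobs revoked, so the next play starts from a cold engine.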

The Fun Part: What Happens When You Self-Host AI TTS

Since the whole thing runs locally and there's no rate limit, no API cost, and no usage tracking... you can throw absolutely anything at it. And because the neural voice model is trained to sound natural and expressive, it handles some very entertaining inputs.

Yaaas! Slaaay! This is fine. Everything is fine. I am not stressed at all. Bruh.

I spent an embarrassing amount of time feeding it progressively unhinged sentences just to hear what would happen. Sorry not sorry.

What I Learned

WebAssembly is genuinely impressive for this kind of thing. Running a full neural TTS pipeline in the browser - phonemization via espeak-ng compiled to WASM, ONNX inference with SIMD and threading - and having it actually be fast enough for real-time streaming playback? That's hella cool. A few years ago this would have required a server with a GPU.

The browser platform still has rough edges though. speechSynthesis is an afterthought. navigator.deviceMemory isn't universally supported. Web Workers have limitations around module paths and ES imports. CSP policies for WASM + blob URLs require careful configuration. But none of it was insurmountable, and the result is a fully self-contained, privacy-respecting TTS system that runs on the edge with zero backend infrastructure.

If you're the kind of person who'd rather listen than read - first of all, same - and second, the play button is right there at the top of every article. Give it a go! Let the cringy AI do the reading for you!

Overall it's really not perfect, but I for one like it. And down the line I already have a few ideas for significantly improving the quality and cutting the resource requirements.

If you want to do something similar for your website, hit me up, I'd be glad to share the implementation.

Live long and prosper. 🖖👽