I've been working with more Spanish-speaking folks lately, and I wanted live subtitles without routing audio through a cloud service. So I built a hack.

translation-overlay captures system audio from PipeWire, pipes it through a local translation model, and renders the output as floating subtitles on top of all windows.

System Audio → PipeWire capture → ML translation engine → Subtitle overlay

It’s two Python scripts duct-taped together with a shell wrapper. caption_engine.py grabs audio from your default PipeWire sink monitor via pw-record, runs it through one of three translation engines, and writes text lines to stdout. subtitle_overlay.py reads those lines and renders them as a transparent, always-on-top Qt overlay with typewriter reveal and smooth scrolling.
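The handoff between the two scripts is just lines of text over a pipe. Here is a minimal sketch of that idea, not the code from the repo: `emit_caption` and `read_captions` are hypothetical names, and the one-caption-per-line convention is an assumption based on the description above. The detail worth showing is the flush: when stdout is a pipe rather than a terminal, Python block-buffers it, so captions would otherwise arrive in late bursts.

```python
import io
import sys
from typing import Iterator, TextIO


def emit_caption(text: str, out: TextIO = sys.stdout) -> None:
    """Engine side (hypothetical helper): write one caption per line."""
    # Collapse internal newlines and runs of whitespace so that
    # one caption always maps to exactly one line on the pipe.
    line = " ".join(text.split())
    if line:
        out.write(line + "\n")
        out.flush()  # push the line through the pipe right away


def read_captions(stream: TextIO = sys.stdin) -> Iterator[str]:
    """Overlay side (hypothetical helper): yield captions until the pipe closes."""
    for raw in stream:
        line = raw.strip()
        if line:
            yield line
```

In a real pipeline the engine would call `emit_caption(text)` after each translated chunk, and the overlay would loop over `read_captions()` on its stdin. An `io.StringIO` works as a stand-in for the pipe when trying this out.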

Engines

| Engine | Model | Notes |
|---|---|---|
| seamless | Meta SeamlessM4T v2 | Default. ~4GB VRAM. |
| whisper | faster-whisper large-v3 | Fast, well-tested. |
| canary | NVIDIA Canary 1B v2 | Requires NeMo toolkit. |
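One way to wire up an engine choice like this is a small registry plus `argparse`. This is a sketch under assumptions, not the repo's actual CLI: the `--engine` flag name, the `EngineSpec` type, and the `es` default for `--language` are all made up for illustration. Only the three engine names, their models, and the existence of a `--language` flag come from this post.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineSpec:
    """Hypothetical record describing one translation engine."""
    model: str
    notes: str


# Engine names and models as listed in the post.
ENGINES = {
    "seamless": EngineSpec("Meta SeamlessM4T v2", "Default. ~4GB VRAM."),
    "whisper": EngineSpec("faster-whisper large-v3", "Fast, well-tested."),
    "canary": EngineSpec("NVIDIA Canary 1B v2", "Requires NeMo toolkit."),
}
DEFAULT_ENGINE = "seamless"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Caption engine options (sketch)")
    # --engine is an assumed flag name; unknown engines are rejected up front.
    parser.add_argument("--engine", choices=sorted(ENGINES), default=DEFAULT_ENGINE)
    # --language is mentioned in the post; the "es" default is an assumption.
    parser.add_argument("--language", default="es", help="source language code")
    return parser.parse_args(argv)
```

Calling `parse_args(["--engine", "whisper"])` and then looking up `ENGINES[args.engine]` gives the spec to load. Keeping the table in one dict means adding a fourth engine is a one-line change.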

Running it

git clone https://github.com/jt55401/translation-overlay.git
cd translation-overlay
python3 -m venv .venv && source .venv/bin/activate
pip install torch torchaudio transformers numpy PyQt6
./start-captions.sh

That’s it. Ctrl+C to stop.

The overlay handles the usual annoyances — duplicate suppression, hallucination filtering (ML models love to emit “Thank you” and “Subscribe” into silence), and fade-out of old lines. It’s hacky, it assumes CUDA, and it’s tuned for Spanish-to-English, but the --language flag takes any source language code the model supports.
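To make those three behaviors concrete, here is a rough sketch of how they could work. It is an illustration, not the overlay's real implementation: the `CaptionFilter` class, the five-line memory, the blocklist contents beyond the two phrases named above, and the hold/fade timings are all assumptions.

```python
import re
from collections import deque

# Phrases the post names as typical silence hallucinations;
# a real blocklist is presumably longer.
HALLUCINATIONS = {"thank you", "subscribe"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    stripped = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(stripped.split())


class CaptionFilter:
    """Hypothetical filter: rejects blank, blocklisted, and recently seen lines."""

    def __init__(self, memory: int = 5):
        # Remember only the last few accepted lines (assumed window size).
        self._recent = deque(maxlen=memory)

    def accept(self, text: str) -> bool:
        key = normalize(text)
        if not key or key in HALLUCINATIONS or key in self._recent:
            return False
        self._recent.append(key)
        return True


def opacity(age_s: float, hold_s: float = 4.0, fade_s: float = 2.0) -> float:
    """Full opacity for hold_s seconds, then a linear fade to 0 over fade_s."""
    if age_s <= hold_s:
        return 1.0
    return max(0.0, 1.0 - (age_s - hold_s) / fade_s)
```

The obvious tradeoff with an exact-match blocklist is that a speaker who really does say only "Thank you" gets dropped too. Matching on the whole normalized line, rather than on substrings, at least keeps longer sentences that happen to contain those words.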

Code is on GitHub. MIT licensed.