Building DropVox: A Local AI Transcription App for Mac

Tags: dropvox, python, whisper, macos, indie-hacking

The Problem

My wife sends voice messages. Long ones. Two minutes of Portuguese while I'm in a meeting, unable to listen. WhatsApp has no transcription. I needed to know if it was urgent or just a story about her day.

Existing solutions? Upload to some cloud service, wait, hope they don't store my audio forever. No thanks.

I wanted something local, fast, and always available. A weekend later, DropVox was born.

The Spark: A Quick CLI

It started with 20 WhatsApp audio files I needed transcribed. I threw together a Python script using OpenAI's Whisper:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.opus", language="pt")
print(result["text"])

It worked. Beautifully. The base model was surprisingly accurate for Portuguese, and it ran entirely on my MacBook. But opening a terminal every time? Too much friction.
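
For a batch like those 20 files, the same call can be wrapped in a loop. This is a minimal sketch rather than DropVox's actual code: `transcribe_folder` and the `transcribe` callable are hypothetical names, and the extension list is borrowed from the formats mentioned later in this post. Passing the transcriber in as a function keeps the loop independent of Whisper.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed set of audio extensions worth picking up from the folder.
AUDIO_EXTENSIONS = {".opus", ".mp3", ".m4a", ".wav"}


def transcribe_folder(folder: str, transcribe: Callable[[Path], str]) -> Dict[str, str]:
    """Run `transcribe` on every audio file in `folder`, in name order."""
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip directories and anything that is not a known audio format.
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            results[path.name] = transcribe(path).strip()
    return results
```

With Whisper, the callable could be something like `lambda p: model.transcribe(str(p), language="pt")["text"]`, reusing the `model` loaded above.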

The Vision: Menu Bar Always Ready

What if transcription was one click away? A menu bar app that:

  1. Lives in your menu bar, always there
  2. Opens a file picker with one click
  3. Transcribes and copies to clipboard automatically
  4. Shows a notification when done

No terminal. No extra windows. Just click, select, paste.

Day 1: Learning rumps

I discovered rumps, a Python framework for macOS menu bar apps. The API is delightfully simple:

import rumps

class DropVoxApp(rumps.App):
    def __init__(self):
        super().__init__("DropVox")

    @rumps.clicked("Select Audio Files...")
    def select_files(self, _):
        # Handle file selection
        pass

if __name__ == "__main__":
    DropVoxApp().run()

That's it. A working menu bar app in 10 lines.

The Drag-and-Drop Dream (Deferred)

My original vision included a floating drop zone—drag files directly onto it. But implementing native drag-and-drop requires diving into PyObjC and AppKit. For an MVP, a file picker would do.

I used tkinter's file dialog, which works surprisingly well with rumps:

import tkinter as tk
from tkinter import filedialog

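root = tk.Tk()
root.withdraw()  # hide the empty root window; only the dialog should appear
file_paths = filedialog.askopenfilenames(
    title="Select Audio Files",
    filetypes=[("Audio Files", "*.opus *.mp3 *.m4a *.wav")]
)
root.destroy()  # release the hidden root once the dialog closes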

Not as elegant as drag-and-drop, but functional. Ship first, polish later.

Day 2: The Threading Challenge

Whisper transcription takes time, sometimes 30 seconds for a two-minute clip. Run that inside the menu callback and the UI freezes until it finishes, so the work goes on a background thread:

import threading

def select_files(self, _):
    # ... file picker code ...

    thread = threading.Thread(
        target=self._transcribe_files,
        args=(file_paths,),
        daemon=True,
    )
    thread.start()

The menu bar title updates during transcription: "DropVox" → "[1/3]" → "[2/3]" → "DropVox". Simple progress indication without blocking.
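
The worker behind that thread might look roughly like this. It is a sketch under assumptions, not the real `_transcribe_files`: the function names are hypothetical, and both the transcriber and the title setter are passed in as plain callables so the loop can be exercised without Whisper or rumps.

```python
import threading
from typing import Callable, List, Sequence


def transcribe_with_progress(
    file_paths: Sequence[str],
    transcribe: Callable[[str], str],
    set_title: Callable[[str], None],
    idle_title: str = "DropVox",
) -> List[str]:
    """Transcribe files one by one, reporting "[i/n]" before each file."""
    texts: List[str] = []
    total = len(file_paths)
    try:
        for index, path in enumerate(file_paths, start=1):
            set_title(f"[{index}/{total}]")
            texts.append(transcribe(path))
    finally:
        # Restore the idle title even if a transcription raises.
        set_title(idle_title)
    return texts


def start_worker(file_paths, transcribe, set_title) -> threading.Thread:
    """Run the loop on a daemon thread so the caller returns immediately."""
    thread = threading.Thread(
        target=transcribe_with_progress,
        args=(file_paths, transcribe, set_title),
        daemon=True,
    )
    thread.start()
    return thread
```

In the app, `set_title` could be a small function that assigns to the rumps app's `title`. Whether that assignment is safe from a background thread is worth checking for your setup rather than taking from this sketch.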

The Polish: Notifications and Clipboard

macOS notifications via osascript:

import subprocess

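def notify(title: str, message: str):
    # Escape backslashes and quotes so the text can't break the AppleScript string.
    def esc(text: str) -> str:
        return text.replace("\\", "\\\\").replace('"', '\\"')

    script = f'display notification "{esc(message)}" with title "{esc(title)}"'
    subprocess.run(["osascript", "-e", script])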

Clipboard via pyperclip:

import pyperclip
pyperclip.copy(transcription_text)

Now the workflow is: click → select → wait → paste. Four steps, zero terminal commands.
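
When several files are selected, the transcripts have to end up as one clipboard string. One possible way to merge them is sketched below. The helper name `format_for_clipboard` and the bracketed file-name labels are assumptions for illustration, not necessarily how DropVox formats its output.

```python
from typing import Sequence, Tuple


def format_for_clipboard(results: Sequence[Tuple[str, str]]) -> str:
    """Join (file name, transcript) pairs into one pasteable string."""
    # Drop empty transcripts and trim stray whitespace around the rest.
    cleaned = [(name, text.strip()) for name, text in results if text.strip()]
    if len(cleaned) == 1:
        # A single file needs no label; paste just the text.
        return cleaned[0][1]
    return "\n\n".join(f"[{name}]\n{text}" for name, text in cleaned)
```

The result can be handed straight to `pyperclip.copy(...)`.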

Model Selection

Whisper offers models from tiny to large. I added a submenu to switch between them:

Model    Speed     Accuracy   Parameters
tiny     Fastest   Good       39M
base     Fast      Better     74M
small    Medium    Good       244M
medium   Slow      Great      769M
large    Slowest   Best       1550M

For WhatsApp messages, base is the sweet spot. Quick enough to not feel slow, accurate enough for everyday messages.
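
Switching models from a submenu means remembering the current choice and reloading only when it changes, since loading a model is slow. This is a minimal sketch of that bookkeeping, with a hypothetical `ModelSelector` class and the loader passed in as a callable. It leaves out the menu wiring itself.

```python
from typing import Any, Callable, Optional

MODEL_NAMES = ("tiny", "base", "small", "medium", "large")


class ModelSelector:
    """Track the chosen model name and load it lazily, once per change."""

    def __init__(self, load: Callable[[str], Any], default: str = "base"):
        self._load = load
        self._name = default
        self._model: Any = None
        self._loaded_name: Optional[str] = None

    @property
    def name(self) -> str:
        return self._name

    def select(self, name: str) -> None:
        """Record a new choice; the actual load is deferred until get()."""
        if name not in MODEL_NAMES:
            raise ValueError(f"unknown model: {name}")
        self._name = name

    def get(self) -> Any:
        """Return the model for the current choice, loading it if needed."""
        if self._loaded_name != self._name:
            self._model = self._load(self._name)
            self._loaded_name = self._name
        return self._model
```

With Whisper this would be constructed as `ModelSelector(whisper.load_model)`, and each submenu item's callback would call `select(...)` with its model name.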

Going Further: A Real Product

What started as a weekend project evolved into something more. I built a dedicated website at dropvox.app with:

  • PostHog Analytics - Tracking downloads and user engagement to understand what's working
  • A Blog - SEO-focused content about audio transcription, privacy, and Whisper
  • Full SEO - Structured data, sitemap, optimized meta tags for discoverability
  • i18n Support - English and Portuguese for my primary audiences

The site itself is a Next.js 16 app with Tailwind CSS v4, deployed on Vercel. Clean, fast, and informative.

What I Learned

  1. rumps is perfect for simple menu bar apps - If you need more, look at PyObjC or Swift.

  2. tkinter file dialogs work everywhere - They're cross-platform and play nicely with other frameworks.

  3. Threading is essential for any long-running processing - Never block the UI thread.

  4. Whisper is remarkable - Running a speech-to-text model locally on a laptop, with this accuracy, felt like magic.

  5. MVPs should be embarrassingly simple - File picker instead of drag-and-drop. Ship it.

  6. Analytics matter - Even for side projects, understanding user behavior helps prioritize features.

What's Next

DropVox works. I use it daily. But the roadmap is tempting:

  • Native drag-and-drop - A floating window you can drop files onto
  • Swift rewrite - Better performance, true native feel
  • Share Extension - Transcribe directly from WhatsApp's share menu
  • App Store - Maybe, if there's demand

For now, it's open source: github.com/helsky-labs/dropvox

Try It

Download the latest release from dropvox.app or build from source:

git clone https://github.com/helsky-labs/dropvox.git
cd dropvox
python3 -m venv venv
source venv/bin/activate
pip install -e .
dropvox

The first run downloads the Whisper model (~140MB for base). After that, the model loads from the local cache, with no further download.


Building something? I'm always happy to chat. Find me on GitHub.