WhisperKit on macOS: Integrating On-Device ML in SwiftUI
Why This Post Exists
I wrote about shipping DropVox 1.0 a few weeks ago. That post covered the full journey from Python prototype to native Swift app. This post zooms into the single most important technical decision in the project: choosing and integrating WhisperKit for on-device speech transcription.
If you are building a macOS or iOS app that needs speech-to-text and you care about privacy, latency, or offline capability, this is the practical guide I wish I had when I started.
What WhisperKit Is
WhisperKit is a Swift package built by Argmax that runs OpenAI's Whisper speech recognition models on Apple hardware. The key difference from running Whisper in Python or via whisper.cpp is that WhisperKit compiles the models to CoreML format and executes them on Apple's Neural Engine.
This matters because the Neural Engine is a dedicated hardware accelerator that sits alongside the CPU and GPU on Apple Silicon chips. It is purpose-built for machine learning inference. When WhisperKit routes computation through the Neural Engine, it consumes less power and completes faster than either CPU or GPU execution paths for the same model.
The Swift API is clean and well-designed:
import WhisperKit
let whisperKit = try await WhisperKit(
    model: "openai_whisper-base",
    computeOptions: .init(audioEncoderCompute: .cpuAndNeuralEngine)
)
let result = try await whisperKit.transcribe(audioPath: fileURL.path)
let transcription = result.map { $0.text }.joined(separator: " ")
That is the core of it. Initialize with a model name and compute options, call transcribe, get text back. The simplicity of the API hides a lot of complexity underneath.
Why WhisperKit Over whisper.cpp
This was not an obvious choice. whisper.cpp is mature, widely used, and available on every platform. I considered it seriously before choosing WhisperKit. Here is the reasoning:
CoreML optimization. whisper.cpp runs on CPU with optional Metal acceleration. WhisperKit runs on the Neural Engine via CoreML. On Apple Silicon, the Neural Engine is significantly more power-efficient for ML workloads. For a menu bar app that users expect to have negligible battery impact, this efficiency gap matters.
Native Swift API. whisper.cpp requires C++ bridging headers, manual memory management for the model context, and callback-based APIs that do not play well with Swift's async/await. WhisperKit is Swift-native. It uses structured concurrency. It integrates with SwiftUI's state management patterns naturally.
Apple Silicon hardware acceleration. WhisperKit's CoreML backend can split computation across the Neural Engine and CPU simultaneously. The cpuAndNeuralEngine compute option allows the encoder and decoder to use whichever hardware is optimal for each operation. whisper.cpp cannot do this.
SPM integration. WhisperKit 0.9.0+ ships as a standard Swift Package Manager dependency. One line in Package.swift or a URL in Xcode's package manager. No Makefiles, no manual compilation of C++ source trees, no bridging headers to maintain.
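For reference, the dependency declaration looks roughly like this. The package name, platform target, and product wiring below are illustrative placeholders, and the exact-version pin echoes the versioning advice at the end of this post:
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "MyTranscriptionApp",          // placeholder name
    platforms: [.macOS(.v14)],           // adjust to your deployment target
    dependencies: [
        // Pin an exact version until the API stabilizes at 1.0
        .package(url: "https://github.com/argmaxinc/WhisperKit", exact: "0.9.0")
    ],
    targets: [
        .executableTarget(
            name: "MyTranscriptionApp",
            dependencies: [.product(name: "WhisperKit", package: "WhisperKit")]
        )
    ]
)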
The trade-off is platform lock-in. WhisperKit only runs on Apple platforms. If I needed cross-platform support, whisper.cpp would be the only reasonable choice. For a macOS-only app, WhisperKit is the better tool.
Model Management
WhisperKit supports five model sizes, each a trade-off between accuracy, speed, and resource consumption:
| Model | Disk Size | RAM (Loaded) | Transcription Speed (M1) | Accuracy |
|---|---|---|---|---|
| Tiny | ~75 MB | ~90 MB | ~15x real-time | Acceptable for clear speech |
| Base | ~150 MB | ~180 MB | ~10x real-time | Good for most use cases |
| Small | ~500 MB | ~600 MB | ~5x real-time | High accuracy |
| Medium | ~1.5 GB | ~1.8 GB | ~2x real-time | Near-professional |
| Large | ~3 GB | ~3.5 GB | ~1x real-time | Best available |
"Faster than real-time" means a 60-second audio file transcribes in less than 60 seconds. The Base model on an M1 MacBook Air processes a one-minute recording in about 6 seconds. That is fast enough that users perceive it as nearly instant.
On-Demand Download
DropVox does not bundle any models in the app. Models are downloaded on demand when the user selects them. This keeps the initial download small (the app itself is under 20 MB) and lets users choose the size/accuracy trade-off that works for their hardware.
The download flow:
actor ModelManager {
    private let modelDirectory: URL

    init() {
        let appSupport = FileManager.default.urls(
            for: .applicationSupportDirectory,
            in: .userDomainMask
        ).first!
        modelDirectory = appSupport
            .appendingPathComponent("DropVox")
            .appendingPathComponent("Models")
    }

    func downloadModel(_ model: WhisperModel) async throws {
        let modelPath = modelDirectory.appendingPathComponent(model.identifier)
        if FileManager.default.fileExists(atPath: modelPath.path) {
            return // Already downloaded
        }
        try FileManager.default.createDirectory(
            at: modelDirectory,
            withIntermediateDirectories: true
        )
        // WhisperKit handles the actual download from Hugging Face
        _ = try await WhisperKit(
            model: model.identifier,
            modelFolder: modelDirectory.path,
            download: true
        )
    }

    func isModelAvailable(_ model: WhisperModel) -> Bool {
        let modelPath = modelDirectory.appendingPathComponent(model.identifier)
        return FileManager.default.fileExists(atPath: modelPath.path)
    }
}
Models are stored in the app's Application Support directory. They persist across app updates and are not included in Time Machine backups (I set the appropriate resource flag). A user who downloads the Base model gets it once and never waits again.
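The "appropriate resource flag" is the backup-exclusion resource value. A minimal sketch, assuming the directory already exists on disk; the helper name is mine:
import Foundation

// Exclude the models directory from Time Machine; call after the directory is created.
// Models can always be re-downloaded, so there is no reason to back them up.
func excludeFromBackups(_ directory: URL) throws {
    var url = directory
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try url.setResourceValues(values)
}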
Error Handling for Downloads
Model downloads fail more often than you would expect. Network interruptions, insufficient disk space, corrupted partial downloads. I handle each case explicitly:
func downloadModel(_ model: WhisperModel) async throws {
    // Check disk space before downloading
    let requiredBytes = model.estimatedSize
    let availableBytes = try FileManager.default
        .attributesOfFileSystem(forPath: modelDirectory.path)[.systemFreeSize] as? Int64 ?? 0
    guard availableBytes > requiredBytes * 2 else {
        throw ModelError.insufficientDiskSpace(
            required: requiredBytes,
            available: availableBytes
        )
    }

    do {
        try await performDownload(model)
    } catch {
        // Clean up partial download
        let modelPath = modelDirectory.appendingPathComponent(model.identifier)
        try? FileManager.default.removeItem(at: modelPath)
        throw ModelError.downloadFailed(underlying: error)
    }
}
The requiredBytes * 2 check is intentional. WhisperKit extracts and converts the model during download, temporarily needing roughly double the final disk space. I learned this the hard way when a user with a nearly full disk reported a cryptic crash.
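For completeness, a sketch of the ModelError type those throws refer to; DropVox's actual cases and wording may differ:
import Foundation

// Errors surfaced by the model download flow above; messages are illustrative.
enum ModelError: LocalizedError {
    case insufficientDiskSpace(required: Int64, available: Int64)
    case downloadFailed(underlying: Error)

    var errorDescription: String? {
        switch self {
        case .insufficientDiskSpace(let required, let available):
            let formatter = ByteCountFormatter()
            return "Not enough free disk space: about \(formatter.string(fromByteCount: required * 2)) " +
                "is needed during download, but only \(formatter.string(fromByteCount: available)) is available."
        case .downloadFailed(let underlying):
            return "Model download failed: \(underlying.localizedDescription)"
        }
    }
}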
Loading Models Into Memory
Downloading a model and loading it are separate operations. A downloaded model sits on disk as a set of CoreML files. Loading it into memory means initializing the neural network weights on the Neural Engine or CPU. This takes time.
actor TranscriptionEngine {
    private var whisperKit: WhisperKit?
    private var currentModel: WhisperModel?
    private var isLoading = false

    func loadModel(_ model: WhisperModel, modelFolder: String) async throws {
        guard !isLoading else {
            throw TranscriptionError.modelCurrentlyLoading
        }
        isLoading = true
        defer { isLoading = false }

        // Unload previous model to free memory
        whisperKit = nil
        currentModel = nil

        whisperKit = try await WhisperKit(
            model: model.identifier,
            modelFolder: modelFolder,
            computeOptions: .init(audioEncoderCompute: .cpuAndNeuralEngine)
        )
        currentModel = model
    }
}
Loading times on an M1 MacBook Air:
- Tiny: ~1 second
- Base: ~2-3 seconds
- Small: ~4-5 seconds
- Medium: ~6-8 seconds
- Large: ~8-10 seconds
These load times are long enough to matter. For a menu bar app where users expect near-instant response, I pre-load the Base model at app launch. The app shows a loading indicator in the menu bar for those 2-3 seconds, then switches to a ready state. Larger models are loaded on demand when the user selects them in settings.
The whisperKit = nil line before loading a new model is critical. Without it, you briefly hold two models in memory simultaneously. For a user switching from Medium (~1.8 GB) to Large (~3.5 GB), that is over 5 GB of RAM for a few seconds. On a MacBook Air with 8 GB total, that triggers memory pressure warnings and visible system slowdown.
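A sketch of what that launch-time pre-load can look like in a SwiftUI menu bar app, reusing the TranscriptionEngine and WhisperModel types from above. The app and model-object names here are illustrative, not DropVox's actual structure, and the sketch assumes the Base model is already on disk:
import SwiftUI

@MainActor
final class AppModel: ObservableObject {
    @Published var isReady = false
    private let engine = TranscriptionEngine()

    init() {
        // Kick off the Base model pre-load as soon as the app starts, so the
        // first transcription does not pay the 2-3 second load cost.
        // Assumes the Base model was already downloaded by ModelManager above.
        Task {
            let modelsFolder = FileManager.default
                .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
                .appendingPathComponent("DropVox/Models")
            try? await engine.loadModel(.base, modelFolder: modelsFolder.path)
            isReady = true
        }
    }
}

@main
struct TranscriberApp: App {
    @StateObject private var model = AppModel()

    var body: some Scene {
        // The menu bar icon doubles as the loading indicator for those few seconds.
        MenuBarExtra("Transcriber", systemImage: model.isReady ? "waveform" : "hourglass") {
            Text(model.isReady ? "Ready to transcribe" : "Loading model…")
        }
    }
}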
The Transcription Pipeline
Once a model is loaded, transcription is straightforward but has nuances worth understanding:
func transcribe(
    _ audioURL: URL,
    language: String? = nil,
    progress: ((TranscriptionProgress) -> Void)? = nil
) async throws -> TranscriptionResult {
    guard let whisper = whisperKit else {
        throw TranscriptionError.modelNotLoaded
    }

    let options = DecodingOptions(
        language: language,
        temperature: 0.0,
        temperatureFallbackCount: 3,
        sampleLength: 224
    )

    let result = try await whisper.transcribe(
        audioPath: audioURL.path,
        decodeOptions: options
    ) { transcriptionProgress in
        let progressValue = TranscriptionProgress(
            completedSegments: transcriptionProgress.completedSegments,
            totalSegments: transcriptionProgress.totalSegments
        )
        progress?(progressValue)
        return true // Continue transcription
    }

    return TranscriptionResult(
        text: result.map { $0.text }.joined(separator: " "),
        segments: result.map { segment in
            Segment(
                text: segment.text,
                start: segment.start,
                end: segment.end
            )
        },
        language: language ?? "auto"
    )
}
A few details worth noting:
Temperature 0.0 with fallback. Setting temperature to 0 gives deterministic output. The fallback count of 3 means WhisperKit will retry with slightly higher temperature if the initial transcription has low confidence. This handles mumbled or noisy audio better than a fixed temperature.
Progress callbacks. WhisperKit calls back after each decoded segment. For long audio files, this lets me update a progress indicator showing "Transcribing... 14/23 segments." The callback returns a boolean -- returning false would cancel the transcription. I use this for the cancel button in the UI; a sketch of that wiring follows after these notes.
Segment-level output. The result includes individual segments with timestamps. DropVox stores these for potential future features like clickable timestamps, but currently joins them into a single text string.
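For the cancel button mentioned above, the callback's return value is the hook. A sketch, assuming a small thread-safe flag owned alongside the engine; the CancellationToken type and its names are mine, not WhisperKit's:
import Foundation

// A tiny thread-safe flag the UI's cancel button can flip from the main thread
// while WhisperKit's callback reads it from a background thread.
final class CancellationToken: @unchecked Sendable {
    private let lock = NSLock()
    private var cancelled = false

    func cancel() {
        lock.lock(); defer { lock.unlock() }
        cancelled = true
    }

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }
}

// Inside the transcribe function above, the callback then becomes:
//
//     { transcriptionProgress in
//         // ...existing progress reporting...
//         return !token.isCancelled  // false tells WhisperKit to stop decoding
//     }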
Language Support
WhisperKit inherits Whisper's multilingual capabilities. DropVox supports 13 languages: English, Portuguese, Spanish, French, German, Italian, Dutch, Japanese, Korean, Chinese, Russian, Arabic, and Hindi.
Language can be specified explicitly or left as auto-detect. Auto-detection adds a few hundred milliseconds to the first segment because the model needs to identify the language before decoding. For users who always transcribe the same language, explicit selection is faster.
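Using the engine wrapper from earlier, the difference is only the language argument; the codes are Whisper's two-letter language codes, and recordingURL is a placeholder:
// Explicit language: skips detection, slightly faster on the first segment.
let portuguese = try await engine.transcribe(recordingURL, language: "pt")

// Auto-detect: pass nil and let the model identify the language first.
let detected = try await engine.transcribe(recordingURL, language: nil)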
This matters to me personally. My family communicates in Portuguese. My work is in English. I set DropVox to auto-detect and it handles the switching without any configuration change.
One subtlety: language choice does not affect which model file is loaded. All WhisperKit models are multilingual. But the language parameter influences the decoder's token probabilities, so setting the correct language explicitly does improve accuracy, especially for languages with smaller representation in the training data.
Memory Considerations for Menu Bar Apps
This is where real-world experience diverges from documentation. WhisperKit's memory behavior needs careful management for apps that run continuously in the background.
The Base model consumes approximately 180 MB of RAM when loaded. That is acceptable for a menu bar app. But here is the problem: during transcription, memory spikes by an additional 100-200 MB depending on audio length, because the audio encoder holds intermediate activations in memory.
For a menu bar app, my approach is:
- Load the Base model at launch (180 MB baseline)
- During transcription, accept the temporary spike to ~350 MB
- After transcription completes, that extra memory is reclaimed automatically
- If the user selects a larger model, unload Base first, then load the new model
I considered unloading the model entirely when idle to reclaim memory, then reloading on demand. But the 2-3 second reload time made the app feel sluggish. Keeping Base loaded permanently is the right trade-off for a utility app.
For the Large model (~3.5 GB loaded), I display a warning in the settings UI: "Large models consume significant memory. Recommended for Macs with 16 GB or more RAM." Users on 8 GB machines can still select it, but they are informed about the trade-off.
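The gate behind that warning can be a single check against physical memory. A sketch; the property name is mine, and the 16 GB threshold mirrors the wording above:
import Foundation

// Show the large-model warning on Macs with less than 16 GB of physical memory.
var shouldWarnAboutLargeModel: Bool {
    ProcessInfo.processInfo.physicalMemory < 16 * 1_024 * 1_024 * 1_024
}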
The Migration from Python
DropVox started as a Python app using OpenAI's reference Whisper implementation. The migration to WhisperKit was not a simple port.
Python Whisper returns a single dictionary with full text and segments. WhisperKit returns an array of TranscriptionSegment objects. The data shapes are different enough that the history storage layer needed restructuring.
Python Whisper handles audio format conversion internally via ffmpeg. WhisperKit expects WAV input, so I added an audio conversion step using AVFoundation before transcription. This is actually better because I can reject unsupported formats with clear error messages rather than crashing deep in ffmpeg.
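A sketch of such a conversion layer, assuming 16 kHz mono PCM output and an output URL ending in .wav. The function and error names are mine, and production code would want chunked reads for very long recordings:
import AVFoundation

enum AudioConversionError: Error {
    case unreadableInput
    case conversionFailed
}

// Convert an arbitrary audio file (m4a, mp3, ...) into 16 kHz mono WAV.
// outputURL should end in ".wav" so AVAudioFile picks the right container.
func convertToWAV(_ inputURL: URL, outputURL: URL) throws {
    let inputFile: AVAudioFile
    do {
        inputFile = try AVAudioFile(forReading: inputURL)
    } catch {
        // Unsupported or corrupt input: fail with a clear, user-facing error.
        throw AudioConversionError.unreadableInput
    }

    // Whisper models expect 16 kHz mono audio.
    guard let outputFormat = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: 16_000,
        channels: 1,
        interleaved: false
    ), let converter = AVAudioConverter(from: inputFile.processingFormat, to: outputFormat) else {
        throw AudioConversionError.conversionFailed
    }

    let outputFile = try AVAudioFile(forWriting: outputURL, settings: [
        AVFormatIDKey: kAudioFormatLinearPCM,
        AVSampleRateKey: 16_000,
        AVNumberOfChannelsKey: 1,
        AVLinearPCMBitDepthKey: 16
    ])

    // Read the whole input at once; long recordings would want a chunked loop instead.
    guard let inputBuffer = AVAudioPCMBuffer(
        pcmFormat: inputFile.processingFormat,
        frameCapacity: AVAudioFrameCount(inputFile.length)
    ) else { throw AudioConversionError.conversionFailed }
    try inputFile.read(into: inputBuffer)

    // Size the output buffer for the sample-rate change, with a little headroom.
    let ratio = outputFormat.sampleRate / inputFile.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(inputBuffer.frameLength) * ratio) + 1_024
    guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: capacity) else {
        throw AudioConversionError.conversionFailed
    }

    var suppliedInput = false
    var conversionError: NSError?
    let status = converter.convert(to: outputBuffer, error: &conversionError) { _, outStatus in
        if suppliedInput {
            outStatus.pointee = .endOfStream
            return nil
        }
        suppliedInput = true
        outStatus.pointee = .haveData
        return inputBuffer
    }
    if let conversionError {
        throw conversionError
    }
    if status == .error {
        throw AudioConversionError.conversionFailed
    }

    try outputFile.write(from: outputBuffer)
}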
The biggest conceptual shift was concurrency. Python Whisper blocks the thread during transcription. In Swift, WhisperKit is fully async. The transcription runs on a background thread and the UI remains responsive throughout. But this means every call site needs async/await, the transcription engine needs to be an actor for thread safety, and progress updates need to be dispatched back to the main actor for UI updates.
@MainActor
class TranscriptionViewModel: ObservableObject {
    @Published var progress: Double = 0
    @Published var isTranscribing = false

    private let engine: TranscriptionEngine

    init(engine: TranscriptionEngine) {
        self.engine = engine
    }

    func transcribe(_ url: URL) async {
        isTranscribing = true
        defer { isTranscribing = false }

        do {
            let result = try await engine.transcribe(url) { progress in
                Task { @MainActor in
                    self.progress = Double(progress.completedSegments)
                        / Double(progress.totalSegments)
                }
            }
            // Handle result
        } catch {
            // Handle error
        }
    }
}
The @MainActor annotation on the view model ensures that @Published property updates happen on the main thread. The progress callback dispatches back to the main actor explicitly because WhisperKit calls it from a background thread. Getting this wrong causes runtime warnings in debug builds and potential UI glitches in production.
Practical Tips
If you are considering WhisperKit for your own macOS or iOS app, here is what I would tell you over coffee:
Start with the Base model. It is the best balance of speed, accuracy, and memory for most applications. You can always add model selection later.
Pre-load the model at app launch. The 2-3 second loading time is acceptable at launch but annoying if it happens when the user requests a transcription.
Handle the no-model state gracefully. On first launch, no model is downloaded yet. Show a clear onboarding flow that downloads the default model. Show progress. Let the user cancel.
Test on 8 GB machines. If you only develop on a 16 GB or 32 GB Mac, you will not notice memory issues that affect most users. Apple still sells 8 GB MacBook Airs. Those users are your audience too.
Pin your WhisperKit version. The API surface has changed between 0.7, 0.8, and 0.9. Use exact version pinning in your Package.swift until the API stabilizes at 1.0.
Audio format conversion is your responsibility. WhisperKit wants WAV. Your users will throw opus, mp3, m4a, and everything else at you. Build a conversion layer using AVFoundation before it reaches WhisperKit.
Test with accented speech and noisy environments. Whisper models handle clear English well. Performance degrades with background noise, heavy accents, and mixed-language input. Set user expectations accordingly.
Where WhisperKit Is Headed
Argmax continues active development. The roadmap includes word-level timestamps (partially available now), streaming transcription from live microphone input, and improved support for Apple's latest chips. The M3 and M4 Neural Engines are faster than M1, and WhisperKit is being optimized to take advantage of the newer hardware.
For DropVox, this means future versions will likely support real-time transcription of live audio, not just file-based transcription. The latency requirements are tight, but the hardware capability is already there.
On-device ML is not a compromise. For speech transcription on Apple Silicon, it is genuinely the best option available -- faster than cloud APIs, more private, and with no per-request cost. WhisperKit makes it accessible to any Swift developer willing to learn the integration points.
If you are building with WhisperKit or considering on-device ML for macOS, I am happy to compare notes. Find me on LinkedIn or at helrabelo.dev.