What is on-device speech recognition?
Short answer: on-device speech recognition means the model that converts your voice into text runs on your own computer, so the audio never leaves the machine. The opposite is cloud transcription, which uploads your audio to a remote server, runs the model there and sends text back. Local speech-to-text used to be slow or less accurate; on modern hardware it is neither. If you want the practical version of this for one platform, the guide to local speech-to-text on the Mac specifically maps the apps and the trade-offs. This page explains the category itself: how the two architectures differ, why on-device only recently became practical, what it buys you and where it falls short.
On-device versus cloud, mechanically
Every speech-to-text system has the same job: take an audio signal of someone talking and produce the words they said. What separates the two architectures is where the model that does that work physically lives.
In a cloud system, your microphone audio is captured, encoded and streamed over the internet to a data center. A model running on the provider’s servers transcribes it, and the resulting text travels back to your device. The model is large and runs on server hardware; your device is just a recorder and a display. The round-trip happens for every utterance.
In an on-device system, the model is installed on your own machine. Audio is captured to memory, fed straight into the local model, and the text comes out on the same device. Nothing is transmitted. The network is not in the loop at all once the model is in place.
| | On-device (local) | Cloud |
|---|---|---|
| Where the model runs | Your own machine | The provider’s servers |
| Does audio leave the device | No | Yes, every time |
| Works offline | Yes | No |
| Latency source | Local compute only | Compute plus a network round-trip |
| Typical pricing model | One-time purchase or free | Subscription or per-minute metering |
| Language coverage | Bounded by the local model | Often very wide |
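The latency row is easiest to see as a sum. The sketch below is a minimal illustration, not a measurement: the `Latency` type and every number in it are assumptions chosen only to show the shape of the comparison.

```python
from dataclasses import dataclass


@dataclass
class Latency:
    """Per-utterance latency budget in milliseconds (illustrative values only)."""
    compute_ms: float            # time the model spends transcribing
    network_rtt_ms: float = 0.0  # upload plus response round-trip; zero when local

    @property
    def total_ms(self) -> float:
        return self.compute_ms + self.network_rtt_ms


# Hypothetical figures, not benchmarks: the server is assumed to compute faster,
# but it also pays for the round-trip on every utterance.
local = Latency(compute_ms=120.0)
cloud = Latency(compute_ms=60.0, network_rtt_ms=150.0)

print(f"local: {local.total_ms:.0f} ms")
print(f"cloud: {cloud.total_ms:.0f} ms")
```

With these assumed values the cloud path is slower overall even though its compute step is faster, because the network term is added to every utterance. On a fast, stable connection the gap narrows; on a poor one it grows.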
Why it only recently became practical
Speech recognition has run in the cloud for most of its modern history for a simple reason: the models were too big and too compute-hungry for the hardware most people owned. Phones and laptops could record audio and stream it, but they could not run a state-of-the-art recognizer fast enough to feel instant. The wake-word detection in “Hey Siri” or “OK Google” ran locally; the actual transcription that followed went to a server.
Two changes shifted that. The first is on the model side: neural speech models got both more accurate and more efficient, and techniques for compressing them shrank the footprint without giving up much accuracy. A capable recognizer that once needed a rack of GPUs now fits in a few hundred megabytes.
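A back-of-envelope calculation shows why compression matters for footprint. This is a rough sketch that counts only the weights and assumes a 600-million-parameter model as an example size; it says nothing about which precision any particular shipped model actually uses.

```python
def model_size_mb(parameters: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return parameters * bits_per_weight / 8 / 1e6


PARAMS = 600_000_000  # example size; an assumption for illustration

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8)]:
    print(f"{label:>8}: {model_size_mb(PARAMS, bits):,.0f} MB")
```

Halving the bits per weight halves the storage, which is how the same parameter count can move from gigabytes toward a few hundred megabytes. Real model files also carry metadata and may mix precisions, so actual sizes will differ from this estimate.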
The second is on the hardware side, and it is the part that made consumer devices viable. Apple Silicon Macs ship with a dedicated machine-learning accelerator, the Apple Neural Engine, sitting alongside the CPU and GPU. Apple’s Core ML framework schedules model work across those processors and can route the heavy matrix math onto the Neural Engine, where it runs fast and at low power. The unified memory shared across CPU, GPU and Neural Engine means the model does not have to be copied between separate memory pools to run. We go deeper on that in the explainer on how the Apple Neural Engine powers on-device dictation.
Put the efficient model and the accelerator together and a speech recognizer that would once have lived on a server now runs in real time on a laptop. That is the recent part. The architecture was always possible; the consumer hardware to make it pleasant arrived in the last few years.
What on-device buys you
The case for local transcription is four things at once.
- Privacy. If the audio never leaves the device, there is no copy of your voice on someone else’s server, no data lake to subpoena and no question of what a vendor logs or trains on. The privacy guarantee is structural, not a setting you have to trust.
- Speed. A cloud service adds a network round-trip to every utterance. A local model does not, so the text appears with the latency of the compute alone. For live dictation, where you want words on screen before you have finished the sentence, that gap is the part you feel most.
- Offline. No network means no dependency on one. Local transcription works on a plane, in a clinic basement or anywhere the connection is poor or absent.
- Economics. Cloud recognition costs the provider compute for every second you speak, so it is almost always metered or sold as a subscription. A model that runs on your hardware uses your compute, which makes a one-time purchase or a free tool viable. The economics of buying a dictation app once versus renting one follow directly from where the model runs.
For regulated work the privacy point becomes a compliance point. There are two valid answers to a clinical or legal privacy requirement: cloud transcription under a contract that constrains how the audio is handled, or on-device transcription where the audio never travels in the first place. The distinction between architectural and contractual privacy is the framing that decides which fits a given practice.
The honest limitations
On-device is not free of trade-offs. Three are worth stating plainly.
- Language coverage. A local model can only cover the languages it was trained and shipped for. The largest cloud models handle a hundred or more languages because a data center can hold all of them at once. A model sized to run on a laptop usually covers fewer. Parakeet TDT 0.6B v3, the model behind Parakeety, covers 25 European languages; for anything outside that list a wider model is the right tool. The trade-offs of dictating across multiple languages on a Mac turn on exactly this.
- Disk footprint. The model has to live on your machine, which means real storage. Parakeety’s model is around 600 MB, downloaded once on first launch. That is modest by modern standards, but it is not zero, and it is the cost of not depending on a server.
- Device-bound performance. Local speed depends on local hardware. The acceleration that makes this pleasant exists on Apple Silicon; on older or less capable machines the same model runs slower or not at all. Parakeety requires Apple Silicon and macOS 14 or later for that reason.
One more clarification that trips people up: “local” is a spectrum, not a checkbox. Several apps transcribe on-device but then route the transcript through a cloud large language model for cleanup, which puts your words back on the network. So the useful question is rarely “is it local” but “which part is local”. The Airplane Mode test answers it: if a tool stops working with the network off, some part of it is not running on your machine.
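The same idea can be sketched in code: run each stage of a pipeline with the network blocked and see which stages still work. This is a toy illustration under stated assumptions. `local_transcribe` and `cloud_cleanup` are hypothetical stand-ins, and the check only catches sockets opened from Python in the same process; switching the network off at the operating-system level remains the real test.

```python
import socket
from contextlib import contextmanager
from unittest import mock


class NetworkBlocked(RuntimeError):
    """Raised when code tries to open a socket while the network is 'off'."""


@contextmanager
def airplane_mode():
    """Make any attempt to create a socket in this process fail loudly."""
    def refuse(*args, **kwargs):
        raise NetworkBlocked("network access attempted")

    with mock.patch.object(socket, "socket", refuse), \
         mock.patch.object(socket, "create_connection", refuse):
        yield


def works_offline(step) -> bool:
    """Run one pipeline step with sockets blocked; True if it never needed them."""
    with airplane_mode():
        try:
            step()
            return True
        except NetworkBlocked:
            return False


# Hypothetical stand-ins for two stages of a dictation pipeline.
def local_transcribe():
    return "hello world"  # pure local computation, no network


def cloud_cleanup():
    socket.create_connection(("example.com", 443))  # would reach the network


print(works_offline(local_transcribe))  # True
print(works_offline(cloud_cleanup))     # False
```

Checking stage by stage is what answers "which part is local": a tool can pass on transcription and still fail on a later cleanup step.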
The model is what makes it work
Accuracy lives in the model, not in the architecture. The reason on-device transcription is now competitive rather than a privacy compromise is that the models have caught up. The one that has changed what is possible on a Mac is NVIDIA’s Parakeet TDT 0.6B v3: roughly 600 million parameters, a transducer architecture and the current top of the Hugging Face Open ASR Leaderboard. It posts a 6.32% word error rate against Whisper Large V3 at 7.44%, and its transducer design emits nothing during silence instead of inventing text the way encoder-decoder models can. The full primer on what Parakeet is and how it runs on a Mac covers the technical detail, and the standalone explainer on how word error rate is actually measured unpacks what those percentages mean.
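For readers who want the metric in concrete terms, word error rate is the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below is a minimal implementation of that definition; it only lowercases and splits on whitespace, whereas published benchmarks typically apply fuller text normalization first, so it will not reproduce leaderboard figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

Read this way, a 6.32% word error rate means roughly six word-level errors per hundred reference words on the benchmark's test sets.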
Parakeety puts that model behind a push-to-talk dictation flow. Hold a key, talk, release, and the transcribed text pastes at the cursor in whichever Mac app you were typing into. The model runs on the Apple Neural Engine. Audio is captured to memory, transcribed and discarded. The app’s only network calls are periodic license checks and the one-time model download; neither carries your audio. There is no account, no subscription and no analytics.
FAQ
- What does on-device speech recognition mean?
- On-device speech recognition means the model that turns your voice into text runs on your own computer, so the audio never leaves the machine. There is no upload to a server and no internet round-trip in the transcription path. Cloud speech-to-text does the opposite: it streams your audio to a remote service that runs the model and sends text back. The simplest test is Airplane Mode. If dictation still works with the network switched off, the model is running locally.
- Why is on-device speech to text only practical now?
- Two things had to arrive together: speech models small and efficient enough to run on a laptop, and consumer hardware with a dedicated machine-learning accelerator to run them fast. Apple Silicon Macs ship with a Neural Engine alongside the CPU and GPU, and Core ML schedules model work onto it. A 600-million-parameter speech model that would once have needed a server now runs in real time on a Mac. Before that combination, local transcription was either slow or noticeably less accurate than the cloud.
- Is local speech recognition less accurate than the cloud?
- Not inherently. Accuracy is a property of the model, not of where it runs. The model Parakeety runs on Apple Silicon Macs, Parakeet TDT 0.6B v3, posts a 6.32% word error rate against Whisper Large V3 at 7.44% on the Hugging Face Open ASR Leaderboard. A good local model can match or beat most cloud dictation, with none of the latency of a network round-trip.
- What are the limitations of on-device speech recognition?
- Three honest ones. Language coverage is narrower than the largest cloud models: Parakeet TDT v3 covers 25 European languages, not the hundred-plus a model like Whisper handles. The model takes disk space, around 600 MB for Parakeety, downloaded once on first launch. And it is tied to capable hardware, so it needs Apple Silicon and macOS 14 or later. Within those bounds it is fast, private and works offline.
Try it
Parakeety is a Mac menu-bar app and a working example of on-device speech recognition. Hold the section key, talk, release; your words paste at the cursor in whichever app you were typing into. Audio never leaves the machine. It needs Apple Silicon and macOS 14 or later. There is a free 7-day trial with no card required. After that it is $30 once.