How the Apple Neural Engine powers Mac dictation
Short answer: the Apple Neural Engine is the dedicated machine-learning accelerator on every Apple Silicon chip, and it is the reason a 600 MB speech model can transcribe your voice in real time on a laptop instead of on a server. Apple’s Core ML framework schedules the heavy math onto it, and the unified memory shared across the chip means the model runs without being copied between separate memory pools. That combination is what made Apple Neural Engine speech recognition practical, and it is the hardware story underneath what local speech-to-text looks like on a Mac in practice. Here is how the pieces fit, and what it takes to run a model this size without melting the machine.
What changed: three pieces of Apple Silicon
For most of the modern history of speech recognition, the model ran in a data center. Phones and laptops could record audio and stream it, but they could not run a state-of-the-art recognizer fast enough to feel instant. What shifted that on the Mac was not one feature but three working together: the Neural Engine, Core ML and unified memory. Each does a distinct job, and Core ML speech-to-text only feels instant when all three are pulling in the same direction.
- The Neural Engine is the dedicated accelerator that does the actual matrix math, fast and at low power.
- Core ML is Apple’s framework that takes a converted model and schedules its work across the CPU, GPU and Neural Engine.
- Unified memory is the single pool of RAM all three processors share, so the model is not shuttled between separate memories to run.
The Neural Engine, in plain terms
A neural network is, at the core, an enormous pile of multiply-and-add operations. A CPU can do them, but it is a generalist and burns power doing so. A GPU is better, but it is also busy drawing your screen and is hungrier still. The Neural Engine is a circuit built for nothing but that matrix math, which is why it runs a speech model fast while sipping power. On a laptop running off a battery, that efficiency is not a nicety; it is the difference between dictation you can lean on all day and a fan that spins up every time you talk.
The practical payoff for dictation is that transcription does not have to fight whatever else the machine is doing. Your CPU and GPU stay free for the app you are actually working in, and the speech model gets its own dedicated silicon. That is why getting inference to run on the Neural Engine, rather than on the CPU, accounts for the bulk of the conversion work behind Parakeety, a point covered in the account of shipping a 600 MB speech model inside a 2 MB app.
Where Core ML fits
The Neural Engine is hardware; Core ML is the software layer that decides what runs where. Apple describes Core ML as taking advantage of the CPU, GPU and Neural Engine together, in the most efficient way it can, to maximize performance while keeping memory and power use down. It also states plainly that Core ML models run strictly on the device, which removes any need for a network connection and keeps user data private. That sentence is the whole on-device argument in one line: the model is on your machine, so the audio has nowhere else to go.
The catch is that a model does not arrive in Core ML format. The speech model behind Parakeety, NVIDIA’s Parakeet TDT 0.6B v3, is published for NVIDIA GPUs through the NeMo toolkit. Getting it onto the Neural Engine means converting the PyTorch weights through Apple’s Core ML toolchain, then verifying that the converted model’s output matches the original’s within a small numerical tolerance. The deeper version of that path, including the Core ML and MLX options, is in the primer on running NVIDIA Parakeet on a Mac.
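The verification step can be sketched as a tolerance check. This is a minimal illustration in plain Python, not Parakeety's actual test harness: `max_abs_diff`, `outputs_match`, the tolerance values and the sample numbers are all hypothetical, and a real check would compare full output tensors from the PyTorch and Core ML models across many audio clips.

```python
import math


def max_abs_diff(reference, converted):
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(reference) != len(converted):
        raise ValueError("outputs must have the same length")
    return max((abs(r - c) for r, c in zip(reference, converted)), default=0.0)


def outputs_match(reference, converted, rel_tol=1e-3, abs_tol=1e-3):
    """True if every converted value is within tolerance of the reference value.

    The tolerances are illustrative assumptions; reduced-precision inference
    rarely reproduces full-precision output bit for bit.
    """
    if len(reference) != len(converted):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, converted)
    )


# Hypothetical values standing in for one frame of model output.
reference_out = [0.1234, -2.5010, 7.0002]
converted_out = [0.1236, -2.5004, 7.0010]

print(max_abs_diff(reference_out, converted_out))
print(outputs_match(reference_out, converted_out))
```

In practice the tolerance would be chosen by checking that small numerical drift does not change the transcribed text, which is the output that actually matters.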
Why unified memory matters for a 600 MB model
On a traditional PC with a discrete GPU, the CPU and GPU each have their own memory. To run a model on the GPU you copy the weights across a bus into GPU memory first, which costs time and energy. Apple Silicon does not work that way. The CPU, GPU and Neural Engine share one pool of unified memory, so a 600 MB model loads once and every processor can reach it without a copy.
For a speech model this is the quiet enabler. The audio capture happens on the CPU, the model weights live in memory, and the Neural Engine reads them in place. There is no shuffling 600 MB back and forth on every utterance, which keeps both the latency and the power draw low. The model card for Parakeet TDT 0.6B v3 notes it needs at least 2 GB of RAM to load, which a unified-memory Mac handles without breaking stride.
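To put a rough number on the copy that unified memory avoids, here is a back-of-the-envelope sketch. The bus bandwidth is an illustrative assumption, not a measurement of any particular machine, and `copy_seconds` is a hypothetical helper.

```python
def copy_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized time to move size_bytes across a bus, ignoring latency and overhead."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes / bandwidth_bytes_per_s


MODEL_BYTES = 600 * 10**6    # the ~600 MB model from the text
ASSUMED_BUS_BW = 16 * 10**9  # assumption: 16 GB/s effective bus bandwidth

per_copy = copy_seconds(MODEL_BYTES, ASSUMED_BUS_BW)
print(f"one copy: {per_copy * 1000:.1f} ms")
```

Under that assumed bandwidth a single copy costs tens of milliseconds. With one shared memory pool, that term drops out of the latency budget.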
The two numbers that surprise people
Parakeety is a roughly 2 MB menu-bar app that runs a roughly 600 MB speech model. Those numbers look contradictory until you see that the app and the model are separate. The 2 MB is the Swift program: the menu-bar UI, the audio pipeline, the push-to-talk handling and the inference glue. The 600 MB is the trained model, downloaded once on first launch and stored separately. After that download, transcription runs entirely on the Neural Engine and works offline.
| Component | Size | What it is |
|---|---|---|
| The app | ~2 MB | Swift menu-bar program: UI, audio capture, push-to-talk, the glue around inference |
| The model | ~600 MB | Parakeet TDT 0.6B v3, downloaded once on first launch, runs on the Neural Engine |
| Memory to load | From 2 GB RAM | Loaded once into unified memory, read in place by the Neural Engine |
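The split between a small app and a separately stored model comes down to a check-then-download step at launch. A minimal sketch follows, written in Python for illustration rather than in the Swift the app uses. `ensure_model`, the file path and the injected `download` callable are hypothetical, not Parakeety's actual code.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(model_path: Path, download: Callable[[Path], None]) -> bool:
    """Make sure the model file exists locally.

    Returns True if a download was performed, False if the file was already there.
    """
    if model_path.exists():
        return False
    model_path.parent.mkdir(parents=True, exist_ok=True)
    download(model_path)
    return True


def fake_download(dest: Path) -> None:
    # Stand-in for the real one-time network transfer (hypothetical).
    dest.write_bytes(b"model-weights-placeholder")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "models" / "speech-model.bin"   # hypothetical location
    first_launch = ensure_model(path, fake_download)   # file missing: downloads
    second_launch = ensure_model(path, fake_download)  # file present: no network

print(first_launch, second_launch)
```

Because the model lives outside the app bundle, the app itself can stay small and update independently of the weights.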
Why the model choice is the other half of the story
Hardware made local transcription fast; the model made it accurate. Accuracy is a property of the model, not of where it runs, and the one that changed what is possible on a Mac is Parakeet TDT 0.6B v3: roughly 600 million parameters, a transducer architecture and, at the time of writing, the top spot on the Hugging Face Open ASR Leaderboard. It posts a 6.32% word error rate against Whisper Large V3’s 7.44%, and its transducer design emits nothing during silence instead of inventing text the way encoder-decoder models can.
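Word error rate is the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance and then works out the relative gap between the two figures quoted above. It is a simplified illustration: leaderboard scoring also normalizes text (casing, punctuation) first, which this version skips, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which improved is lower than baseline."""
    return (baseline - improved) / baseline


# One substitution in a five-word reference.
print(word_error_rate("hold the key and talk", "hold a key and talk"))

# Relative gap between the two quoted error rates, in percent.
print(round(relative_reduction(7.44, 6.32) * 100, 1))
```

On those two figures the difference works out to roughly 15% fewer word errors in relative terms.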
Speed is where the Neural Engine and the model meet. The Open ASR Leaderboard lists Parakeet TDT 0.6B v3 at an inverse real-time factor (RTFx) of around 3,333, against around 146 for Whisper Large V3. Those throughput figures come from the leaderboard’s GPU benchmark hardware rather than from Apple Silicon, so read them as the relative gap between the two models, not as absolute speed on a Mac. For batch work that gap matters less, but for push-to-talk dictation, where you want the words on screen before you have finished the thought, the speed is the whole felt experience. The trade is language coverage: 25 European languages rather than the hundred or so Whisper handles. The wider category framing is in the explainer on what on-device speech recognition is and why it only recently became practical.
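To see what throughput multiples of that size would mean for a single utterance, divide the audio length by the multiple. The sketch below takes the quoted figures at face value. The ten-second utterance is an assumption, `processing_seconds` is a hypothetical helper, and real latency on a given Mac depends on the hardware and on fixed overheads that the ratio does not capture.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time needed for audio_seconds of audio at a given throughput multiple.

    The multiple is audio duration divided by processing time, so higher is faster.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


UTTERANCE_S = 10.0  # assumption: a ten-second push-to-talk utterance

for name, rtfx in [("Parakeet TDT 0.6B v3", 3333.0), ("Whisper Large V3", 146.0)]:
    ms = processing_seconds(UTTERANCE_S, rtfx) * 1000
    print(f"{name}: {ms:.1f} ms")
```

At these ratios the compute for a short utterance is a few milliseconds against several tens of milliseconds. For dictation, that is the difference between a delay you never notice and one you might.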
Running it without melting the machine
The phrase people reach for is “running a server-class model on a laptop”, and the worry behind it is heat and battery. The reason it does not melt the machine is the same reason it is fast: the work is on dedicated, power-efficient silicon rather than brute-forced on the CPU. Dictation is also bursty, not continuous. You hold a key, talk for a few seconds, release. The model runs in that short window, transcribes, pastes at the cursor, and the audio buffer is discarded. The Neural Engine is idle the rest of the time.
That bursty, on-device shape is also what makes the privacy guarantee structural rather than a setting. Audio is captured to memory while you hold the key, transcribed on the Neural Engine when you release, and discarded. Nothing is written to disk, nothing is uploaded. Parakeety’s only network calls are the one-time model download and periodic license checks, never your voice. There is no account, no subscription and no analytics.
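The capture, transcribe, discard cycle described above can be sketched as a small state machine. This is an illustration in Python, not Parakeety's Swift implementation: `PushToTalkSession` and `fake_transcribe` are hypothetical names, and the injected `transcribe` callable stands in for on-device inference.

```python
from typing import Callable, List


class PushToTalkSession:
    """Holds audio in memory only for the duration of one key press."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self._transcribe = transcribe
        self._chunks: List[bytes] = []
        self._recording = False

    def key_down(self) -> None:
        self._chunks.clear()
        self._recording = True

    def on_audio(self, chunk: bytes) -> None:
        # Audio is appended to an in-memory list; nothing is written to disk.
        if self._recording:
            self._chunks.append(chunk)

    def key_up(self) -> str:
        self._recording = False
        audio = b"".join(self._chunks)
        try:
            return self._transcribe(audio)
        finally:
            # Drop the buffer whether or not transcription succeeded.
            self._chunks.clear()

    @property
    def buffered_bytes(self) -> int:
        return sum(len(c) for c in self._chunks)


def fake_transcribe(audio: bytes) -> str:
    # Hypothetical stand-in for on-device inference.
    return f"<{len(audio)} bytes transcribed>"


session = PushToTalkSession(fake_transcribe)
session.key_down()
session.on_audio(b"\x00\x01" * 4)
session.on_audio(b"\x02\x03" * 4)
text = session.key_up()
print(text, session.buffered_bytes)
```

Clearing the buffer in a `finally` block is the design choice that matters here: the audio is gone after every release, even if transcription raises an error.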
FAQ
- What is the Apple Neural Engine?
- The Apple Neural Engine is a dedicated machine-learning accelerator built into every Apple Silicon chip, sitting alongside the CPU and GPU. It is purpose-built for the matrix math that neural networks run on, which it does fast and at low power. For dictation, that means a speech model can transcribe in real time without competing with whatever your CPU and GPU are already doing, and without draining the battery the way sustained CPU inference would.
- Does dictation on the Neural Engine work offline?
- Yes. Once the speech model is on the Mac, transcription runs entirely on the Neural Engine with no network in the loop. Parakeety downloads its model once on first launch, a one-time transfer of around 600 MB, and after that it works offline. The simplest test is Airplane Mode: if dictation still works with the network off, the model is running on-device.
- Why can a Mac run a 600 MB speech model when a phone could not before?
- Two things had to arrive together. Speech models got small and efficient enough to fit in a few hundred megabytes without losing accuracy, and Apple Silicon shipped a Neural Engine that runs them fast. Core ML schedules the heavy matrix math onto that accelerator, and the unified memory shared across the CPU, GPU and Neural Engine means the model does not have to be copied between separate memory pools to run. Before that combination, on-device transcription was either slow or noticeably less accurate than the cloud.
- Is Apple Dictation the same as running a model on the Neural Engine?
- Apple Dictation does use on-device recognition for many languages on modern Macs, and that work runs on Apple Silicon. The difference is the model. Apple ships its own recognizers tuned for short, casual dictation, while Parakeety runs NVIDIA’s Parakeet TDT 0.6B v3 on the Neural Engine, which tops the Hugging Face Open ASR Leaderboard. Both are on-device; the accuracy and the push-to-talk flow are where they part.
Try it
Parakeety is a Mac menu-bar app that runs the Parakeet TDT v3 model on the Apple Neural Engine. Hold the section key, talk, release; your words paste at the cursor in whichever app you were typing into. Audio never leaves the machine. It needs Apple Silicon and macOS 14 or later. There is a free 7-day trial with no card required. After that it is $30 once.