How I shipped a 600 MB speech model inside a 2 MB Mac app
Parakeety is a 2 MB menu-bar app that runs a 600 MB speech model on the Apple Neural Engine. Those two numbers surprise people, so this is how the trick works: the app and the model are separate, the model downloads once on first launch, and the real engineering is converting a model built for NVIDIA GPUs into something that runs efficiently on Apple Silicon. Here is the honest version of that path.
The two numbers
The app is small because it is mostly glue: a Swift menu-bar program, the floating recording pill, the audio capture pipeline, the keyboard handling for push-to-talk, and the code that pastes the transcript at the cursor. None of that is large. The weight is in the speech model, NVIDIA’s Parakeet TDT 0.6B v3, which is roughly 600 MB of trained parameters.
So I keep them apart. The 2 MB you download from the site is the app. The 600 MB model downloads once on first launch and lives separately on disk. Bundling the model into the installer would mean a 600 MB download every time you update a 2 MB program, which is the wrong trade. The cost of separating them is that first launch needs the network once; after that, transcription is entirely on-device and works offline.
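The first-launch behaviour described above can be sketched roughly as follows. This is an illustrative Python sketch, not the app's Swift code: `ensure_model`, the `.partial` suffix, and the injected `download` callable are all hypothetical names chosen for the example. It shows the idea of checking for the model on disk, fetching it only when it is missing, and renaming it into place only once the download has finished.

```python
from pathlib import Path
from typing import Callable


def ensure_model(model_path: Path, download: Callable[[Path], None]) -> bool:
    """Make sure the model file exists at model_path.

    Returns True if a download was needed, False if the model was already there.
    `download` is a hypothetical callable that writes the model to the given path.
    """
    if model_path.exists():
        # Every launch after the first takes this branch: no network needed.
        return False

    model_path.parent.mkdir(parents=True, exist_ok=True)

    # Download to a temporary name first so an interrupted download
    # never leaves a half-written file at the final location.
    partial = model_path.with_name(model_path.name + ".partial")
    download(partial)

    # Move the finished file into place.
    partial.replace(model_path)
    return True
```

Passing the downloader in as a parameter keeps the sketch free of any real URL or networking library, and makes the "download once" property easy to check with a stub.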
The actual hard part: getting the model onto Apple Silicon
The upstream Parakeet release is not built for Macs. NVIDIA ships it through the NeMo toolkit, which expects an NVIDIA GPU and a CUDA stack, neither of which exists on Apple Silicon. The open weights are the starting point, not the finish line. The work is the path from PyTorch weights to something that runs fast on the Apple Neural Engine.
In rough order, that meant: getting the model running on CPU first as a correctness baseline, then converting it through Apple’s Core ML toolchain so it targets the Neural Engine, then verifying that the converted model produces output numerically equivalent to the original. Around that sits the audio side: capturing microphone input, resampling to the 16 kHz mono the model expects, handling voice activity, and feeding it the buffer cleanly. The broader shape of this conversion work is covered in the primer on running Parakeet on a Mac.
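The verification step is the easiest part of that list to show in code. The sketch below is a generic way to compare two models' outputs on the same input, not the actual test harness used for Parakeety. The function names and the `1e-3` tolerance are assumptions for illustration. In practice, "equivalent" after a conversion usually means "within a small tolerance" rather than bit-identical, and the right tolerance depends on the model and the precision it runs at.

```python
import math
from typing import Sequence


def max_abs_diff(reference: Sequence[float], converted: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two output vectors."""
    if len(reference) != len(converted):
        raise ValueError("outputs have different lengths")
    worst = 0.0
    for r, c in zip(reference, converted):
        diff = abs(r - c)
        if math.isnan(diff):
            # A NaN in either output is always treated as a mismatch.
            return math.inf
        worst = max(worst, diff)
    return worst


def outputs_match(
    reference: Sequence[float],
    converted: Sequence[float],
    atol: float = 1e-3,  # assumed tolerance, chosen only for this example
) -> bool:
    """True if every element of the converted output is within atol of the baseline."""
    return max_abs_diff(reference, converted) <= atol
```

Here `reference` would be the CPU baseline's output and `converted` the output of the converted model for the same audio. Comparing the final transcripts as well is a useful second check, since small numeric drift only matters if it changes the words.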
The payoff for doing it on the Neural Engine rather than the CPU or GPU is that transcription is fast and power-efficient, and it does not fight whatever else your machine is doing. For push-to-talk dictation, where you want the words on screen before you have finished the thought, that speed is the whole felt experience.
Keeping it private by architecture
Because the whole pipeline runs locally, privacy is a property of the architecture rather than a promise in a policy. While you hold the key, audio is captured to a buffer in memory. When you release, that buffer goes through the model, the words paste at the cursor, and the audio is discarded. Nothing is written to disk, nothing is uploaded, and the app keeps no history. The only network traffic is the one-time model download and a periodic license check, never your voice. If you want the framing of why that matters more than a contractual promise, I wrote about architectural versus contractual privacy separately.
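The hold, release, discard lifecycle can be sketched as a small state machine. This is a simplified Python illustration under stated assumptions, not the app's Swift implementation: `PushToTalkSession`, its method names, and the injected `transcribe` and `paste` callables are hypothetical. What it tries to show is that the audio only ever lives in an in-memory buffer, and that the buffer is cleared whether or not transcription succeeds.

```python
from typing import Callable, List


class PushToTalkSession:
    """Holds audio in memory while the key is down and discards it on release."""

    def __init__(
        self,
        transcribe: Callable[[List[float]], str],  # stand-in for the local model
        paste: Callable[[str], None],              # stand-in for paste-at-cursor
    ) -> None:
        self._transcribe = transcribe
        self._paste = paste
        self._buffer: List[float] = []
        self._recording = False

    @property
    def buffered_samples(self) -> int:
        return len(self._buffer)

    def key_down(self) -> None:
        # Start a fresh recording; nothing from a previous hold survives.
        self._buffer.clear()
        self._recording = True

    def audio_chunk(self, samples: List[float]) -> None:
        # Audio arriving while the key is not held is ignored.
        if self._recording:
            self._buffer.extend(samples)

    def key_up(self) -> None:
        if not self._recording:
            return
        self._recording = False
        try:
            text = self._transcribe(self._buffer)
            if text:
                self._paste(text)
        finally:
            # Discard the audio even if transcription or pasting fails.
            self._buffer.clear()
```

Because there is no code path that writes the buffer to a file or sends it over the network, the privacy property falls out of the structure rather than depending on a setting.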
For the wider context of where this app sits among the alternatives, the map is in the complete guide to local speech-to-text on Mac. And if you want the non-technical version of why the app exists at all, that is in why I built Parakeety.
FAQ
- Why is the app only 2 MB if the model is 600 MB?
- Because the model is not bundled into the app. The 2 MB is the Swift app itself: the menu-bar UI, the audio pipeline and the inference glue. The roughly 600 MB speech model downloads once on first launch and is stored separately. Bundling it would mean shipping a 600 MB installer with every update, which is a worse experience for a 2 MB program.
- Why download the model instead of bundling it?
- Two reasons. A 600 MB download attached to every app update is wasteful when the model rarely changes. And keeping the model separate from the app binary means I can update the app without re-shipping the weights, and swap the model independently if a better one arrives. The trade is that first launch needs an internet connection once.
- Does Parakeety run the model on the GPU or the Neural Engine?
- The Apple Neural Engine, the dedicated ML accelerator on every M-series chip. Running it there means transcription does not compete with whatever your CPU and GPU are doing, and it is power-efficient. Getting the model onto the Neural Engine is the bulk of the conversion work, since the upstream release targets NVIDIA GPUs.
Try it
Parakeety is a Mac menu-bar app. Hold the section key, talk, release; your words paste at the cursor in whichever app you were typing into. Audio never leaves the machine. There is a free 7-day trial with no card required. After that it is $30 once.