Speech-to-text accuracy: word error rate explained

Short answer: word error rate (WER) is the number of words a speech model gets wrong against a known-correct transcript, counting substitutions, deletions and insertions, divided by the number of words in the reference. Lower is better. It is the number on the Hugging Face Open ASR Leaderboard, and it is the cleanest single figure for ranking models. But it is one number measured on one fixed set of English recordings, and it leaves out most of what makes dictation feel accurate in daily use. This guide explains what WER measures, how to read the leaderboard, what a one-point gap actually feels like, and what WER does not capture, all of which matters when you are choosing a local speech-to-text app for Mac.

What word error rate measures

WER compares a model's transcript to a reference transcript that a human has verified as correct. Scoring aligns the two word by word, choosing the alignment that needs the fewest edits, and counts three kinds of mistake:

Substitutions. The model wrote the wrong word. You said "their" and it wrote "there".
Deletions. The model dropped a word that was spoken. You said "the final report" and it wrote "the report".
Insertions. The model added a word that was not spoken. You said "send it now" and it wrote "send it out now".

Add the three counts together and divide by the number of words in the reference. That is the rate. A 100-word reference with three substitutions, one deletion and one insertion has five errors over a hundred words, which is a 5% WER. The figure can exceed 100% in pathological cases, because each insertion adds an error without adding a word to the reference, so the error count can outgrow the denominator. For any usable model on benchmark audio it sits in single digits.
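
The counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the scorer any leaderboard uses: it assumes the text is already normalized and split on whitespace, and the function name wer_counts is made up for this example. When several alignments tie on the total, the split between substitutions, deletions and insertions can vary, but the total and the rate do not.

```python
def wer_counts(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions, wer) for two transcripts.

    Illustrative sketch: assumes both strings are already normalized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # hypothesis empty: every word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # reference empty: every word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match, no new error
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total errors first, so this keeps the cheapest path.
            cost[i][j] = min(substitution, deletion, insertion)

    total, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins, total / len(ref)


# The three examples from the list above.
print(wer_counts("their", "there"))
print(wer_counts("the final report", "the report"))
print(wer_counts("send it now", "send it out now"))

# The 100-word example: (3 + 1 + 1) / 100 = 0.05, a 5% WER.
print((3 + 1 + 1) / 100)
```

A one-word reference transcribed as four words has three insertions over one reference word, which is how the rate climbs past 100%.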

The key word is reference. WER is only ever a comparison against one specific human transcript of one specific set of recordings. Change the recordings and the number changes. That is why a leaderboard fixes the corpus: so the only thing varying between two scores is the model.

How to read the Open ASR Leaderboard

The standard public reference for English accuracy is the Hugging Face Open ASR Leaderboard. It runs a fixed set of English test data through each model and reports an average WER across those datasets, alongside a real-time factor that measures speed. Because every model is scored the same way on the same audio, the ranking is a fair head-to-head. Two things are worth understanding before you read a row off it.

First, the scores are aggregates. The headline WER is averaged over several English datasets that span clean read speech, conversational audio and noisier recordings. A model can be excellent on clean audio and weaker on noisy audio and still post a strong average, so the single figure smooths over where a model is strong and where it struggles.
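
A small sketch shows how an average can hide the spread. The per-dataset figures below are invented purely for illustration (they are not leaderboard results), the model and dataset names are placeholders, and the sketch assumes the headline is a plain unweighted mean across datasets.

```python
from statistics import mean

# HYPOTHETICAL per-dataset WER values in percent, made up for illustration only.
model_a = {"clean_read": 2.0, "conversational": 8.0, "noisy": 11.0}
model_b = {"clean_read": 6.0, "conversational": 7.0, "noisy": 8.0}

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    headline = mean(scores.values())   # the single figure a leaderboard row shows
    worst = max(scores.values())       # what the average smooths over
    print(f"{name}: average {headline:.2f}%, worst dataset {worst:.2f}%")
```

Both placeholder models average the same, yet one is far weaker on the noisy set, which is the case that matters if you dictate in a noisy room.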

Second, the audio is not your audio. The corpus is a fixed set of recordings chosen to be representative, not a recording of you at your desk. Your own accuracy depends on your accent, your microphone, the noise in your room and how much specialist vocabulary you use. The leaderboard ranks models against each other; it does not promise you a particular number.

The two figures that anchor most of these comparisons come straight off that benchmark. Here is how the two open models behind most Mac dictation apps compare.

Model	Word error rate	What that means
Parakeet TDT 0.6B v3	6.32%	Roughly six wrong words in every hundred on the benchmark corpus
Whisper Large V3	7.44%	Roughly seven wrong words in every hundred on the same corpus

Lower wins, so Parakeet edges Whisper on the headline number. The full model-level treatment, including architecture, speed and language coverage, is in the comparison of the two open speech models behind most Mac apps, and the background on the model itself sits in the primer on running NVIDIA Parakeet on a Mac.

What a one-point difference actually feels like

The gap between 6.32% and 7.44% is about one point, which is roughly one extra error in every hundred words. To make that concrete: a dense paragraph of dictated prose runs around 100 to 120 words. At a 6% WER you would expect somewhere near six or seven slips in that paragraph; at 7.4% you would expect closer to eight. The difference between the two is one word.
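
The arithmetic behind that estimate is short enough to check directly. The sketch below uses the two leaderboard figures quoted in this guide and the 100 to 120 word paragraph length; the helper name expected_errors is just for this example, and it treats WER as a flat per-word rate, which real transcripts only approximate.

```python
def expected_errors(wer_percent: float, words: int) -> float:
    """Expected number of word errors in a passage of the given length."""
    return wer_percent / 100 * words


PARAKEET_WER = 6.32  # percent, from the table above
WHISPER_WER = 7.44   # percent, from the table above

for words in (100, 120):
    parakeet = expected_errors(PARAKEET_WER, words)
    whisper = expected_errors(WHISPER_WER, words)
    gap = whisper - parakeet
    print(f"{words} words: {parakeet:.1f} vs {whisper:.1f} errors, gap {gap:.1f}")
```

Across that paragraph length the gap stays a little over one word.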

That is worth sitting with, because it cuts against the instinct to chase the top of the leaderboard. One word per paragraph is below the level most people notice while editing, and well below the variation your own setup introduces. Move from a built-in laptop microphone to a decent headset and you will shift your personal error rate by more than a point. Dictate in a quiet room instead of a noisy one and you will shift it by more again. At the top of the leaderboard, the model has stopped being the bottleneck. Your microphone and your room are.

The honest reading is that any model under about 10% on a clean English benchmark is past the threshold where WER is the thing to optimize for dictation. Below that line, the differences that decide whether a tool is pleasant to use day to day are mostly things WER does not measure at all.

What WER leaves out

Standard WER scoring normalizes the text before it compares, which means it deliberately throws away several things that matter enormously for real dictation. Each of these is invisible to the headline number.

What WER ignores	Why it matters for dictation
Punctuation and capitalization	Stripped before scoring, so a model that punctuates cleanly and one that produces a wall of lowercase text can post the same WER. For dictation this is most of the editing work.
Formatting and numbers	Normalization rewrites "twenty twenty six" and "2026" to one form, so how a model renders dates, currency and lists is not tested by the score at all.
Your accent and microphone	The benchmark audio is fixed and is not you. A model strong on the corpus can still misread your accent or struggle through a cheap mic.
Domain vocabulary	Clinical, legal and technical terms barely appear in general benchmarks, so the score says little about how a model handles the jargon you actually dictate.
Language	The headline figure is English. A model's accuracy in French, German or Polish is a separate question the English number does not answer.
Speed and behavior in silence	WER is accuracy only. It does not measure how fast text appears, or whether a model invents words during a pause, both of which dominate how dictation feels.

Punctuation and formatting: the part you feel most

Of everything WER omits, punctuation is the one that shapes the daily experience most. Benchmark scoring removes punctuation and case before comparing, so two models with identical WER can differ wildly in how much cleanup their output needs. One leaves out the commas and full stops where you paused; the other reads your phrasing and lays in sentence breaks. The leaderboard treats them as equal. Your editing time does not.

This is why auto-punctuation is a feature worth checking directly rather than inferring from a WER figure. A model that punctuates from your cadence saves the step of going back to add the breaks by hand, which is often the slowest part of cleaning up dictated text.
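
To see why the score cannot tell those two models apart, here is a rough sketch of text normalization. It is a deliberately simple stand-in, assuming only lowercasing and punctuation stripping; the normalizers real benchmarks use do considerably more (numbers, spelling variants and so on), and the function name normalize is chosen for this example.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # replace punctuation with spaces
    return text.split()


punctuated = "Send the report today. Then, call me when it's done!"
unpunctuated = "send the report today then call me when it's done"

# After normalization both outputs are the same word sequence,
# so any word-level comparison against a reference scores them identically.
print(normalize(punctuated))
print(normalize(punctuated) == normalize(unpunctuated))
```

The first string is ready to send and the second needs editing, yet once normalized they are indistinguishable to the scorer.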

Speed and silence: also not on the scoreboard

WER is purely an accuracy metric. The leaderboard reports speed as a separate real-time factor, and for interactive dictation that figure matters as much as accuracy. A model that is one point more accurate but adds a noticeable pause before the text appears can feel worse to use than a faster, slightly less accurate one.
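
As a rough sketch of what a real-time factor expresses, the snippet below times a transcription call and divides by the audio length. It assumes the common convention of processing time over audio duration, where values below 1 mean faster than real time; some reports publish the inverse ratio, so check which way a given table reads. The transcribe callable and fake_transcribe are hypothetical placeholders for whatever model you are testing.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[bytes], str], audio: bytes, audio_seconds: float):
    """Time one transcription call and return (text, real-time factor)."""
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    return text, real_time_factor(elapsed, audio_seconds)


# Placeholder model: stands in for a real speech-to-text call.
def fake_transcribe(audio: bytes) -> str:
    return "hello world"


text, rtf = measure_rtf(fake_transcribe, b"", audio_seconds=5.0)
print(text, rtf)
```

For dictation, the latency you feel also depends on when the model starts working, so treat a number like this as a rough guide rather than a measure of perceived delay.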

Behavior in silence is a third axis WER cannot see, because benchmark audio is mostly continuous speech. In real dictation you stop and think constantly. Some architectures stay quiet through a pause; others can invent text to fill it. Two models with the same WER on continuous benchmark audio can behave very differently the moment you take a breath mid-sentence. The architecture-level reasons for that difference are unpacked in the comparison of transducer and encoder-decoder speech models.

How to actually compare dictation apps

WER is a good first filter and a poor final answer. A sensible way to use it:

Use WER to rule out, not to rank. A model above roughly 10% on a clean English benchmark is worth skipping. Among the models below that line, the WER ranking tells you almost nothing about how the tool will feel.
Check the language you dictate in. The headline figure is English. If you work in another language, that English number is not your number.
Test it on your own voice. Dictate a paragraph you would actually write, with your microphone and your accent and your jargon, and read the output. That single test tells you more than any benchmark.
Weigh punctuation and speed alongside accuracy. The cleanup work and the latency are what you live with every day, and neither shows up in WER.
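
The first step above, using WER to rule out rather than to rank, can be sketched as a simple filter. The two real scores are the ones quoted in this guide; the third entry is a hypothetical model added only to show the cutoff working, and the 10% threshold is the rough line suggested here, not a standard.

```python
MAX_WER = 10.0  # percent; the rough cutoff suggested in this guide

scores = {
    "Parakeet TDT 0.6B v3": 6.32,
    "Whisper Large V3": 7.44,
    "hypothetical-model": 14.0,  # invented entry, for illustration only
}


def shortlist(wer_by_model: dict[str, float], max_wer: float = MAX_WER) -> list[str]:
    """Keep models under the cutoff, sorted by name rather than by score."""
    return sorted(name for name, wer in wer_by_model.items() if wer < max_wer)


print(shortlist(scores))
```

Sorting by name is deliberate: once a model clears the cutoff, its position in the WER ranking is not what should decide between the survivors.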

The round-up that applies this frame across the field is the guide to the best Mac dictation apps in 2026, which weighs each option on the things the score leaves out as well as the score itself.

Where Parakeety sits

Parakeety runs Parakeet TDT 0.6B v3, the model that posts 6.32% on the Open ASR Leaderboard, on the Apple Neural Engine on your Mac. On the things WER does not measure, that choice carries through: the transducer architecture stays quiet during your pauses instead of inventing text, the model runs fast enough that the transcript pastes the instant you release the key, and auto-punctuation lays in the sentence breaks from your cadence. It covers 25 European languages with automatic detection, so the English headline figure is not the whole accuracy story for multilingual work. The thing it cannot do is beat your own setup: a good microphone and a quiet room still move your personal error rate more than the gap between any two top models.

FAQ

What is a good word error rate for dictation? Anything under roughly 10% on a clean English benchmark is past the point where the model is the limiting factor for dictation. The top open models sit well below that: Parakeet TDT 0.6B v3 posts 6.32% on the Hugging Face Open ASR Leaderboard and Whisper Large V3 posts 7.44%. At those levels your accent, microphone, background noise and domain vocabulary affect your real transcripts more than the difference between two leaderboard scores does.
How is word error rate calculated? Word error rate is the number of word-level errors divided by the number of words in the reference transcript. Errors come in three kinds: substitutions (a wrong word), deletions (a missing word) and insertions (an extra word). Add those three counts together and divide by the total reference words. A 100-word reference with three substitutions, one deletion and one insertion scores five errors over a hundred words, a 5% word error rate. Lower is better.
Does a lower WER always mean better dictation? Not on its own. Word error rate is measured on a fixed benchmark corpus in one language, usually English, and it ignores punctuation, capitalization and formatting because those are normalized away before scoring. It also says nothing about speed, behavior in silences, or how the model handles your accent and your jargon. A model one point lower on the leaderboard can feel worse in daily use if it is slower or invents text in your pauses.
What does word error rate not capture? Quite a lot of what makes dictation usable. Standard scoring strips punctuation, capitalization and formatting, so a model that punctuates well and one that does not can post the same score. It is measured on benchmark audio, not your accent, your microphone or your background noise. It is reported per language, usually English. And it says nothing about transcription speed or whether a model hallucinates text during silence. WER ranks models fairly against each other; it is not a promise about your own transcripts.

Try it

Parakeety is a Mac menu-bar app that runs the top-of-leaderboard Parakeet TDT v3 model on the Apple Neural Engine. Hold the section key, talk, release; your words paste at the cursor in whichever app you were typing into, punctuation and all. Audio never leaves the machine. It needs Apple Silicon and macOS 14 or later. There is a free 7-day trial with no card required. After that it is $25 once.

Try Parakeety free →