An Evaluation Framework for Voice Reconstruction

Audio demo for 17 zero-shot TTS systems evaluated on speakers with speech disorders from the Speech Accessibility Project (SAP) corpus

Each system receives a voice prompt (one recorded utterance from the speaker) and must synthesise a different target sentence in that speaker's voice. Objective metrics and subjective evaluations are shown per system.

WER = Word Error Rate (lower is better)
Spk. Sim. = Speaker Similarity cosine score (higher is better)
TTSDS = TTS Distribution Score (higher is better)
PER = Phone Error Rate (lower is better)
UTMOS = predicted Mean Opinion Score (higher is better)
Log Worth = Plackett-Luce subjective log worth (higher is better)
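As an aid to reading the Log Worth columns, the sketch below applies the standard Plackett-Luce interpretation: the probability that an item is picked first from a set is proportional to the exponential of its log worth. This is only an interpretation aid. The fitting procedure used for the listening tests is not described on this page, and `best_choice_probabilities` is a hypothetical helper name. The three Recon. Log Worth values are copied from the tables further down the page.

```python
from math import exp


def best_choice_probabilities(log_worths):
    """Plackett-Luce probability that each item is chosen first from the set.

    log_worths maps an item name to its log worth. The probability of an
    item is exp(log worth) divided by the sum of exp(log worth) over the set.
    """
    weights = {name: exp(value) for name, value in log_worths.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Recon. Log Worth values copied from the tables on this page.
# The Recording has log worth 0.000 in both Log Worth columns.
recon_log_worth = {
    "Recording": 0.000,
    "IndexTTS2": 0.963,
    "OpenVoice": -1.534,
}

for name, prob in best_choice_probabilities(recon_log_worth).items():
    print(f"{name}: {prob:.3f}")
```

Only differences between log worths matter under this model: adding the same constant to every value leaves the probabilities unchanged.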

Interactive Correlation Explorer

Select two metrics to visualise their Spearman rank correlation across all 18 entries (the 17 systems plus the Recording). The orange point marks the Recording (ground truth).

Controls: X axis (objective metric), Y axis (subjective metric), speaker group.
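The correlation shown by the explorer can be approximated offline. The sketch below is a minimal pure-Python Spearman rank correlation (Pearson correlation of average ranks), applied to two system-wide columns copied from the tables on this page: TTSDS|SAP and Recon. Log Worth for the 17 systems. It is an illustration under assumptions: the function names are hypothetical, the Recording row is left out, and the explorer's own handling of ties and speaker groups is not documented here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# (system, TTSDS|SAP, Recon. Log Worth), copied from the tables on this page.
rows = [
    ("E2-TTS", 90.60, 0.672), ("F5-TTS", 85.80, 0.182),
    ("Fish Speech", 80.34, 0.356), ("GPT-SoVITS", 87.30, -0.644),
    ("HierSpeech", 88.86, -0.755), ("IndexTTS2", 86.93, 0.963),
    ("MaskGCT", 90.84, 0.141), ("Metavoice", 80.89, -1.176),
    ("OpenVoice", 76.81, -1.534), ("Qwen3-TTS", 86.44, 0.791),
    ("StyleTTS2", 80.91, -0.788), ("TorToiSe", 82.46, -0.828),
    ("Vevo", 87.12, -0.944), ("VibeVoice", 88.66, -0.420),
    ("VoiceCraft", 88.25, -0.577), ("WhisperSpeech", 79.80, -1.355),
    ("XTTS", 86.05, -1.289),
]

rho = spearman([r[1] for r in rows], [r[2] for r in rows])
print(f"Spearman rho (TTSDS|SAP vs Recon. Log Worth): {rho:.3f}")
```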

High Intelligibility Speaker

Speaker B

High Intelligibility · Rec WER 0%

Voice prompt: “at last billy woodchuck's lips began to feel very weird, puckered up as they were”

Target text: “play the dear evan hansen soundtrack”

System	WER	Spk. Sim.	UTMOS	PER	TTSDS|LibriTTS	TTSDS|SAP	TTSDS Mean	Recon. Log Worth	Intel. Log Worth
Recording	0.000	0.684	1.433	0.852	77.39	93.64	85.51	0.000	0.000
E2-TTS	1.333	0.736	1.494	1.037	81.92	90.60	86.26	0.672	0.198
F5-TTS	0.000	0.660	1.470	0.741	85.25	85.80	85.52	0.182	1.464
Fish Speech	0.000	0.452	2.076	0.778	86.80	80.34	83.57	0.356	2.340
GPT-SoVITS	0.167	0.381	1.628	0.519	82.72	87.30	85.01	−0.644	−0.071
HierSpeech	0.333	0.315	3.290	0.370	78.64	88.86	83.75	−0.755	0.082
IndexTTS2	0.500	0.719	1.458	0.519	85.28	86.93	86.11	0.963	1.921
MaskGCT	0.667	0.661	1.467	0.778	79.31	90.84	85.07	0.141	−0.054
Metavoice	0.000	0.577	1.428	0.852	77.56	80.89	79.23	−1.176	−1.076
OpenVoice	0.000	0.263	2.827	0.333	84.07	76.81	80.44	−1.534	2.236
Qwen3-TTS	0.000	0.612	2.168	0.556	86.47	86.44	86.46	0.791	2.207
StyleTTS2	0.000	0.354	3.849	0.259	85.02	80.91	82.97	−0.788	2.447
TorToiSe	0.167	0.475	2.637	0.259	85.65	82.46	84.06	−0.828	1.940
Vevo	1.000	0.230	1.267	1.000	81.23	87.12	84.17	−0.944	0.030
VibeVoice	0.167	0.508	1.404	0.667	82.96	88.66	85.81	−0.420	0.929
VoiceCraft	0.833	0.400	1.914	0.815	78.57	88.25	83.41	−0.577	−0.654
WhisperSpeech	0.167	0.415	3.111	0.778	84.71	79.80	82.25	−1.355	1.208
XTTS	0.667	0.487	2.194	0.667	80.92	86.05	83.48	−1.289	0.490

Low Intelligibility Speaker

Speaker E

Low Intelligibility · Rec WER 125%

Voice prompt: “we had played a long while”

Target text: “navigate to o’hare airport”

System	WER	Spk. Sim.	UTMOS	PER	TTSDS|LibriTTS	TTSDS|SAP	TTSDS Mean	Recon. Log Worth	Intel. Log Worth
Recording	1.250	0.597	2.114	0.857	77.39	93.64	85.51	0.000	0.000
E2-TTS	0.250	0.760	2.235	0.952	81.92	90.60	86.26	0.672	0.198
F5-TTS	0.000	0.659	2.546	0.524	85.25	85.80	85.52	0.182	1.464
Fish Speech	0.500	0.319	3.740	0.810	86.80	80.34	83.57	0.356	2.340
GPT-SoVITS	0.500	0.576	2.889	0.810	82.72	87.30	85.01	−0.644	−0.071
HierSpeech	0.000	0.564	3.821	0.524	78.64	88.86	83.75	−0.755	0.082
IndexTTS2	0.000	0.698	3.101	0.667	85.28	86.93	86.11	0.963	1.921
MaskGCT	0.250	0.597	2.587	0.762	79.31	90.84	85.07	0.141	−0.054
Metavoice	0.500	0.494	2.464	0.857	77.56	80.89	79.23	−1.176	−1.076
OpenVoice	0.000	0.072	3.502	0.571	84.07	76.81	80.44	−1.534	2.236
Qwen3-TTS	0.000	0.552	3.732	0.476	86.47	86.44	86.46	0.791	2.207
StyleTTS2	0.000	0.354	4.038	0.762	85.02	80.91	82.97	−0.788	2.447
TorToiSe	0.000	0.379	3.849	0.429	85.65	82.46	84.06	−0.828	1.940
Vevo	1.000	0.280	2.254	0.857	81.23	87.12	84.17	−0.944	0.030
VibeVoice	0.250	0.423	2.526	0.667	82.96	88.66	85.81	−0.420	0.929
VoiceCraft	0.500	0.408	2.635	0.857	78.57	88.25	83.41	−0.577	−0.654
WhisperSpeech	0.250	0.286	3.300	0.667	84.71	79.80	82.25	−1.355	1.208
XTTS	0.750	0.422	2.707	0.810	80.92	86.05	83.48	−1.289	0.490

Listening Test Instructions

The following instructions were shown to participants on Prolific before each listening test.

Reconstruction (Speaker Similarity)

In this study, you will assess synthetic output intended as a personalised communication aid. We have used AI to reconstruct how the speaker sounded before they developed a speech impairment. We ask you to compare the synthetic samples against a real recording of the speaker (reference), and choose the most similar and least similar.

In the cases where the speaker has a speech impairment, do not focus on whether the speech impairment is matched to the reference speaker; instead, think of whether the synthetic output could sound like the person before they developed the impairment. You cannot select the same sample as both most and least similar.

You can play each sample (including the reference) as many times as you need, and change your selected answers as you listen along. In some instances, the audio sample might ask you to select it as best (most similar) or worse (least similar) — these are attention checks, so follow those instructions carefully.

Please wear headphones and be in a quiet environment before starting. When you are ready to start, please click Continue.

Question: Please select the audio sample that is most similar to the reference speaker, and the audio sample that is least similar to the reference speaker.

Remember that, if the speaker has a speech impairment, we want you to choose the sample that could sound closest/farthest to the person before developing the speech impairment.

Intelligibility

In this study, you will assess synthetic output intended as a personalised communication aid. We have used AI to reconstruct how the speaker sounded before they developed a speech impairment. We ask you to choose which audio sample is easiest to understand, and which one is hardest.

Sometimes, you might notice that the audio samples sound like different individuals — focus on how easy or difficult it is to understand only. You cannot select the same sample as both easiest and hardest to understand.

You can play each sample (including the reference) as many times as you need, and change your selected answers as you keep listening along. In some instances, the audio sample might ask you to select it as best (easiest to understand) or worse (hardest to understand) — these are attention checks, so follow those instructions carefully.

In order to progress to the next screen, you need to make sure that you have listened to all audio samples in full at least once.

Please wear headphones and be in a quiet environment before starting. When you are ready to start, please click Continue.

Question: Please select the audio sample that is easiest to understand, and the audio sample that is hardest to understand.

Remember that we want you to focus on how easy it is to understand, regardless of whether the speaker sounds different.

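WER is computed by transcribing the synthesised audio with Whisper large-v3 and comparing the transcript against the target text. Values above 1.0 mean that the number of word errors (substitutions, deletions, and insertions) exceeds the number of words in the target text, which can only happen when the transcript contains insertions.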
Spk. Sim. is the cosine similarity between WeSpeaker embeddings of the synthesised audio and the speaker's voice prompt. Higher values indicate the system better preserved the speaker's voice characteristics.
TTSDS|LibriTTS and TTSDS|SAP are the overall TTSDS scores when using LibriTTS or SAP as the reference dataset respectively. TTSDS Mean is their average.
PER is the Phoneme Error Rate. UTMOS is a neural predictor of the Mean Opinion Score (MOS).
Recon. Log Worth and Intel. Log Worth are Plackett-Luce log worth estimates from subjective evaluations of speaker similarity and perceived intelligibility respectively.
Recording rows show the speaker's own reading of the target sentence (ground truth). For low-intelligibility speakers, the recording itself can have a very high WER (1.250 for Speaker E).
WER, Spk. Sim., UTMOS, and PER in the audio tables are computed on the individual utterance shown. All other metrics (TTSDS, Log Worth) are system-wide aggregates across all speakers.
Speaker identifiers are anonymised. All audio is from the SAP corpus.
Click any column header to sort the table. In the correlation explorer, hover over points for details.
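The WER and Spk. Sim. notes above can be sketched in a few lines of Python. This is an illustration under assumptions, not the evaluation code: the transcript and the embeddings are supplied by the caller (Whisper and WeSpeaker are not invoked), the lowercasing and whitespace tokenisation are assumed rather than taken from the evaluation, and the example transcript is invented to show how a WER of 1.25 can arise on a four-word target.

```python
from math import sqrt


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes a non-empty reference. Tokenisation is a plain lowercase
    whitespace split, which is an assumption made for this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


target = "navigate to o'hare airport"
# Invented transcript: 3 substitutions + 2 insertions against 4 target words.
invented_transcript = "never get to her hair port"
print(word_error_rate(target, invented_transcript))  # 1.25

# Toy 3-dimensional vectors standing in for speaker embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the denominator is the length of the target text, two inserted words are enough to push a short sentence above 1.0 once the remaining words are also misrecognised.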