An Evaluation Framework for Voice Reconstruction

Audio demo for 17 zero-shot TTS systems evaluated on speakers with speech disorders from the Speech Accessibility Project (SAP) corpus

Each system receives a voice prompt (one recorded utterance from the speaker) and must synthesise a different target sentence in that speaker's voice. Objective metrics and subjective evaluations are shown per system.

WER = Word Error Rate (lower is better)
Spk. Sim. = Speaker Similarity cosine score (higher is better)
TTSDS = TTS Distribution Score (higher is better)
PER = Phone Error Rate (lower is better)
UTMOS = predicted Mean Opinion Score (higher is better)
Log Worth = Plackett-Luce subjective log worth (higher is better)
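As a rough illustration of two of these metrics, the sketch below computes a word error rate as word-level edit distance divided by the reference length, and a cosine score between two vectors. It is a minimal sketch, not the scoring pipeline behind this page: the function names are hypothetical, the transcript is invented for the example, and the ASR and speaker-embedding models that would supply real transcripts and vectors are not specified here. It also shows why an error rate can exceed 1 (as with the 125% recording WER and the PER values above 1.000): inserted words count as errors, but the denominator is only the reference length.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


target = "navigate to o'hare airport"
# Invented ASR output, used only to show a WER above 1.0.
invented_transcript = "never get to a hairy port"
wer = word_error_rate(target, invented_transcript)  # 5 edits / 4 words
```

The same edit-distance routine would give a phone error rate if it were run over phone sequences instead of words.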

Interactive Correlation Explorer

Select two metrics to visualise their Spearman rank correlation across all 18 entries (the 17 TTS systems plus the Recording). The orange point marks the Recording (ground truth).
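For readers who want to reproduce this kind of comparison offline, here is a minimal sketch of a Spearman rank correlation: rank both metrics (ties share their average rank), then take the Pearson correlation of the ranks. The helper names are hypothetical, and the two lists are the system-level TTSDS Mean and Recon. Log Worth values from the tables on this page, in row order. Whether the explorer computes its correlation in exactly this way is an assumption.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Row order: Recording, E2-TTS, F5-TTS, Fish Speech, GPT-SoVITS, HierSpeech,
# IndexTTS2, MaskGCT, Metavoice, OpenVoice, Qwen3-TTS, StyleTTS2, TorToiSe,
# Vevo, VibeVoice, VoiceCraft, WhisperSpeech, XTTS.
ttsds_mean = [85.51, 86.26, 85.52, 83.57, 85.01, 83.75, 86.11, 85.07, 79.23,
              80.44, 86.46, 82.97, 84.06, 84.17, 85.81, 83.41, 82.25, 83.48]
recon_log_worth = [0.000, 0.672, 0.182, 0.356, -0.644, -0.755, 0.963, 0.141,
                   -1.176, -1.534, 0.791, -0.788, -0.828, -0.944, -0.420,
                   -0.577, -1.355, -1.289]

rho = spearman(ttsds_mean, recon_log_worth)
print(f"Spearman rho (TTSDS Mean vs Recon. Log Worth): {rho:.3f}")
```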

High Intelligibility Speaker

Speaker B

High Intelligibility · Rec WER 0%

Voice prompt: “at last billy woodchuck's lips began to feel very weird, puckered up as they were”

Target text: “play the dear evan hansen soundtrack”

System         WER    Spk. Sim.  UTMOS  PER    TTSDS|LibriTTS  TTSDS|SAP  TTSDS Mean  Recon. Log Worth  Intel. Log Worth
Recording      0.000  0.684      1.433  0.852  77.39           93.64      85.51       0.000             0.000
E2-TTS         1.333  0.736      1.494  1.037  81.92           90.60      86.26       0.672             0.198
F5-TTS         0.000  0.660      1.470  0.741  85.25           85.80      85.52       0.182             1.464
Fish Speech    0.000  0.452      2.076  0.778  86.80           80.34      83.57       0.356             2.340
GPT-SoVITS     0.167  0.381      1.628  0.519  82.72           87.30      85.01       −0.644            −0.071
HierSpeech     0.333  0.315      3.290  0.370  78.64           88.86      83.75       −0.755            0.082
IndexTTS2      0.500  0.719      1.458  0.519  85.28           86.93      86.11       0.963             1.921
MaskGCT        0.667  0.661      1.467  0.778  79.31           90.84      85.07       0.141             −0.054
Metavoice      0.000  0.577      1.428  0.852  77.56           80.89      79.23       −1.176            −1.076
OpenVoice      0.000  0.263      2.827  0.333  84.07           76.81      80.44       −1.534            2.236
Qwen3-TTS      0.000  0.612      2.168  0.556  86.47           86.44      86.46       0.791             2.207
StyleTTS2      0.000  0.354      3.849  0.259  85.02           80.91      82.97       −0.788            2.447
TorToiSe       0.167  0.475      2.637  0.259  85.65           82.46      84.06       −0.828            1.940
Vevo           1.000  0.230      1.267  1.000  81.23           87.12      84.17       −0.944            0.030
VibeVoice      0.167  0.508      1.404  0.667  82.96           88.66      85.81       −0.420            0.929
VoiceCraft     0.833  0.400      1.914  0.815  78.57           88.25      83.41       −0.577            −0.654
WhisperSpeech  0.167  0.415      3.111  0.778  84.71           79.80      82.25       −1.355            1.208
XTTS           0.667  0.487      2.194  0.667  80.92           86.05      83.48       −1.289            0.490

Low Intelligibility Speaker

Speaker E

Low Intelligibility · Rec WER 125%

Voice prompt: “we had played a long while”

Target text: “navigate to o’hare airport”

System         WER    Spk. Sim.  UTMOS  PER    TTSDS|LibriTTS  TTSDS|SAP  TTSDS Mean  Recon. Log Worth  Intel. Log Worth
Recording      1.250  0.597      2.114  0.857  77.39           93.64      85.51       0.000             0.000
E2-TTS         0.250  0.760      2.235  0.952  81.92           90.60      86.26       0.672             0.198
F5-TTS         0.000  0.659      2.546  0.524  85.25           85.80      85.52       0.182             1.464
Fish Speech    0.500  0.319      3.740  0.810  86.80           80.34      83.57       0.356             2.340
GPT-SoVITS     0.500  0.576      2.889  0.810  82.72           87.30      85.01       −0.644            −0.071
HierSpeech     0.000  0.564      3.821  0.524  78.64           88.86      83.75       −0.755            0.082
IndexTTS2      0.000  0.698      3.101  0.667  85.28           86.93      86.11       0.963             1.921
MaskGCT        0.250  0.597      2.587  0.762  79.31           90.84      85.07       0.141             −0.054
Metavoice      0.500  0.494      2.464  0.857  77.56           80.89      79.23       −1.176            −1.076
OpenVoice      0.000  0.072      3.502  0.571  84.07           76.81      80.44       −1.534            2.236
Qwen3-TTS      0.000  0.552      3.732  0.476  86.47           86.44      86.46       0.791             2.207
StyleTTS2      0.000  0.354      4.038  0.762  85.02           80.91      82.97       −0.788            2.447
TorToiSe       0.000  0.379      3.849  0.429  85.65           82.46      84.06       −0.828            1.940
Vevo           1.000  0.280      2.254  0.857  81.23           87.12      84.17       −0.944            0.030
VibeVoice      0.250  0.423      2.526  0.667  82.96           88.66      85.81       −0.420            0.929
VoiceCraft     0.500  0.408      2.635  0.857  78.57           88.25      83.41       −0.577            −0.654
WhisperSpeech  0.250  0.286      3.300  0.667  84.71           79.80      82.25       −1.355            1.208
XTTS           0.750  0.422      2.707  0.810  80.92           86.05      83.48       −1.289            0.490

Listening Test Instructions

The following instructions were shown to participants on Prolific before each listening test.

Reconstruction (Speaker Similarity)

In this study, you will assess synthetic output intended as a personalised communication aid. We have used AI to reconstruct how the speaker sounded before they developed a speech impairment. We ask you to compare the synthetic samples against a real recording of the speaker (reference), and choose the most similar and least similar.

In the cases where the speaker has a speech impairment, do not focus on whether the speech impairment is matched to the reference speaker; instead, think of whether the synthetic output could sound like the person before they developed the impairment. You cannot select the same sample as both most and least similar.

You can play each sample (including the reference) as many times as you need, and change your selected answers as you listen along. In some instances, the audio sample might ask you to select it as best (most similar) or worse (least similar) — these are attention checks, so follow those instructions carefully.

Please wear headphones and be in a quiet environment before starting. When you are ready to start, please click Continue.

Question: Please select the audio sample that is most similar to the reference speaker, and the audio sample that is least similar to the reference speaker.

Remember that, if the speaker has a speech impairment, we want you to choose the sample that could sound closest/farthest to the person before developing the speech impairment.

Intelligibility

In this study, you will assess synthetic output intended as a personalised communication aid. We have used AI to reconstruct how the speaker sounded before they developed a speech impairment. We ask you to choose which audio sample is easiest to understand, and which one is hardest.

Sometimes, you might notice that the audio samples sound like different individuals — focus on how easy or difficult it is to understand only. You cannot select the same sample as both easiest and hardest to understand.

You can play each sample (including the reference) as many times as you need, and change your selected answers as you keep listening along. In some instances, the audio sample might ask you to select it as best (easiest to understand) or worse (hardest to understand) — these are attention checks, so follow those instructions carefully.

In order to progress to the next screen, you need to make sure that you have listened to all audio samples in full at least once.

Please wear headphones and be in a quiet environment before starting. When you are ready to start, please click Continue.

Question: Please select the audio sample that is easiest to understand, and the audio sample that is hardest to understand.

Remember that we want you to focus on how easy it is to understand, regardless of whether the speaker sounds different.
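The best/worst selections collected under both sets of instructions are what the Log Worth columns summarise. As a hedged sketch of how such judgements can be turned into log-worth scores, the code below expands each trial into pairwise outcomes (the chosen best beats every other sample shown, and every other sample beats the chosen worst) and fits a Bradley-Terry model, the two-item special case of Plackett-Luce, with a standard minorise-maximise update. This is a simplification swapped in for illustration, not the fitting procedure used for this page; the trial data, item labels, function names and the choice of anchor item are all hypothetical.

```python
import math
from collections import defaultdict


def best_worst_to_pairs(trials):
    """Expand (shown, best, worst) trials into (winner, loser) pairs."""
    pairs = []
    for shown, best, worst in trials:
        for item in shown:
            if item != best:
                pairs.append((best, item))   # best beats everything else
            if item != best and item != worst:
                pairs.append((item, worst))  # everything else beats worst
    return pairs


def fit_log_worth(pairs, anchor, iterations=200):
    """Bradley-Terry worths via MM updates, reported as logs relative to anchor.

    Assumes every item wins and loses at least once; otherwise a worth
    collapses to zero and the logarithm is undefined.
    """
    items = sorted({item for pair in pairs for item in pair})
    wins = defaultdict(int)
    meetings = defaultdict(int)  # comparison counts per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        meetings[frozenset((winner, loser))] += 1
    worth = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(meetings[frozenset((i, j))] / (worth[i] + worth[j])
                        for j in items if j != i)
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        worth = {item: value / total for item, value in updated.items()}
    return {item: math.log(worth[item]) - math.log(worth[anchor])
            for item in items}


# Hypothetical trials: (samples shown, chosen best, chosen worst).
trials = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "D"), "A", "D"),
    (("B", "C", "D"), "B", "D"),
    (("A", "C", "D"), "C", "A"),
    (("B", "C", "D"), "D", "B"),
    (("A", "B", "C"), "B", "A"),
]
log_worth = fit_log_worth(best_worst_to_pairs(trials), anchor="A")
for item, value in sorted(log_worth.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {value:+.3f}")
```

Because worths are only identified up to a common scale, one item has to be pinned at zero; positive values then mean "preferred more often than the anchor" and negative values the opposite.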