Speaker B
High Intelligibility · Rec WER 0%
Voice prompt: “at last billy woodchuck's lips began to feel very weird, puckered up as they were”
Target text: “play the dear evan hansen soundtrack”
| System | WER | Spk. Sim. | UTMOS | PER | TTSDS\|LibriTTS | TTSDS\|SAP | TTSDS Mean | Recon. Log Worth | Intel. Log Worth |
|---|---|---|---|---|---|---|---|---|---|
| Recording | 0.000 | 0.684 | 1.433 | 0.852 | 77.39 | 93.64 | 85.51 | 0.000 | 0.000 |
| E2-TTS | 1.333 | 0.736 | 1.494 | 1.037 | 81.92 | 90.60 | 86.26 | 0.672 | 0.198 |
| F5-TTS | 0.000 | 0.660 | 1.470 | 0.741 | 85.25 | 85.80 | 85.52 | 0.182 | 1.464 |
| Fish Speech | 0.000 | 0.452 | 2.076 | 0.778 | 86.80 | 80.34 | 83.57 | 0.356 | 2.340 |
| GPT-SoVITS | 0.167 | 0.381 | 1.628 | 0.519 | 82.72 | 87.30 | 85.01 | −0.644 | −0.071 |
| HierSpeech | 0.333 | 0.315 | 3.290 | 0.370 | 78.64 | 88.86 | 83.75 | −0.755 | 0.082 |
| IndexTTS2 | 0.500 | 0.719 | 1.458 | 0.519 | 85.28 | 86.93 | 86.11 | 0.963 | 1.921 |
| MaskGCT | 0.667 | 0.661 | 1.467 | 0.778 | 79.31 | 90.84 | 85.07 | 0.141 | −0.054 |
| Metavoice | 0.000 | 0.577 | 1.428 | 0.852 | 77.56 | 80.89 | 79.23 | −1.176 | −1.076 |
| OpenVoice | 0.000 | 0.263 | 2.827 | 0.333 | 84.07 | 76.81 | 80.44 | −1.534 | 2.236 |
| Qwen3-TTS | 0.000 | 0.612 | 2.168 | 0.556 | 86.47 | 86.44 | 86.46 | 0.791 | 2.207 |
| StyleTTS2 | 0.000 | 0.354 | 3.849 | 0.259 | 85.02 | 80.91 | 82.97 | −0.788 | 2.447 |
| TorToiSe | 0.167 | 0.475 | 2.637 | 0.259 | 85.65 | 82.46 | 84.06 | −0.828 | 1.940 |
| Vevo | 1.000 | 0.230 | 1.267 | 1.000 | 81.23 | 87.12 | 84.17 | −0.944 | 0.030 |
| VibeVoice | 0.167 | 0.508 | 1.404 | 0.667 | 82.96 | 88.66 | 85.81 | −0.420 | 0.929 |
| VoiceCraft | 0.833 | 0.400 | 1.914 | 0.815 | 78.57 | 88.25 | 83.41 | −0.577 | −0.654 |
| WhisperSpeech | 0.167 | 0.415 | 3.111 | 0.778 | 84.71 | 79.80 | 82.25 | −1.355 | 1.208 |
| XTTS | 0.667 | 0.487 | 2.194 | 0.667 | 80.92 | 86.05 | 83.48 | −1.289 | 0.490 |
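
As a reading aid for the WER column: the target text has six words, and the reported WER values all sit on multiples of 1/6, which fits the usual definition of WER as word-level edit distance divided by the reference length. The sketch below illustrates that definition only. It is not the evaluation code behind this table, the function name `word_error_rate` is made up for this example, and the misrecognized transcripts are invented, since the actual ASR outputs are not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


target = "play the dear evan hansen soundtrack"

# Hypothetical transcript (assumption): one substituted word out of six.
one_error = "play the deer evan hansen soundtrack"
print(round(word_error_rate(target, one_error), 3))  # 0.167
```

Because the denominator is the reference length, a transcript with many inserted words can push WER above 1, which is how a value such as 1.333 (eight edits over six reference words) can arise.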