1 Karlsruhe Institute of Technology (KIT), Germany
2 Carnegie Mellon University (CMU), USA
3 KIT Campus Transfer (KCT), Germany
Humans tend to speak louder and more clearly in challenging environments, such as noisy conditions or when addressing hearing-impaired listeners, a phenomenon known as the Lombard effect. To simulate this behavior in speech synthesis systems, we introduce a flow-matching-based text-to-speech (TTS) model trained with vocal effort and articulation pseudo-labels. The proposed model achieves continuous and disentangled control of vocal effort and articulation, while also enabling word-level emphasis for clarifying specific segments of an utterance. Experimental results show that these control mechanisms effectively improve clarity-related acoustic features. Furthermore, speech-in-noise experiments demonstrate that our model successfully simulates the intelligibility gains of human clear speech in noisy conditions.
| Control Mode | Spk | Base (articulation 0.5 / vocal effort 0.3) | Mid Scaling | High Scaling |
|---|---|---|---|---|
| Articulation Axis (Fixed Vocal Effort @ 0.3) | | | | |
| Varying Articulation (0.5 → 0.9) | 1 | | | |
| | 2 | | | |
| | 3 | | | |
| | 4 | | | |
| Vocal Effort Axis (Fixed Articulation @ 0.5) | | | | |
| Varying Vocal Effort (0.3 → 0.9) | 1 | | | |
| | 2 | | | |
| | 3 | | | |
| | 4 | | | |
| Joint Scaling (Full Lombard Shift) | | | | |
| Both Parameters Scaling | 1 | | | |
| | 2 | | | |
| | 3 | | | |
| | 4 | | | |
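The three control modes in the table can be read as paths through a two-dimensional (articulation, vocal effort) space. The sketch below enumerates the settings for each mode. It is illustrative only and is not taken from the model's code: the names `ControlSetting` and `control_grid` are hypothetical, the base pair is read as articulation 0.5 and vocal effort 0.3 (inferred from the fixed-axis rows), and the "Mid Scaling" value is assumed to be the linear midpoint between base and 0.9, which the table does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSetting:
    """Hypothetical container for the two control values."""
    articulation: float
    vocal_effort: float


# Base values as read from the table; HIGH is the upper end of both axes.
BASE = ControlSetting(articulation=0.5, vocal_effort=0.3)
HIGH = 0.9


def lerp(start: float, end: float, t: float) -> float:
    """Linear interpolation between start (t=0) and end (t=1)."""
    return start + t * (end - start)


def control_grid(mode: str, steps=(0.0, 0.5, 1.0)) -> list[ControlSetting]:
    """Return base/mid/high settings for one control mode.

    mode: "articulation", "vocal_effort", or "joint".
    steps: interpolation positions; 0.5 for "mid" is an assumption.
    """
    if mode not in ("articulation", "vocal_effort", "joint"):
        raise ValueError(f"unknown mode: {mode!r}")
    settings = []
    for t in steps:
        # Scale an axis only if the mode varies it; otherwise hold it at base.
        art = lerp(BASE.articulation, HIGH, t) if mode != "vocal_effort" else BASE.articulation
        eff = lerp(BASE.vocal_effort, HIGH, t) if mode != "articulation" else BASE.vocal_effort
        # Round to suppress floating-point noise in the interpolated values.
        settings.append(ControlSetting(round(art, 3), round(eff, 3)))
    return settings


if __name__ == "__main__":
    for mode in ("articulation", "vocal_effort", "joint"):
        print(mode, control_grid(mode))
```

Under these assumptions, each row of the table corresponds to one speaker synthesized at the three settings returned for that row's mode.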
The birch canoe slid on the smooth planks.
The fish twisted and turned on the bent hook.
The box was thrown beside the parked truck.
The friendly gang left the drug store.
Take the winding path to reach the lake.
Mesh wire keeps chicks inside.
The girl at the booth sold fifty bonds.
Mesh wire keeps chicks inside.