Modern Approaches to Controllable Emotional Speech Synthesis

Dmytro Serhiiovych Ivashchenko; Oleksandr Oleksandrovych Marchenko

doi:10.18523/2617-3808.2025.8.28-37

Author(s)

Dmytro Serhiiovych Ivashchenko, National University of Kyiv-Mohyla Academy, Ukraine https://orcid.org/0009-0000-7504-6984
Oleksandr Oleksandrovych Marchenko, Taras Shevchenko National University of Kyiv, Ukraine https://orcid.org/0000-0002-5408-5279

DOI:

https://doi.org/10.18523/2617-3808.2025.8.28-37

Keywords:

deep learning, text-to-speech synthesis, natural language processing, emotional speech control, diffusion models

Abstract

This article presents a comprehensive review of current technologies for controllable emotional speech synthesis. It analyzes the evolution of neural architectures and systematizes approaches by underlying technology and by method of emotional control. It identifies the field's key challenges, including the disentanglement of speech features and the scarcity of data for low-resource languages, and outlines promising directions for the development of emotionally controllable speech synthesis systems.

Author biographies

Dmytro Serhiiovych Ivashchenko, National University of Kyiv-Mohyla Academy

PhD student in the Computer Science program, Faculty of Informatics, National University of Kyiv-Mohyla Academy, d.ivashchenko@ukma.edu.ua

Oleksandr Oleksandrovych Marchenko, Taras Shevchenko National University of Kyiv

Doctor of Physical and Mathematical Sciences, Professor at the Department of Mathematical Informatics, Faculty of Computer Science and Cybernetics, Taras Shevchenko National University of Kyiv, omarchenko@univ.kiev.ua

