Text-to-speech is a technology that uses artificial intelligence to analyze text and natural language and generate synthetic speech with appropriate rhythm and intonation. The system converts input text into audio files in which the voice has human-like intonation.
With text-to-speech (TTS) technology, human-machine communication becomes easier and more natural. TTS can be applied in automated switchboard answering systems, public announcement systems, virtual assistants, audiobooks, movie narration, etc.
- Data preparation: record 30-35 hours of speech from a single speaker in a studio environment.
- Split the recordings into short sentence-level clips of under 2s each, paying attention to where the pauses fall.
- Once the data is ready, move on to training the machine learning models. The system uses two main models: the first is the Tacotron acoustic model, developed by Google in 2017 and 2018; the second is the WaveGlow vocoder, developed by NVIDIA in 2018. Combining the two models produces a high-quality, natural-sounding reading voice.
- After training, combine the modules: the output of the Tacotron module is fed directly into the WaveGlow module, and WaveGlow’s output is the audio.
- Build an API web service that converts the text in client requests into speech.
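
The last two steps (chaining the acoustic model into the vocoder, then exposing the result to clients) can be sketched roughly as follows. This is a minimal illustration, not the actual system: the two model functions are placeholders standing in for trained Tacotron and WaveGlow networks, and the names `make_synthesizer`, `handle_request`, the JSON field `"text"`, and the values of `SAMPLE_RATE`, `HOP_LENGTH` and `N_MELS` are all assumptions made for the example.

```python
import io
import json
import math
import wave
from typing import Callable, List

Mel = List[List[float]]   # spectrogram-like frames (assumed intermediate format)
Audio = List[float]       # waveform samples in [-1.0, 1.0]

SAMPLE_RATE = 22050       # assumed output sample rate
HOP_LENGTH = 256          # assumed number of audio samples per frame
N_MELS = 80               # assumed number of values per frame


def placeholder_acoustic_model(text: str) -> Mel:
    """Stand-in for a trained Tacotron: one dummy frame per character."""
    return [[float(ord(ch) % N_MELS)] * N_MELS for ch in text]


def placeholder_vocoder(mel: Mel) -> Audio:
    """Stand-in for a trained WaveGlow: HOP_LENGTH sine samples per frame."""
    n_samples = len(mel) * HOP_LENGTH
    return [0.5 * math.sin(2 * math.pi * 220.0 * i / SAMPLE_RATE)
            for i in range(n_samples)]


def make_synthesizer(acoustic_model: Callable[[str], Mel],
                     vocoder: Callable[[Mel], Audio]) -> Callable[[str], Audio]:
    """Chain the modules: the acoustic model's output is the vocoder's input."""
    def synthesize(text: str) -> Audio:
        mel = acoustic_model(text)
        return vocoder(mel)
    return synthesize


def to_wav_bytes(audio: Audio, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Encode float samples as a mono 16-bit PCM WAV file held in memory."""
    pcm = bytearray()
    for sample in audio:
        clipped = max(-1.0, min(1.0, sample))
        pcm += int(clipped * 32767).to_bytes(2, "little", signed=True)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()


def handle_request(body: bytes, synthesize: Callable[[str], Audio]) -> bytes:
    """Core of a hypothetical endpoint: JSON {"text": ...} in, WAV bytes out."""
    text = json.loads(body)["text"]
    return to_wav_bytes(synthesize(text))
```

In a real deployment, `handle_request` would be called from whichever web framework hosts the service, and the placeholders would be replaced by functions that run the trained models. Keeping the chaining in `make_synthesizer` means either module can be swapped without touching the service code.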