Google DeepMind claims to have significantly improved a computer's ability to talk with its latest AI technology. This could help usher in a new era of "talking machines" as realistic as those in the sci-fi films "Her" and "Ex Machina". They could even become so realistic that you could be tricked into thinking you are talking to another person when, in actual fact, it is a machine on the other end of the phone.
Google DeepMind, which Google acquired in 2014 for a reported £400 million, has named its computer speech technology WaveNet. The technology significantly reduces the gap in quality between real human voices and current speech generation technologies. Current speech generation works in one of two ways: either it pieces together fragments of recorded human speech (the concatenative approach), which is notoriously difficult to do, or it uses electronically generated sounds (the parametric approach), which are easily tweaked and manipulated but end up sounding too robotic.
The difference with this technology is that, instead of sounds being explicitly programmed, pieced together and manipulated, it learns how a human voice should sound by listening to lots of examples. It then generates its own sounds based on what it has learned. Behind the scenes, the technology is an artificial neural network trained on the raw waveforms of real human voices. To generate realistic speech, the network has to work at the level of individual audio samples, 16,000 of them for every second of sound, which is incredibly challenging, not least because of the amount of computing power required.
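To get a feel for what 16,000 samples a second means in practice, here is a minimal Python sketch. It is not DeepMind's code: the function names are made up for illustration, and `toy_next_sample` is a hypothetical stand-in for the trained neural network (it just produces a pure tone). The only point is that the waveform gets built one sample at a time, with each new sample worked out from the ones before it.

```python
import math

SAMPLE_RATE = 16_000  # samples per second of audio, the figure quoted above


def samples_needed(seconds, sample_rate=SAMPLE_RATE):
    """Number of individual audio samples in a clip of the given length."""
    return int(round(seconds * sample_rate))


def toy_next_sample(history, freq_hz=220.0, sample_rate=SAMPLE_RATE):
    """Hypothetical stand-in for the trained network (an assumption, not
    WaveNet): predicts the next sample from the two previous ones.
    This particular recurrence yields a pure sine tone."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    return coeff * history[-1] - history[-2]


def generate(seconds, freq_hz=220.0, sample_rate=SAMPLE_RATE):
    """Build a waveform one sample at a time, feeding each output back in."""
    step = 2.0 * math.pi * freq_hz / sample_rate
    wave = [0.0, math.sin(step)]  # two seed samples to start the loop
    total = samples_needed(seconds, sample_rate)
    while len(wave) < total:
        wave.append(toy_next_sample(wave, freq_hz, sample_rate))
    return wave[:total]


if __name__ == "__main__":
    clip = generate(1.0)
    print(len(clip), "samples for one second of audio")
```

Even in this toy version the loop runs 16,000 times for a single second of sound. Swap the one-line formula for a large neural network and it becomes clear where all the computing power goes.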
To pit the technology against existing systems, US English-speaking and Mandarin Chinese-speaking participants were played audio clips in their own languages. Two clips came from Google’s existing concatenative and parametric text-to-speech (TTS) systems, and the other clip was speech generated by WaveNet. Although the participants judged WaveNet to be a considerable improvement on the existing systems, it still wasn't deemed as good as actual human speech.
So WaveNet is a significant improvement over existing TTS systems and opens up a lot of possibilities, such as generating music, and looks promising for audio modelling in general. However, it is still not convincing enough to hold a conversation with your wife whilst you watch the footy match, and it will be a long time before we see it in sat-navs or powering a smartphone app, for example. Still, I'm excited for what the future may hold because, as computers get more and more powerful, we could start to see this technology in use in our daily lives and have natural conversations with chatbots. If you can't wait though, have a listen to some samples on DeepMind's website. Personally, I think they are very good, though still slightly robotic. Let me know what you think in the comments.