I’ve been thinking a lot lately about the challenges of creating realistic voices for video games, feature films, and beyond. As for what lies beyond: I see us developing computer systems that learn language the way a human does, through a combination of inborn mechanisms and lived experience. Idiolects reflect the individual, from the statistical analysis that leads a speaker to pronounce a word in a particular way, to the will that motivates a given choice of word. The future will bring such systems, or we will have failed to harness the technology. My thought is that the challenge of producing them for money-making ventures like the entertainment industry could become the Manhattan Project for voice technology.
Right now, improvements are glacial, in part because too much of the field is hampered by asking only certain types of questions and using only certain types of methods to approach them. From what I can tell, the speech technology industry is overwhelmingly dominated by electrical engineers and computer scientists, and unfortunately they are becoming more and more entrenched. Their strides have been impressive. But we’re not going to solve the problems of extracting the salient features of natural speech prosody, describing them, codifying them, and reproducing them in synthesized voices unless we open up the field to a variety of methods. We need musicians and linguists, psychologists and actors, and who knows who else, to help solve these problems.
We need some real challenges, some exciting ones, with great potential benefits. I think creating voices for the entertainment industry is one such challenge. In doing a quick search around the web, I came across the following essay by Keith Wiley, from April 2001. So, I’m not the first to think of this. Forgive his sometime coarseness, and the occasional typo. I like his enthusiasm, and his spirit.
Any thoughts?