A new California chapter of the Applied Voice Input/Output Society is currently forming. Anyone interested in becoming involved, or in being informed of upcoming events, should contact jonathan (at) musiclanguage.net.
Archive for April, 2007
The Peter Wall Institute for Advanced Studies at the University of British Columbia (www.pwias.ubc.ca) is hosting a 3-day interdisciplinary Exploratory Workshop June 21-23 in conjunction with the Vancouver International Song Institute (www.visi.ca), a new and unique interdisciplinary professional training program for the study of Art Song, at the UBC School of Music June 17-23. The title of the workshop is “Art Song Anima: Ambiguity, Authenticity, Augury”, convened by Professor Rena Sharon, Artistic Director of VISI, and Drs. Eric Vatikiotis-Bateson (Linguistics) and Laurel Fais (Psychology). Its topics flow from an arts/humanities starting point on the first day (ambiguities and specificities in the setting of poetry to music), into discussion on day two of the phenomenology of speech/song intersections, comprising linguistics, vocal physiology, cognition, and neuroscience. The final day will include consideration of song from a biocultural perspective, with presentation of data about the use of song in therapeutic environments such as Alzheimer’s care, and its evolutionary role in individual development of parent/infant communication and collective social ritual.
I spoke with a prominent sound designer for animated features last night. He posed a rather intriguing problem: How do you make a talking moose sound organically like a talking moose? How do we create a voice that would represent a talking moose? How do we put the acoustic filters in place to take a voice and make it sound as if the human speech organs were inside the resonant cavity of a moose?
The point is, what’s needed at the moment is to devise for sound the same sort of tool set that computer graphics designers have at their disposal. We need to develop the tool set for sound manipulation that produces truly organic-sounding results. We don’t need to create the sounds from whole cloth. Think of photo-manipulation software: we’ve got things to start with. We can make the recordings. The problem is how to manipulate the sound without creating all sorts of digital noise. How do we make the filters that change a moose into a goose into a hedgehog? How do we take a fast-speaking New Yorker and make them sound like a Georgian, or better yet, how do we produce a filter that renders French with a Russian accent?
It is a problem whose resolution will depend on pulling together the right team of people, from a variety of backgrounds, using a variety of approaches. We need to understand what goes into the sounds in the first place that creates the identity of a fast-talking, angry New York cabbie or a slow-talking, treacly Atlanta land salesman. What are the features of a Russian speaking French that differ from those of a native speaker? I’ll give you a hint: it’s not as simple as the phoneme set. So we need some people to take apart the real organic sounds, while others work on putting them back together. A great deal of work is being done on the latter half, but very little on the former. It’s time to put them together.
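To make the “moose filter” idea concrete: the crudest such transformation simply rescales every frequency in a recording, as if the whole vocal apparatus were enlarged. The sketch below is a toy illustration only (numpy/scipy; the function name is my own invention); it resamples a signal so that, played back at the original rate, all frequencies drop by a chosen factor. A real tool would shift formants independently of pitch, which is exactly the harder problem described above.

```python
import numpy as np
from scipy.signal import resample

def enlarge_resonator(signal, factor):
    """Crude 'bigger animal' effect: stretch the signal so that,
    at the original sample rate, every frequency is divided by
    `factor`. Pitch and formants move together here; a real voice
    filter would have to decouple them."""
    return resample(signal, int(len(signal) * factor))

# Toy demonstration: a 440 Hz tone becomes a 220 Hz tone.
sr = 8000
t = np.arange(sr) / sr                       # one second of samples
tone = np.sin(2 * np.pi * 440 * t)
deep = enlarge_resonator(tone, 2.0)

spectrum = np.abs(np.fft.rfft(deep))
freqs = np.fft.rfftfreq(len(deep), 1 / sr)
peak = freqs[np.argmax(spectrum)]            # dominant frequency of `deep`
```

This resample trick is why sped-up tape sounds like chipmunks; making a moose rather than a chipmunk means estimating the spectral envelope and warping it separately from the pitch, which is where the team described above comes in.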
This will be done. It’s just a question of who, and when.
It would seem a whole new realm of ethical and legal considerations will arise out of the development of synthetic voices based, at least in part, on sampling of natural speech. One easy way to avoid these concerns, I suppose, would be to hire speakers, or use insider voices for the immediate needs of production, having all rights waived under contract. But these issues may arise anyhow from the capture and analysis end. That is, for instance, if one wishes to analyze a great deal of data from a particular region or dialect, it would be necessary to capture a range of speakers. Since the interest is in capturing natural data, this might best be accomplished if the speakers are unaware that they are being recorded. But would such eavesdropping be ethical, and would it be legal?
What if snippets of actual speakers were used for the later development of voices? Would the original speaker be recognizable? It would almost assuredly be possible to modify the resultant sound such that the speaker would not be recognizable. But would the product still be in some way legally tied to the original speaker? Does intellectual property extend to the products of our own voices? What if the sound was captured in public rather than clandestinely? Wouldn’t it be akin to publishing pictures of famous people who appeared in public? That is, would public presentation render moot any claims to intellectual or personal property rights? The problem of course would be ensuring the requisite sound quality under such conditions.
The ethics of this come up even if the voice could be altered to mask the identity (i.e., by significantly changing the timbre and other prosodic qualities). I think these issues will have to be dealt with at some point. I recall a composer’s presentation a few years ago in which he admitted to sampling some performance, later manipulated and modified to the extent that perhaps only he knew the original had been used. Nonetheless, he felt it necessary to say something about it, to acknowledge his guilt in the matter. Just something to think about.
Elliott D. Ross and colleagues have long studied the impact of particular right hemisphere neuropathologies on affective speech prosody, syndromes collectively termed the aprosodias. (See the Song, Speech, and Brain bibliography for some details.) If we develop the tools for automatic extraction of voice features (ones that would be necessary to produce animated synthetic voices), one can imagine a future where audio recordings of patients speaking become a normal part of a medical file. These audio recordings could be subject to automatic analysis and extraction of individual voice features. A comparison of such baseline recordings with post-event recordings could provide cues to identifying neuropathologies that might otherwise be missed. They could also serve as a method for the analysis and quantification of dysarthria and other voice-affecting disorders.
These features must also play a role in voice identification and verification systems. The problem at hand is finding a way to automatically (and reliably) extract these features of voice (timing, pitch, phonemes/allophonic variation, timbre) and to classify them for analysis and comparison.
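To give a sense of what extracting even one of these features involves, here is a minimal pitch estimator based on autocorrelation, in plain numpy. It is a sketch under simple assumptions (the function name and parameter defaults are mine); production systems would add voicing decisions, frame-by-frame tracking, and smoothing.

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a short audio frame
    by finding the autocorrelation peak within the plausible lag range."""
    frame = frame - frame.mean()                       # remove DC offset
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)                                # shortest allowed period
    hi = int(sr / fmin)                                # longest allowed period
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

# Toy check: a pure 220 Hz tone.
sr = 16000
t = np.arange(2048) / sr
f0 = estimate_pitch(np.sin(2 * np.pi * 220 * t), sr)   # close to 220 Hz
```

Pitch is the easy case; timing, allophonic variation, and timbre have no comparably simple one-line definition, which is part of why the extraction problem is still open.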
Such systems could go beyond medical applications as well. There is no reason why automatic extraction of features couldn’t be applied to military and intelligence purposes: quickly identifying dialects and languages, or recognizing an impostor, someone speaking a non-native dialect or language. These systems could also be used for pedagogical purposes, to assist learners in acquiring a near-native accent in a foreign language by providing a better understanding of the features common to native speakers, along with analysis of and feedback on the learner’s production. This would be a giant stride forward from the overly simplistic acoustic language-learning tools currently available, which provide too literal a comparison from model to learner.
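One way past the “too literal a comparison” problem is to align the learner’s contour with the model’s before scoring it, so a learner who merely speaks more slowly isn’t penalized for timing alone. A dynamic time warping sketch in plain Python (illustrative only; real systems would compare richer feature vectors than a bare pitch sequence):

```python
import math

def dtw_distance(model, learner):
    """Total mismatch between two contours after optimally
    aligning them in time (classic dynamic time warping)."""
    n, m = len(model), len(learner)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(model[i - 1] - learner[j - 1])
            D[i][j] = cost + min(D[i - 1][j],        # learner lags behind
                                 D[i][j - 1],        # learner rushes ahead
                                 D[i - 1][j - 1])    # in step
    return D[n][m]

# A learner who holds one note longer still matches perfectly:
same = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A genuinely wrong value shows up as real distance:
diff = dtw_distance([1, 2, 3], [1, 2, 4])
```

A naive point-by-point comparison would flag the first learner as wrong; after alignment, only the second one is, which is closer to the feedback a human teacher gives.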
Is anyone working on developing these tools? I’ve heard nothing. Anyone interested?
I’ve been thinking a lot lately about the challenges of creating realistic voices for video games, feature films, and beyond. What’s beyond: I see us developing computer systems that learn language in the way a human does, through the combination of inborn mechanisms and lived experience. Idiolects reflect the individual, from the statistical learning that leads them to pronounce a word in a particular way, to the will that motivates a given choice of word. The future will bring such systems, or we will have failed in harnessing the technology. My thought is that the challenge of producing them for money-making ventures like the entertainment industry could become the Manhattan Project of voice technology.
Right now, we have glacial improvements, in part because too much of the field is hampered by asking only certain types of questions and using certain types of methods to approach them. From what I can tell, the speech technology industry is overwhelmingly dominated by electrical engineers and computer scientists, and unfortunately they become more and more entrenched. Their strides have been impressive. But we’re not going to solve the problems of extracting the salient features of natural speech prosody, describing them, codifying them, and reproducing them in synthesized voices unless we open up the field to a variety of methods. We need musicians and linguists, psychologists and actors, and who knows who else, to help solve these problems.
We need some real challenges, exciting ones, with great potential benefits. I think creating voices for the entertainment industry is one such challenge. In a quick search around the web, I came across the following essay by Keith Wiley, from April 2001. So I’m not the first to think of this. Forgive his occasional coarseness and the odd typo. I like his enthusiasm, and his spirit.