Archive for Ideas

How to make a talking moose

I spoke with a prominent sound designer for animated features last night. He posed a rather intriguing problem: How do you make a talking moose sound organically like a talking moose? How do we create a voice that would represent a talking moose? How do we put the acoustic filters in place to take a voice and make it sound as if the human speech organs were inside the resonant cavity of a moose?

The point is, what’s needed at the moment is to devise for sound the same sort of tool set that computer graphic designers have at their disposal. We need to develop a tool set for sound manipulation that produces truly organic-sounding results. We don’t need to create the sounds from whole cloth. Think of photo-manipulation software. We’ve got things to start with. We can make the recordings. The problem is how to manipulate the sound without creating all sorts of digital noise. How do we make the filters that change a moose into a goose into a hedgehog? How do we take a fast-speaking New Yorker and make them sound like a Georgian? Or better yet, how do we produce a filter to speak French with a Russian accent?

It is a problem whose resolution will depend on pulling together the right team of people, from a variety of backgrounds, using a variety of approaches. We need to understand what goes into the sounds in the first place that creates the identity of a fast-talking, angry, New York cabbie or a slow-talking, treacly Atlanta land salesman. What are the features of a Russian speaking French that differ from those of a native speaker? I’ll give you a hint: It’s not as simple as the phoneme set. So, we need some people to take apart the real organic sounds, while we’ve got others working on putting them back together. There’s a great deal of work being done on the latter half, but very little on the former. It’s time to put them together.

This will be done. It’s just a question of who, and when.


Automatic voice feature extraction

Elliott D. Ross and colleagues have long studied the impact of particular right hemisphere neuropathologies on affective speech prosody, syndromes collectively termed the aprosodias. (See the Song, Speech, and Brain bibliography for some details.) If we develop the tools for automatic extraction of voice features (ones that would be necessary to produce animated synthetic voices), it would be possible to see a future where audio recordings of patients speaking become a normal part of a medical file. These audio recordings could be subject to automatic analysis and extraction of individual voice features. A comparison of such baseline recordings with post-event recordings could provide cues for identifying neuropathologies that might otherwise be missed. The recordings could also serve as a basis for the analysis and quantification of dysarthria and other voice-affecting disorders.

These features must surely also play a role in voice identification and verification systems. The problem at hand is finding a way to automatically (and reliably) extract these features of voice (timing, pitch, phonemes/allophonic variation, timbre) and to classify them for analysis and comparison.
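To make one of those features concrete, here is a minimal sketch of pitch extraction for a single frame of audio, using plain autocorrelation. It is only an illustration under stated assumptions: the function names, the 60 to 500 Hz search range, and the synthetic test tone are all invented for this example, and a real voice would need framing, voicing detection, and protection against octave errors that this sketch does not attempt.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=500.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    The fmin/fmax search range is an assumed, rough range for speech.
    Returns None when no lag gives a positive correlation (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

def synthetic_tone(freq_hz, sample_rate, n_samples):
    """Stand-in for a recording: a pure sine wave (hypothetical test input)."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

tone = synthetic_tone(200.0, 8000, 800)
pitch = estimate_pitch_hz(tone, 8000)
print(pitch)
```

Running the same function over successive short frames of a recording would give a pitch contour over time, which is the kind of raw material the timing and pitch comparisons above would start from.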

Such systems could go beyond medical applications as well. There is no reason why automatic extraction of features couldn’t be applied to military and intelligence work, to quickly identify dialects and languages, or to recognize an impostor, someone speaking a non-native dialect or language. These systems could also be used for pedagogical purposes, to assist learners in acquiring a near-native accent in a foreign language, by providing a better understanding of the features common to native speakers, along with analysis of and feedback on the learner’s production. This would be a giant stride forward from the overly simplistic acoustic language learning tools currently available, which provide too literal a comparison of model to learner.

Is anyone working on developing these tools? I’ve heard nothing. Anyone interested?


Voices for entertainment and beyond

I’ve been thinking a lot lately about the challenges of creating realistic voices for video games, feature films, and beyond. What’s beyond? I see us developing computer systems that learn language in the way that a human does, through the combination of inborn mechanisms and lived experience. Idiolects reflect the individual, from the statistical analysis that leads them to pronounce a word in a particular way, to the will that motivates a given choice of word. The future will bring such systems, or we will have failed in harnessing the technology. My thought is that the challenge of producing them for money-making ventures like the entertainment industry could become the Manhattan Project for voice technology.

Right now, we have glacial improvements, in part because too much of the field is hampered by asking only certain types of questions and using only certain types of methods for approaching them. From what I can tell, the speech technology industry is overwhelmingly dominated by electrical engineers and computer scientists, and unfortunately they are becoming more and more entrenched. Their strides have been impressive. But we’re not going to solve the problems of extracting the salient features of natural speech prosody, describing them, codifying them, and reproducing them in synthesized voices unless we open up the field to a variety of methods. We need musicians and linguists, psychologists and actors, and who knows who else, to help solve these problems.

We need some real challenges, some exciting ones, with great potential benefits. I think creating voices for the entertainment industry is one such challenge. In doing a quick search around the web, I came across the following essay by Keith Wiley, from April 2001. So, I’m not the first to think of this. Forgive his sometime coarseness, and the occasional typo. I like his enthusiasm, and his spirit.

Any thoughts?


Leitmotivation

I had an idea recently, in the vein of sound design, or score writing for movies, games, what have you. I’m wondering if anyone is working on this. The idea is simple: extract certain salient patterns of melody and rhythm from the speech of an individual, transforming those patterns directly into musical motives, to serve both as leitmotifs and as materials for variations and development.

I think, for instance, of my two-year-old the other morning. He woke up cranky, waking his older brother, who came to our bedroom and announced the situation while crawling into bed with us. I invited the two-year-old to join us, to which he replied:

which he repeated numerous times. I realize, of course, that this is a paradigm for dismissal or dislike. The pattern is clear: a large rise (greater than an octave) at the beginning, a short pickup to a medium-length (or possibly long) accented note, followed by a rapid descent and a fall in amplitude. The number of syllables/notes following the accent is mostly irrelevant, as long as there are at least three, it would seem. What a great motive for a character.

Is anyone doing this sort of thing today? Extracting real motives from snippets of speech, transforming them into musical motives, then using them as leitmotifs and fodder for musical development?
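As a rough sketch of the transformation step, assuming a pitch contour has already been extracted from the speech: snap each pitch to the nearest equal-tempered semitone, collapse repeated notes, and read off the intervals. The contour values below are invented for illustration (they are not a transcription of the utterance described above), and the function names are hypothetical. Rhythm and amplitude, which matter just as much to the motive, are left out.

```python
import math

def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def contour_to_motive(contour_hz):
    """Quantize a pitch contour to semitones and merge consecutive repeats."""
    notes = []
    for freq in contour_hz:
        note = hz_to_midi(freq)
        if not notes or notes[-1] != note:
            notes.append(note)
    return notes

def intervals(notes):
    """Signed semitone steps between consecutive notes of the motive."""
    return [b - a for a, b in zip(notes, notes[1:])]

# Hypothetical contour in Hz: a leap of more than an octave, then a descent.
contour = [220.0, 220.0, 466.16, 440.0, 392.0, 330.0, 262.0]
motive = contour_to_motive(contour)
steps = intervals(motive)
print(motive)  # MIDI note numbers
print(steps)   # semitone intervals; a first step above 12 is more than an octave
```

Storing the motive as intervals rather than absolute notes is a deliberate choice here: it lets the same shape be transposed, inverted, or stretched, which is what variation and development would need.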

