Realistic voice synthesis and natural speech comprehension
Here is a question for my readers: Is anyone developing a realistic system of voice synthesis that takes into account the prosody, especially the melody and rhythm, of natural speech? On the other end, what work is being done to facilitate machine comprehension of natural speech, and in particular of the meaning carried by speech prosody?
From what I can tell, one of the big holdups for this research is the existing systems of transcription and the theories of prosody that go with them, most of which seem geared toward structural elements and away from anything that smacks of affective content. Systems like ToBI may be well suited to codifying and describing certain aspects of speech prosody, and to comparing them across languages, but they are decidedly unsuited to describing affective speech prosody. In support of this argument, I would say that widely divergent stimuli may be notated alike in a system that, for example, treats pitch merely as high or low, even one that attends to directionality. Before this year is out, I hope to produce some publishable material that addresses this specific issue by showing the (unacceptable) results of transcribing natural data according to various popular systems of transcription. I welcome any input that counters or supports this supposition.
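To make the high-or-low argument concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two pitch contours are invented numbers rather than measured data, the 200 Hz midline is arbitrary, and `coarse_label` is a deliberately crude stand-in for a binary pitch notation, not an implementation of ToBI or any other published system.

```python
import math


def semitones(f_hz, ref_hz):
    """Interval from ref_hz up to f_hz, in semitones."""
    return 12.0 * math.log2(f_hz / ref_hz)


def coarse_label(contour_hz, midline_hz):
    """Label each pitch target H or L relative to a midline (hypothetical scheme).

    Direction is implicit in the sequence: "L H L" is a rise followed by a fall.
    """
    return " ".join("H" if f >= midline_hz else "L" for f in contour_hz)


def excursion_semitones(contour_hz):
    """Total pitch range of the contour, lowest to highest target, in semitones."""
    return semitones(max(contour_hz), min(contour_hz))


# Invented pitch targets in Hz for two renditions of the same phrase.
subdued = [190.0, 210.0, 185.0]
animated = [150.0, 340.0, 120.0]
midline = 200.0  # arbitrary reference chosen for this illustration

for name, contour in (("subdued", subdued), ("animated", animated)):
    label = coarse_label(contour, midline)
    span = excursion_semitones(contour)
    print(f"{name}: {label}  (range {span:.1f} semitones)")
```

Both contours come out as "L H L", yet one spans roughly two semitones and the other about an octave and a half. Whatever affective difference a listener hears between them is exactly what the coarse notation throws away.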
And what about durations? A large body of research makes it clear that the natural flow of speech is divided into smaller intonational phrases, or intonation units, and that one major marker of these divisions is the ebb and flow of time: specifically, the common lengthening at the ends of units and the nearly as common rushing, or anacrusis, observed at their beginnings. Are there any existing systems of transcription and description that place these aspects of time on an equal footing with pitch movements?
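As a rough sketch of what putting time on that equal footing might involve, the Python below compares the first and last syllable of a unit with the syllables in between. The syllable durations are invented, the function name `timing_profile` is hypothetical, and treating only the first syllable as the anacrusis is a simplification; real data would need a more careful definition of each region.

```python
from statistics import mean


def timing_profile(durations_ms):
    """Compare edge syllables of one intonation unit with its medial syllables.

    Returns ratios relative to the mean medial duration: a final_ratio above 1
    suggests final lengthening, an initial_ratio below 1 suggests a rushed onset.
    """
    if len(durations_ms) < 3:
        raise ValueError("need at least three syllables to separate edges from middle")
    medial = mean(durations_ms[1:-1])
    return {
        "initial_ratio": durations_ms[0] / medial,
        "final_ratio": durations_ms[-1] / medial,
    }


# Invented syllable durations (milliseconds) for a single intonation unit.
unit = [110, 160, 150, 170, 290]
profile = timing_profile(unit)
print(profile)
```

A notation that recorded these two ratios alongside its pitch labels would at least register the rushing and lengthening that mark the unit's boundaries.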
Voice recognition has come a long way, yet it is still far from allowing machines to comprehend the natural flow of speech, much less the emotional meaning that lies beneath that surface. At present, humans still bow to the needs of computers. I envision that in a generation or two this will change. I am truly amazed at the giant strides that have been made in computer graphics, as exhibited for instance by Pixar and DreamWorks. Yet they all still hire voice-over actors. Let’s set a goal in music and language research to bring the quality of voice synthesis to the level of computer graphics before the children born in 2007 graduate from college.
