Archive for Issues

Ethical and legal considerations of sampling

It would seem a whole new realm of ethical and legal considerations will arise out of the development of synthetic voices based, at least in part, on sampling of natural speech. One easy way to avoid these concerns, I suppose, would be to hire speakers, or use insider voices for the immediate needs of production, having all rights waived under contract. But these issues may arise anyhow from the capture and analysis end. That is, for instance, if one wishes to analyze a great deal of data from a particular region or dialect, it would be necessary to capture a range of speakers. Since the interest is in capturing natural data, this might best be accomplished if the speakers are unaware that they are being recorded. But would such eavesdropping be ethical, and would it be legal?

What if snippets of actual speakers were used for the later development of voices? Would the original speaker be recognizable? It would almost assuredly be possible to modify the resultant sound such that the speaker would not be recognizable. But would the product still be in some way legally tied to the original speaker? Does intellectual property extend to the products of our own voices? What if the sound was captured in public rather than clandestinely? Wouldn’t it be akin to publishing pictures of famous people who appeared in public? That is, would public presentation render moot any claims to intellectual or personal property rights? The problem of course would be ensuring the requisite sound quality under such conditions.

The ethics of this come up, even if the voice could be altered to mask the identity (i.e. by significantly changing the timbre and other prosodic qualities). I think these issues will have to be dealt with at some point. I recall a composer’s presentation a few years ago, in which he admitted to sampling some performance, which was later manipulated and modified to the extent that perhaps only he knew that the original had been used. Nonetheless, he felt it necessary to say something about it, to acknowledge his guilt regarding the matter. Just something to think about.

Automatic voice feature extraction

Elliott D. Ross and colleagues have long studied the impact of particular right hemisphere neuropathologies on affective speech prosody, syndromes collectively termed the aprosodias. (See the Song, Speech, and Brain bibliography for some details.) If we develop the tools for automatic extraction of voice features (ones that would be necessary to produce animated synthetic voices), it is possible to see a future where audio recordings of patients speaking become a normal part of a medical file. These audio recordings could be subject to automatic analysis and extraction of individual voice features. A comparison of such baseline recordings with post-event recordings could provide cues to identifying neuropathologies that might otherwise be missed. They could also serve as a method for the analysis and quantification of dysarthria and other voice-affecting disorders.

These features as well must certainly play a role in voice identification and verification systems. The problem at hand is finding a way to automatically (and reliably) extract these features of voice (timing, pitch, phonemes/allophonic variation, timbre) and to classify them for analysis and comparison.
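To make the extraction problem a little more concrete, here is a minimal sketch of one of the simplest such features, pitch, estimated by autocorrelation. This is only an illustration under stated assumptions: the function names (`synth_tone`, `estimate_pitch`), the 75-400 Hz search range, and the synthetic test tone are all my own choices for the example, and a clean sine wave is far easier than real speech, where voicing decisions, octave errors, and noise make reliable extraction much harder.

```python
import math


def synth_tone(freq_hz, sample_rate, n_samples):
    """Generate a pure sine tone (a stand-in for a voiced speech frame)."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def estimate_pitch(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) by unnormalized autocorrelation.

    The search range fmin..fmax is an assumed range for speaking voices.
    Returns None if no lag yields a positive correlation (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


if __name__ == "__main__":
    tone = synth_tone(200.0, 8000, 800)  # 0.1 s of a 200 Hz tone
    print(estimate_pitch(tone, 8000))
```

A single frame like this gives one pitch value; tracking it over successive frames would yield the contour that the timing and intonation analyses above would need.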

Such systems could go beyond medical applications as well. There is no reason why automatic extraction of features couldn’t be applied to military and intelligence work, to quickly identify dialects and languages, or to recognize an impostor, someone speaking a non-native dialect or language. These systems could also be used for pedagogical purposes to assist learners in acquiring a near-native accent in a foreign language, by providing a better understanding of the features common to native speakers, along with analysis of and feedback on the learner’s production. This would be a giant stride forward from the overly simplistic acoustic language learning tools currently available, which provide too literal a comparison between model and learner.

Is anyone working on developing these tools? I’ve heard nothing. Anyone interested?

Voices for entertainment and beyond

I’ve been thinking a lot lately about the challenges of creating realistic voices for video games, feature films, and beyond. What’s beyond: I see us developing computer systems that learn language in the way that a human does, through the combination of inborn mechanisms and lived experience. Idiolects reflect the individual, from the statistical analysis that leads them to pronounce a word in a particular way, to the will that motivates a given choice of word. The future will bring such systems, or we will have failed in harnessing the technology. My thought is the challenge of producing them for money-making ventures like the entertainment industry could become the Manhattan Project for voice technology.

Right now, we have glacial improvements, in part because too much of the field is hampered by asking only certain types of questions, using certain types of methods for approaching them. Mostly, from what I can tell, the speech technology industry is overwhelmingly dominated by electrical engineers and computer scientists, and unfortunately they become more and more entrenched. Their strides have been impressive. But we’re not going to solve the problems of extracting the salient features of natural speech prosody, describing them, codifying them, and reproducing them in synthesized voices unless we open up the field to a variety of methods. We need musicians and linguists, psychologists and actors, and who knows what, to help solve these problems.

We need some real challenges, some exciting ones, with great potential benefits. I think creating voices for the entertainment industry is one such challenge. In doing a quick search around the web, I came across the following essay by Keith Wiley, from April 2001. So, I’m not the first to think of this. Forgive his sometimes coarse language, and the occasional typo. I like his enthusiasm, and his spirit.

Any thoughts?

BCOME 2007 (Brevard Conference on Music Entrepreneurship) CfP

Brevard Conference on Music Entrepreneurship

Call For Papers

Brevard Conference on Music Entrepreneurship
When: July 27-29, 2007
Where: Brevard, North Carolina

Panel: “Disciplining Entrepreneurship in Music Higher Education”

America’s music schools are adopting entrepreneurship education at a steady rate. However, the lack of an accepted definition or conception of “entrepreneurship” has spawned a diverse range of curricular structuring. Concurrently, a lack of scholarship concerning these efforts has buttressed perceptions of “entrepreneurship education in music” as “business education for music students.”

With new and progressive literature on entrepreneurship emerging from the economic, cognitive and social sciences, many Music Entrepreneurship programs (and students) have yet to reap the rewards of this scholarship. As this field emerges, developing a solid intellectual foundation is critical to the success and sustainability of these efforts.

The Brevard Conference on Music Entrepreneurship invites papers that address Entrepreneurship education in American music training. We are particularly interested in papers that explore:

1) Theoretical or philosophical structuring
2) Curricular and program design
3) New approaches to pedagogy
4) Interdisciplinary connections
5) Conceptualizations of “Entrepreneurship” in the context of Music training
6) Continuities and discontinuities of entrepreneurship education in business and arts curricula

Please send a 250-word abstract by email to archlute@mail.utexas.edu.
Deadline for abstracts is May 1, 2007. Papers will be limited to 10 minutes (approximately 8 pages, double-spaced). Inquiries concerning submissions are encouraged.

Leitmotivation

I had an idea recently, in the vein of sound design, or score writing for movies, games, what have you. I’m wondering if anyone is working on this. The idea is simple: extract certain salient patterns of melody and rhythm from the speech of an individual, transforming those patterns directly into musical motives, to serve both as leitmotifs and as materials for variations and development.

I think, for instance, of my two-year-old the other morning. He woke up cranky, waking his older brother, who came to our bedroom and announced the situation while crawling into bed with us. I invited the two-year-old to join us, to which he replied:

which he repeated numerous times. I realize, of course, that this is a paradigm for dismissal or dislike. The pattern is clear: a large rise (greater than an octave) at the beginning, a short pickup to a medium-length (or possibly long) accented note, followed by a rapid descent and fall in amplitude. The number of syllables/notes following the accent is mostly irrelevant, it would seem, as long as it is at least three. What a great motive for a character.
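As a rough illustration of the transformation I have in mind, here is a minimal sketch that quantizes a pitch contour to the nearest equal-tempered notes and lists the intervals between them. Everything specific here is an assumption for the sake of the example: the function names are my own, and the contour values are invented to mimic the shape described above (a pickup, a leap of more than an octave to the accented note, then a descent). They are not measurements of the actual utterance.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz):
    """Quantize a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def midi_to_name(midi):
    """Render a MIDI note number as a note name with octave, e.g. 69 -> 'A4'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def contour_to_motive(f0_hz):
    """Turn a list of pitch values (Hz) into note names and semitone intervals."""
    notes = [hz_to_midi(f) for f in f0_hz]
    intervals = [b - a for a, b in zip(notes, notes[1:])]
    return [midi_to_name(n) for n in notes], intervals


if __name__ == "__main__":
    # Hypothetical contour: pickup, accented peak, then four falling syllables.
    contour = [220.0, 466.16, 440.0, 392.0, 329.63, 261.63]
    names, intervals = contour_to_motive(contour)
    print(names)      # note names of the motive
    print(intervals)  # signed semitone steps between successive notes
```

Rhythm and amplitude would need the same treatment (durations quantized to note values, loudness mapped to dynamics) before this became a usable motive; quantizing to twelve-tone equal temperament is itself a design choice that discards much of what makes speech melody distinctive.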

Is anyone doing this sort of thing today? Extracting real motives from snippets of speech, transforming them into musical motives, then using them as leitmotifs and fodder for musical development?

The reliability of pause as a cue in speech

A question has recently come up regarding the reliability of pause as a cue to the segmentation of speech into intonational, or semantic meaning groups [1]. A few years ago, I had prepared a paper in conjunction with a colleague, Pentti Haddington, which addressed the question of the unreliability of pause in this context (click here for PowerPoint Slide Show). In our findings, pause was neither sufficient nor necessary by itself as a cue. Rather, pause sometimes co-appeared with other cues, and the conjunction of these cues together served to demark segmentation.

There is an important distinction that must be made between two types of pauses: the silent pause, which is perhaps what is most commonly referred to by the term; and the filled pause. The filled pause can be seen as a lengthening, or as a hesitation (each likely with its own causes and meaning). I believe that the filled pause and hesitation are likely more reliable cues. The question then is how one might automatically extract the acoustic signatures of these cues, in order to use them for parsing in speech recognition.
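For contrast, the silent pause is the easy case to detect automatically. Here is a minimal sketch, assuming a simple frame-energy approach; the function names, the 10 ms frame, the RMS threshold of 0.02, and the 200 ms minimum duration are all arbitrary choices for illustration, not values drawn from our study. Note that this finds only silent pauses. Filled pauses and lengthening carry energy, so they would pass straight through a detector like this, which is exactly why they pose the harder extraction problem.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of each non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_silent_pauses(samples, sample_rate, frame_ms=10,
                       threshold=0.02, min_pause_ms=200):
    """Return (start_s, end_s) spans where energy stays below the threshold.

    Threshold and minimum duration are assumed values; real recordings
    would need them tuned to the noise floor and speaking style.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_rms(samples, frame_len)
    min_frames = math.ceil(min_pause_ms / frame_ms)
    pauses, start = [], None
    # A sentinel of infinite energy closes any pause still open at the end.
    for idx, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if start is None:
                start = idx
        elif start is not None:
            if idx - start >= min_frames:
                pauses.append((start * frame_ms / 1000, idx * frame_ms / 1000))
            start = None
    return pauses


if __name__ == "__main__":
    rate = 1000
    tone = [math.sin(2 * math.pi * 100 * i / rate) for i in range(300)]
    signal = tone + [0.0] * 300 + tone  # sound, 0.3 s of silence, sound
    print(find_silent_pauses(signal, rate))
```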

Is anyone working on these issues?

[1] See for example, Seligman, M. “Nine Issues in Speech Translation,” Machine Translation, v. 15, no. 1/2, June 2000, pp. 149-186. This specific issue is discussed in section 5.

Superhuman Speech system (SHS)

I found an article on research being done at IBM that seeks to address many of the issues I raised in an earlier post. Any comments?

Realistic Voice synthesis and natural speech comprehension

Here is a question out to my readers: Is anyone developing a realistic system of voice synthesis that takes into account the prosody, especially the melody and rhythm, of natural speech? On the other end, what work is being done to facilitate machine comprehension of natural speech, in particular the meaning of speech prosody?

Music & Language: Parallels & Divergences

Attached are lecture notes, and the PowerPoint slide show from the talk “Music & Language: Parallels & Divergences” which was presented to the Cognitive and Perceptual Sciences (CaPS) at the University of California, Santa Barbara, on November 30, 2001.

Music & Language: Parallels & Divergences (.pdf)
Music & Language: Parallels & Divergences (PowerPoint)

Introduction to Evolutionary Musicology added

The presentation “An Introduction to Evolutionary Musicology” originally given at the Conference of the International Musicological Society in Leuven, Belgium, August 2002, has been added under Conferences/Presentations.

Infant Sound Environment Project (ISEP)

The Infant Sound Environment Project (ISEP) is a longitudinal study of the sound inputs to infants and the relationship of these inputs to the sound production of these children as they emerge from infancy. Follow-on research will address aspects of perceptual equivalence, to better understand this relationship. While previous studies have addressed the acquisition of words and grammar (how meaning and form emerge in the human mind), the present study will address a different aspect of this experience, namely the melodic and rhythmic elements of human vocal sounds, which play a major part in the expression and comprehension of emotions and attitudes. [1] These aspects of social and communicative behaviours, in language and music, carry a fundamental layer of meaning that has heretofore gone largely unexplored. Their foregrounding in this study will permit us to explore patterning, imitation, and creativity, without unduly prejudicing our assumptions regarding the nature of these vocal sounds.

It is often posited that language is unique to our species, and that what it contributes to our being is nothing less than defining of our nature. [2] The intent of this study is not directly to challenge this notion, but rather to question what fundamentally characterizes language in this regard. If language is the defining element of humanity, what is language? The answer to this question underlies the present research programme. It is quite possible that the “what” that will emerge is not exclusive to the domain of language, but rather more generally applicable to human social and communicative interactions, in the human capacity for pattern recognition within our natural environment. While clearly there are aspects of language that are outside the domain of sound production and perception (visual cues and gesture, as well as sign languages which are entirely exclusive of sound), it is my contention that the systems of pattern recognition and imitation that will be in evidence through this study are likely generalizable and comparable to other behaviours, rather than of a different nature. [3]

The Competition Model and its Relevance for Speech/Song Research

Jonathan G. Secora Pearl
Department of Linguistics
University of California, Santa Barbara

Corresponding address:

Jonathan Pearl
Music & Language Studies
7220 N. Rosemead Blvd., Suite 202-10
San Gabriel, CA 91775

email: jonathan@musiclanguage.net

ABSTRACT

The emerging field of music and language studies draws on the traditions and techniques of linguistics and musicology, with an empirical and cognitive bent. The present paper examines the relevance of the Competition Model from psycholinguistics to research that straddles the territories of speech prosody and music, in particular addressing the production and perception of the musical aspects (pitch, timing, amplitude, and timbre) of human vocal sounds.

INTRODUCTION

The Competition Model is an emergentist model for human language. It assumes that human brains develop according to a genetically-specified though plastic plan, which includes certain preferences in computing style arising in particular regions or pathways of the brain, as a result of native architectural and timing mechanisms. This is in contrast with nativist theories that implicitly presume innate representations, of grammar for instance, at the cortical level. According to proponents of the Competition Model, evidence for domain-specific language modules is grossly exaggerated, and most localization of language processing that does exist is domain-general in nature and likely emerges as a result of the interaction between the sensory environment and the brain’s uneven computational playing field, rather than being specified in the genes.

It is argued that although grammar is not given in the world, neither is it provided for in the human genome. This approach in particular explains why brain damage in infants and children does not result in the long-term deficits that appear as a result of analogous damage to adult brains. Adults have a life-long history of experience neurologically calcified, as a result of Hebbian learning. Children, on the other hand, have less experience from which to have solidified brain connectivity through stimulus/response-styled strengthening and weakening; in addition, continuing neurogenesis and synaptogenesis permit greater flexibility in attending to novel experiences, even if the resultant pathways may be computationally less efficient than in normals. For these reasons, maturation and learning are considered two aspects of the same events.

The Competition Model presumes that languages differ in the means by which linguistic information is encoded, and further that such differences are as likely quantitative as qualitative. Not only do they differ in their use of specific linguistic features (e.g., lexical tone, morphological inflections) but also in the degree to which various items bear relevant information for listeners. This is shown in cross-linguistic differences in relevance weightings and costs to processing for particular features in conflict with one another (for example: word order, animacy, subject-verb agreement, and gender and number markings used in decisions regarding transitivity). In support of the theory, it appears that the most cost-efficient of these features in terms of processing load and relevance (dubbed cue costs and cue validity), which can differ significantly from language to language, are the least susceptible to disturbance under brain damage, meaning they are most likely to be encoded reduplicatively in the brain. Since aphasic syndromes differ cross-linguistically in the specific deficits they engender, and since these differences reflect the inherent qualitative and quantitative variety among languages, this is taken as evidence that grammar is not innately and universally encoded, but rather based in the brain’s experience of the world.
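The notion of cue validity can be illustrated with a small worked example. In the formulation usually associated with the Competition Model, as I understand it, a cue's validity is the product of its availability (how often the cue is present when an interpretive decision is made) and its reliability (how often it points to the correct interpretation when present). The sketch below assumes that definition; the function name, the data layout, and the toy counts are hypothetical and do not come from any study cited here.

```python
def cue_validity(observations):
    """Compute (availability, reliability, validity) for one cue.

    `observations` is a list of (present, correct) boolean pairs, one per
    sentence: whether the cue was present, and if so whether it pointed
    to the correct interpretation. Assumes validity = availability * reliability.
    """
    total = len(observations)
    present = [correct for is_present, correct in observations if is_present]
    if total == 0 or not present:
        return 0.0, 0.0, 0.0
    availability = len(present) / total
    reliability = sum(1 for correct in present if correct) / len(present)
    return availability, reliability, availability * reliability


if __name__ == "__main__":
    # Hypothetical tally: the cue appears in 8 of 10 sentences and
    # signals the correct interpretation in 6 of those 8.
    toy = [(True, True)] * 6 + [(True, False)] * 2 + [(False, False)] * 2
    print(cue_validity(toy))
```

On this view a cue that is always present but often misleading, and one that is rarely present but never wrong, can end up with similar validity, which is one way to see how the weightings could differ quantitatively from language to language.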

RELEVANCE TO SPEECH/SONG COMPARISONS

It appears that much of the research involving aphasias has been grossly flawed by preconceived notions regarding the nature of these deficits, as well as by over-reliance on generative theories of language. Strikingly, studies in the literature on prosodic and musical deficits are largely based on presumptions drawn from the more abundant literature on aphasias. If those presumptions are flawed, then a great deal of the latticework upon which studies of neurologically-based deficits in linguistic prosody and the various amusias rest may collapse.

From the stance that any questions regarding the nature of language and music must be empirically tested, how would research regarding speech prosody and song fit into the scheme of the Competition Model? The literature is littered with hasty conclusions and crass simplifications of the nature of music. Music however, no less than language, appears to be a uniquely human attribute. It is ubiquitous across cultures, and throughout known history, and perhaps more primitive phylogenetically. [1] Just as no chimpanzee has spontaneously begun a dialogue on the nature of altruism, no bonobo has ever played so much as a hollow log or a blade of grass. Fruitless analogies between human song and whale or bird song aside, any continuity between human music and the behaviors of other animals is likely to be found in those aspects of human behavior that are common to both music and language. In particular, I would argue that it is in finding the commonalities between speaking and singing that we are likely to find a large part of the gulf that divides humanity from the rest of nature. And in those features, we will understand the cognitive roots that evolutionarily gave rise to both language and culture.

If adaptations that are claimed for language are not domain-specific, we are likely to find further evidence for this in attempting to define the difference between speech and song. Both are human vocal behaviors. Both leave an acoustic signature, and provide imperfect data to the perceptual apparatus of listeners. In each case, the behavior is most often directed towards or for the benefit of other humans, with an intent to express or communicate ideas or emotions. Further, there are cultural differences regarding which cues carry the most relevant information (e.g., rhythm, melody, divisions of the octave, timbre) that can be analyzed and reliably perceived (though in different ways cross-culturally). Each has aspects of grammar and syntax that are more or less clearly definable. Just as the local choice of phoneme sets varies in arbitrary ways, so too aspects of musical vocabulary vary according to seemingly arbitrary choices. Which features of the acoustic signal mark categorical boundaries varies as much for music as it does for language.

However, there are distinct contrasts between these two domains of human behavior. For instance, language contains a lexicon of semantically-grounded words, whereas music can be, and often is, entirely devoid of propositional meaning. The music in song is apart from the meaning of the words, sometimes independent, at times reinforcing, often contradicting. The musical contribution to song serves in a way to replace the natural prosody of speech. But prosodic aspects of speech contain and convey a great deal of information that is outside the grammar and lexicon of language.

In addition, there is some evidence in the literature for a dissociation between spoken prosody (both lexical and affective) and singing. These studies have used a variety of methodologies (experimental and clinical), and have implicated a multitude of brain regions, from left frontal lobe for lexical prosody (Monrad-Krohn 1947; and Buchanan et al 2000), to right temporoparietal regions (Ross & Mesulam 1979; Ross 1981) for affective prosody, to cerebellum and bilateral motor cortex/posterior inferior frontal gyri for dissociations between speaking and nonverbal singing of melody and rhythm (Riecker et al 2000). Clearly a great deal of study remains to be done.

POINTS FOR FUTURE RESEARCH

How is meaning altered when speech is sung? How do the musical aspects of song figure into the calculations of a listener? Can cue validity and cue cost be separately defined in musical terms? Might this provide further evidence for the case that language processing is in large part domain-general? Why is it that some aphasics, unable to utter a word of speech, can sing? Is it merely a matter of defining in finer detail the subtle aspects of these deficits? Is there any evidence to sustain dissociations between speaking and singing in comprehension? If there is, I have not yet found any in the literature. If not, it would be rather strange that the production of song, but not its reception, would dissociate from speech.

Likely the anecdotal evidence is skewed by flawed assumptions. Primarily, the issue is confounded by the fact that no one has sufficiently defined the subject matter under investigation. What does it mean to speak, that is different from what it means to sing? If anecdotal evidence supports the claim that brain-damaged individuals are able to engage in one but not the other of two similar activities, both including the expression of words by the voice, encoded by means of manipulating pitch, duration, amplitude, and timbre, then we need to understand better how these two behaviors differ. Are they two ends of a continuum, or is there a disjunction that divides up the otherwise shared behavior space? How can these matters be tested empirically?

Difficulty arises even in the simplest stages of such research. For instance, there is the nativist argument that brain structures have evolved solely for speech. However, nowhere in the literature is there a clear definition of speech as a solitary act. In fact, speech, like many human behaviors, is a complex of many parts. Without better definitions of the matter under investigation, claims one way or the other are unfalsifiable. Although the necessary distinction between production and perception is normally stipulated, even accounting for this distinction, the remaining behaviors are not simple acts. The perception of speech for instance involves acoustic input to the ears, sent to the primary auditory cortex. A great deal of calculating must go on, however, before the brain will recognize the auditory input as a meaningful signal. Interestingly, there is evidence that the brain early on recognizes human vocal sounds as special (Belin et al 2000), yet this only serves further to link speaking and singing in their uniqueness as stimuli, rather than to distinguish them from each other.

Here is a hypothetical, if entirely speculative, sequence of events: First there is the segmentation of the signal by sources (the “cocktail party effect”). The signal may likely include not only other voices, but environmental sounds as well, which must be filtered out as irrelevant. Next, the signal is parsed into phonemic units, which are further recalibrated based on context (e.g. coarticulation effects, nasalization). Allowances must be made for dialectal and idiolectal variation, for proper categorization of these sounds. In parallel, there will be processing of pitch, intensity and timing. Calculations will go on to determine which aspects of the pitch are local, some relevant for phonemic categorization and others for lexical prominence, and which are more global, and therefore relevant for affective determinations of attitude or judgments on the encoded meanings. Some allowances must be made for individual differences of voice quality, perhaps based on style of speaking or physiological issues such as hoarseness, or lack of muscular control (dysarthria) due to aging or disease. It quickly becomes clear that to speak of a speech act is a polite fiction, if the implication is that such an utterance can be easily qualified and quantified.

For this reason, many of the deficits that appear to affect specific grammatical or lexical processing may in fact be the result of problems higher along one or another of the secondary processing pathways. As Bates et al (1998) note: “If we experience two stimuli in exactly the same way, then (by definition) we do not know that they are different.” (p. 599) It follows then that what can be distinguished in normals, or dissociated in pathologies, is somehow different in terms of brain processing. Surely, there are many distinctions that the brain is incapable of noticing (or disinclined to notice). Conversely, sharp boundaries do exist in perception for graded acoustic events, such as the categorical boundary between the phonemes /b/ and /p/; and as noted in Bates (in press, p. 8), this appears not to be a species-specific phenomenon. The same is likely true for categorical perception of colors.

The point is: graded phenomena in the world can be perceived as disjunct by living brains. Where brains fail to make a distinction, the phenomena are for our purposes categorically the same. It is by identifying and quantifying the features used by brains that we will come to understand how seemingly equivalent behaviors do in fact differ, likewise how apparently different behaviors may utilize shared processes in the brain. Therefore the task of specifying dissociations is largely a matter of determining the level of processing at which each dissociation occurs. If these levels are consistent across subjects, they can be viewed as universal brain mechanisms (without regard at this point for whether they are innate or emergent). Where they differ, it is likely the result of individual differences (perhaps based in experience or native abilities) or failure to specify the stimuli with sufficient detail. In many cases, the technology for such fine-grained distinctions may not yet exist.

FOOTNOTES

[1] This is a contentious point. Some have argued that music is not universally understood and appreciated by individuals across cultures. Others have noted that not all cultures have a native music. For example, Southern Popaluca has been cited in this regard. Southern Popalucan music is all borrowed from Spanish and popular Mexican traditions. On the one hand, such cases may be the exceptions that prove the rule. More deeply indicative, however, is the question regarding what features distinguish music from language. Inherent in all spoken languages are manipulations of timing, intonation, and timbre, which are features shared in common between musical and linguistic phenomena. Arguably, even signed languages, while lacking sound, contain similar and analogous features, as has been argued by Sherman Wilcox among others.

REFERENCES

BATES, E. (in press). “On the nature and nurture of language.” In R. Levi-Montalcini, D. Baltimore, R. Dulbecco, & F. Jacob (Series Eds.) & E. Bizzi, P. Calissano, & V. Volterra (Vol. Eds.), Frontiere della biologia [Frontiers of biology]. The brain of homo sapiens. Rome: Giovanni Treccani. [Prepublication version].

BATES, E., DEVESCOVI, A., & WULFECK, B. (2001). Psycholinguistics: a cross-language perspective. Annual Review of Psychology. Chippewa Falls, WI: Annual Reviews.

BATES, E., et al. (1998). “Innateness and emergentism.” In W. Bechtel & G. Graham (Eds.), A Companion to Cognitive Science (pp. 590-601). Malden, MA and Oxford: Blackwell Publishers.

BELIN, P., et al. (2000). “Voice-selective areas in human auditory cortex.” Nature 403 (20 January 2000), 309-312.

BUCHANAN, T. W., et al. (2000). “Recognition of emotional prosody and verbal components of spoken language: an fMRI study.” Cognitive Brain Research 9, 227-238.

MONRAD-KROHN, G. H. (1947). “Dysprosody or altered ‘melody of language’.” Brain 70, 405-415.

RIECKER, A., et al. (2000). “Opposite hemispheric lateralization effects during speaking and singing at motor cortex, insula and cerebellum.” NeuroReport 11 (9), 1997-2000.

ROSS, E. D. (1981, Sep). “The aprosodias: Functional-anatomic organization of the affective components of language in the right hemisphere.” Archives of Neurology 38, 561-569.

ROSS, E. D. & MESULAM, M.-M. (1979). “Dominant language functions of the right hemisphere? Prosody and emotional gesturing.” Archives of Neurology 36, 144-148.
