Archive for Terminology

How to make a talking moose

I spoke with a prominent sound designer for animated features last night. He posed a rather intriguing problem: How do you make a talking moose sound organically like a talking moose? How do we create a voice that would represent a talking moose? How do we put the acoustic filters in place to take a voice and make it sound as if the human speech organs were inside the resonant cavity of a moose?

The point is, what’s needed at the moment is to devise for sound the same sort of tool set that computer graphic designers have at their disposal. We need to develop the tool set for sound manipulation that produces true organic-sounding products. We don’t need to create the sounds from whole cloth. Think of photo-manipulating software. We’ve got things to start with. We can make the recordings. The problem is how do we manipulate the sound without creating all sorts of digital noise? How do we make the filters that change a moose into a goose into a hedgehog, and how do we take a fast-speaking New Yorker, and make them sound like a Georgian, or better yet, how do we produce a filter to speak French with a Russian accent?

It is a problem whose resolution will depend on pulling together the right team of people, from a variety of backgrounds, using a variety of approaches. We need to understand what goes into the sounds in the first place that creates the identity of a fast-talking, angry, New York cabbie or a slow-talking, treacly Atlanta land salesman. What are the features of a Russian speaking French that differ from those of a native speaker? I’ll give you a hint: It’s not as simple as the phoneme set. So, we need some people to take apart the real organic sounds, while we’ve got others working on putting them back together. There’s a great deal of work being done on the latter half, but very little on the former. It’s time to put them together.

This will be done. It’s just a question of who, and when.

Automatic voice feature extraction

Elliott D. Ross and colleagues have long studied the impact of particular right hemisphere neuropathologies on affective speech prosody, syndromes collectively termed the aprosodias. (See the Song, Speech, and Brain bibliography for some details.) If we develop the tools for automatic extraction of voice features (ones that would be necessary to produce animated synthetic voices), it would be possible to see a future where audio recordings of patients speaking would become a normal part of a medical file. These audio recordings could be subject to automatic analysis and extraction of individual voice features. A comparison of such baseline recordings with post-event recordings could provide cues to identifying neuropathologies that might be otherwise missed. They could also serve as a method for the analysis and quantification of dysarthria and other voice-affecting disorders.

These features as well must certainly play a role in voice identification and verification systems. The problem at hand is finding a way to automatically (and reliably) extract these features of voice (timing, pitch, phonemes/allophonic variation, timbre) and to classify them for analysis and comparison.
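As a rough illustration of what extracting one of these features might involve, here is a minimal sketch of pitch estimation by autocorrelation on a single short frame of audio. It is a toy under stated assumptions, not a reliable extractor: the function name estimate_pitch_hz, the 60 to 500 Hz search range, and the synthetic test tone are all my own choices for illustration, and real speech would need framing, voicing decisions, and protection against octave errors.

```python
import math

def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=500.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Hypothetical helper for illustration; fmin/fmax are assumed search limits.
    Returns None for a silent frame or when no positive correlation is found.
    """
    n = len(samples)
    mean = sum(samples) / n
    x = [s - mean for s in samples]           # remove any DC offset
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return None                           # silent frame: no pitch to report
    min_lag = max(1, int(sample_rate / fmax)) # shortest period considered
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlation of the frame with a copy of itself shifted by `lag` samples.
        score = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None
    return sample_rate / best_lag

# Synthetic 200 Hz tone, 50 ms long, sampled at 8 kHz (assumed values).
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(400)]
print(estimate_pitch_hz(tone, rate))
```

Pitch is only one of the features listed above. Timing, timbre, and allophonic variation would each need their own analysis, and classifying the results is a separate problem again.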

Such systems could go beyond medical applications as well. There is no reason why automatic extraction of features couldn’t be applied for military and intelligence applications, to quickly identify dialects and languages, or be able to recognize an impostor, someone speaking a non-native dialect or language. These systems could also be used for pedagogical purposes to assist learners in acquiring a near-native accent in a foreign language, by providing a better understanding of the features common to native speakers, and analysis and feedback on the learner’s production. This would be a giant stride forward from the overly simplistic acoustic language learning tools currently available, which provide too literal a comparison of model to learner.

Is anyone working on developing these tools? I’ve heard nothing. Anyone interested?

The reliability of pause as a cue in speech

A question has recently come up regarding the reliability of pause as a cue to the segmentation of speech into intonational, or semantic meaning groups [1]. A few years ago, I prepared a paper with a colleague, Pentti Haddington, which addressed the unreliability of pause in this context (the presentation is available as a PowerPoint slide show). In our findings, pause was neither sufficient nor necessary by itself as a cue. Rather, pause sometimes co-occurred with other cues, and the conjunction of these cues served to demarcate segmentation.

There is an important distinction that must be made between two types of pauses: the silent pause, which is perhaps what is most commonly referred to by the term; and the filled pause. The filled pause can be seen as a lengthening, or as a hesitation (each likely with its own causes and meaning). I believe that the filled pause and hesitation are likely more reliable cues. The question then is how might one automatically extract the acoustic signatures of these cues, in order to use them for parsing in speech recognition?
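For the silent pause, at least, the acoustic signature is easy to sketch: a stretch of low energy lasting longer than some minimum duration. The following is a minimal sketch under assumed values (10 ms frames, an RMS threshold of 0.02, a 200 ms minimum), with a hypothetical function name. It says nothing about filled pauses or hesitations, which would need duration and spectral cues, and given the findings above its output should be treated as one cue among several rather than as a segmentation.

```python
import math

def find_silent_pauses(samples, sample_rate, frame_ms=10,
                       threshold=0.02, min_pause_ms=200):
    """Return (start_s, end_s) spans whose frame RMS stays below threshold.

    Hypothetical helper; the frame size, threshold, and minimum duration
    are illustrative assumptions, not values from any study.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_frames = max(1, math.ceil(min_pause_ms / frame_ms))
    n_frames = len(samples) // frame_len
    pauses = []
    run_start = None                      # index of first quiet frame in a run
    for f in range(n_frames + 1):
        if f < n_frames:
            frame = samples[f * frame_len:(f + 1) * frame_len]
            rms = math.sqrt(sum(s * s for s in frame) / frame_len)
            quiet = rms < threshold
        else:
            quiet = False                 # sentinel frame closes a trailing run
        if quiet and run_start is None:
            run_start = f
        elif not quiet and run_start is not None:
            if f - run_start >= min_frames:
                pauses.append((run_start * frame_len / sample_rate,
                               f * frame_len / sample_rate))
            run_start = None
    return pauses

# Synthetic signal: 0.5 s tone, 0.3 s silence, 0.5 s tone, at 8 kHz.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / rate) for i in range(4000)]
signal = tone + [0.0] * 2400 + tone
print(find_silent_pauses(signal, rate))
```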

Is anyone working on these issues?

[1] See for example, Seligman, M. “Nine Issues in Speech Translation,” Machine Translation, v. 15, no. 1/2, June 2000, pp. 149-186. This specific issue is discussed in section 5.

Realistic Voice synthesis and natural speech comprehension

Here is a question out to my readers: Is anyone developing a realistic system of voice synthesis, that takes into account the prosody, especially the melody and rhythm, of natural speech? On the other end, what work is being done to facilitate machine comprehension of natural speech, in particular the meaning of speech prosody?

The Competition Model and its Relevance for Speech/Song Research

Jonathan G. Secora Pearl
Department of Linguistics
University of California, Santa Barbara

Corresponding address:

Jonathan Pearl
Music & Language Studies
7220 N. Rosemead Blvd., Suite 202-10
San Gabriel, CA 91775

email: jonathan@musiclanguage.net

ABSTRACT

The emerging field of music and language studies draws on the traditions and techniques of linguistics and musicology, with an empirical and cognitive bent. The present paper examines the relevance of the Competition Model from psycholinguistics on research that straddles the territories of speech prosody and music, in particular addressing the production and perception of the musical aspects (pitch, timing, amplitude, and timbre) of human vocal sounds.

INTRODUCTION

The Competition Model is an emergentist model for human language. It assumes that human brains develop according to a genetically-specified though plastic plan, which includes certain preferences in computing style arising in particular regions or pathways of the brain, as a result of native architectural and timing mechanisms. This is in contrast with nativist theories that implicitly presume innate representations, of grammar for instance, at the cortical level. According to proponents of the Competition Model, evidence for domain-specific language modules is grossly exaggerated, and most localization of language processing that does exist is domain-general in nature and likely emerges as a result of the interaction between the sensory environment and the brain’s uneven computational playing field, rather than being specified in the genes.

It is argued that although grammar is not given in the world, neither is it provided for in the human genome. This approach in particular explains why brain damage in infants and children does not result in the long-term deficits that appear as a result of analogous damage to adult brains. Adults have a life-long history of experience neurologically calcified, as a result of Hebbian learning. Children on the other hand have less experience from which to have solidified brain connectivity through stimulus/response-styled strengthening and weakening; in addition, continuing neurogenesis and synaptogenesis permit greater flexibility in attending to novel experiences, even if the resultant pathways may be computationally less efficient than in normals. For these reasons, maturation and learning are considered two aspects of the same events.

The Competition Model presumes that languages differ in the means by which linguistic information is encoded, and further that such differences are as likely quantitative as qualitative. Not only do they differ in their use of specific linguistic features (e.g., lexical tone, morphological inflections) but also in the degree to which various items bear relevant information for listeners. This is shown in cross-linguistic differences in relevance weightings and costs to processing for particular features in conflict with one another (for example: word order, animacy, subject-verb agreement, and gender and number markings used in decisions regarding transitivity). In support of the theory, it appears that the features that are most cost efficient in terms of processing load and relevance (dubbed cue cost and cue validity), which can differ significantly from language to language, are the least susceptible to disturbance under brain damage, meaning they are most likely to be encoded redundantly in the brain. Since aphasic syndromes differ cross-linguistically in the specific deficits they engender, and in particular since these differences reflect the inherent qualitative and quantitative variety among languages, this is taken as evidence that grammar is not innately and universally encoded, but rather based in the brain’s experience of the world.

RELEVANCE TO SPEECH/SONG COMPARISONS

It appears that much of the research involving aphasias has been grossly flawed by preconceived notions regarding the nature of these deficits, as well as over-reliance on generative theories of language. Strikingly, studies in the literature on prosodic and musical deficits are largely based on presumptions drawn from the more abundant literature on aphasias. If those presumptions are flawed, then a great deal of the latticework upon which studies of neurologically-based deficits in linguistic prosody and the various amusias rest may collapse.

From the stance that any questions regarding the nature of language and music must be empirically tested, how would research regarding speech prosody and song fit into the scheme of the Competition Model? The literature is littered with hasty conclusions and crass simplifications of the nature of music. Music however, no less than language, appears to be a uniquely human attribute. It is ubiquitous across cultures, and throughout known history, and perhaps more primitive phylogenetically. [1] Just as no chimpanzee has spontaneously begun a dialogue on the nature of altruism, no bonobo has ever played so much as a hollow log or a blade of grass. Fruitless analogies between human song and whale or bird song aside, any continuity between human music and the behaviors of other animals is likely to be found in those aspects of human behavior that are common to both music and language. In particular, I would argue that it is in finding the commonalities between speaking and singing that we are likely to find a large part of the gulf that divides humanity from the rest of nature. And in those features, we will understand the cognitive roots that evolutionarily gave rise to both language and culture.

If adaptations that are claimed for language are not domain-specific, we are likely to find further evidence for this in attempting to define the difference between speech and song. Both are human vocal behaviors. Both leave an acoustic signature, and provide imperfect data to the perceptual apparatus of listeners. In each case, the behavior is most often directed towards or for the benefit of other humans, with an intent to express or communicate ideas or emotions. Further, there are cultural differences regarding which cues carry the most relevant information (e.g., rhythm, melody, divisions of the octave, timbre) that can be analyzed and reliably perceived (though in different ways cross-culturally). Each has aspects of grammar and syntax that are more or less clearly definable. Just as the local choice of phoneme sets varies in arbitrary ways, so too aspects of musical vocabulary vary according to seemingly arbitrary choices. Which features of the acoustic signal mark categorical boundaries varies as much for music as it does for language.

However, there are distinct contrasts between these two domains of human behavior. For instance, language contains a lexicon of semantically-grounded words, whereas music can be, and often is, entirely devoid of propositional meaning. The music in song is apart from the meaning of the words, sometimes independent, at times reinforcing, often contradicting. The musical contribution to song serves in a way to replace the natural prosody of speech. But prosodic aspects of speech contain and convey a great deal of information that is outside the grammar and lexicon of language.

In addition, there is some evidence in the literature for a dissociation between spoken prosody (both lexical and affective) and singing. These studies have used a variety of methodologies (experimental and clinical), and have implicated a multitude of brain regions, from left frontal lobe for lexical prosody (Monrad-Krohn 1947; and Buchanan et al 2000), to right temporoparietal regions (Ross & Mesulam 1979; Ross 1981) for affective prosody, to cerebellum and bilateral motor cortex/posterior inferior frontal gyri for dissociations between speaking and nonverbal singing of melody and rhythm (Riecker et al 2000). Clearly a great deal of study remains to be done.

POINTS FOR FUTURE RESEARCH

How is meaning altered when speech is sung? How do the musical aspects of song figure into the calculations of a listener? Can cue validity and cue cost be separately defined in musical terms? Might this provide further evidence for the case that language processing is in large part domain-general? Why is it that some aphasics, unable to utter a word of speech, can sing? Is it merely a matter of defining in finer detail the subtle aspects of these deficits? Is there any evidence to sustain dissociations between speaking and singing in comprehension? If there is, I have not yet found any in the literature. If not, it would be rather strange that the production of song, but not its reception, would dissociate from speech.

Likely the anecdotal evidence is skewed by flawed assumptions. Primarily, the issue is confounded by the fact that no one has sufficiently defined the subject matter under investigation. What does it mean to speak, that is different from what it means to sing? If anecdotal evidence supports the claim that brain damaged individuals are able to engage in one but not another of two similar activities, both including the expression of words by the voice, encoded by means of manipulating pitch, duration, amplitude, and timbre, then we need to understand better how these two behaviors differ. Are they two ends of a continuum, or is there a disjunction that divides up the otherwise shared behavior space? How can these matters be tested empirically?

Difficulty arises even in the simplest stages of such research. For instance, there is the nativist argument that brain structures have evolved solely for speech. However, nowhere in the literature is there a clear definition of speech as a solitary act. In fact, speech, like many human behaviors, is a complex of many parts. Without better definitions of the matter under investigation, claims one way or the other are unfalsifiable. Although the necessary distinction between production and perception is normally stipulated, even accounting for this distinction, the remaining behaviors are not simple acts. The perception of speech for instance involves acoustic input to the ears, sent to the primary auditory cortex. A great deal of calculating must go on, however, before the brain will recognize the auditory input as a meaningful signal. Interestingly, there is evidence that the brain early on recognizes human vocal sounds as special (Belin et al 2000), yet this only serves further to link speaking and singing in their uniqueness as stimuli, rather than to distinguish them from each other.

Here is a hypothetical, if entirely speculative, sequence of events: First there is the segmentation of the signal by sources (the “cocktail party effect”). The signal may likely include not only other voices, but environmental sounds as well, which must be filtered out as irrelevant. Next, the signal is parsed into phonemic units, which are further recalibrated based on context (e.g., coarticulation effects, nasalization). Allowances must be made for dialectal and idiolectal variation, for proper categorization of these sounds. In parallel, there will be processing of pitch, intensity and timing. Calculations will go on to determine which aspects of the pitch are local, some relevant for phonemic categorization and others for lexical prominence, and which are more global, and therefore relevant for affective determinations of attitude or judgments on the encoded meanings. Some allowances must be made for individual differences of voice quality, perhaps based on style of speaking or physiological issues such as hoarseness, or lack of muscular control (dysarthria) due to aging or disease. It becomes quickly clear that to speak of a speech act is a polite fiction, if the implication is that such an utterance can be easily qualified and quantified.

For this reason, many of the deficits that appear to affect specific grammatical or lexical processing may in fact be the result of problems higher along one or another secondary processing pathway. As Bates et al (1998) note: “If we experience two stimuli in exactly the same way, then (by definition) we do not know that they are different.” (p. 599) It follows then that what can be distinguished in normals, or dissociated in pathologies, is somehow different in terms of brain processing. Surely, there are many distinctions that the brain is incapable of noticing (or disinclined to notice). For instance, sharp boundaries do exist in perception for graded acoustic events, such as the categorical boundary for the phonemes /b/ and /p/; and as noted in Bates (in press, p. 8), this appears not to be a species-specific phenomenon. The same is likely true for categorical perception of colors.

The point is: graded phenomena in the world can be perceived as disjunct by living brains. Where brains fail to make a distinction, the phenomena are for our purposes categorically the same. It is by identifying and quantifying the features used by brains that we will come to understand how seemingly equivalent behaviors do in fact differ, likewise how apparently different behaviors may utilize shared processes in the brain. Therefore the task of specifying dissociations is largely a matter of determining the level of processing at which each dissociation occurs. If these levels are consistent across subjects, they can be viewed as universal brain mechanisms (without regard at this point for whether they are innate or emergent). Where they differ, it is likely the result of individual differences (perhaps based in experience or native abilities) or failure to specify the stimuli with sufficient detail. In many cases, the technology for such fine-grained distinctions may not yet exist.

FOOTNOTES

[1] This is a contentious point. Some have argued that music is not universally understood and appreciated by individuals across cultures. Others have noted that not all cultures have a native music. For example Southern Popaluca has been cited in this regard. Southern Popalucan music is all borrowed from Spanish and popular Mexican traditions. On the one hand, such cases may be the exceptions that prove the rule. On the other hand, and more deeply indicative, is the question regarding what features distinguish music from language. Inherent in all spoken languages are manipulations of timing, intonation, and timbre, which are features shared in common between musical and linguistic phenomena. Arguably, even signed languages, while lacking sound, contain similar and analogous features, as has been argued by Sherman Wilcox among others.

REFERENCES

BATES, E. “On the nature and nurture of language.” (in press). In R. Levi-Montalcini, D. Baltimore, R. Dulbecco, & F. Jacob (Series Eds.) & E. Bizzi, P. Calissano, & V. Volterra (Vol. Eds.), Frontiere della biologia [Frontiers of biology]. The brain of homo sapiens. Rome: Giovanni Treccani. [Prepublication version].

BATES, E., DEVESCOVI, A., & WULFECK, B. (2001). Psycholinguistics: a cross-language perspective. Annual Review of Psychology. Chippewa Falls, WI: Annual Reviews.

BATES, E., et al (1998). “Innateness and emergentism.” In W. Bechtel & G. Graham (Eds.), A Companion to Cognitive Science (pp. 590-601). Malden, MA and Oxford: Blackwell Publishers.

BELIN, P., et al. 2000. “Voice-selective areas in human auditory cortex.” Nature 403 (20 January 2000): 309-312.

BUCHANAN, T. W., et al. 2000. “Recognition of emotional prosody and verbal components of spoken language: an fMRI study.” Cognitive Brain Research 9: 227-238.

MONRAD-KROHN, G. H. (1947). “Dysprosody or altered ‘melody of language’.” Brain 70, 405-415.

RIECKER, A., et al. (2000). “Opposite hemispheric lateralization effects during speaking and singing at motor cortex, insula and cerebellum.” NeuroReport 11 (9), 1997-2000.

ROSS, E. D. (1981, Sep). “The aprosodias: Functional-anatomic organization of the affective components of language in the right hemisphere.” Archives of Neurology 38, 561-569.

ROSS, E. D. & MESULAM, M.-M. (1979). “Dominant language functions of the right hemisphere? Prosody and emotional gesturing.” Archives of Neurology 36, 144-148.

Rhythm in Music and Speech

Rhythm appears to be a fundamental capacity of humans. Rhythm plays a role in the prenatal environment and the early socialization of infants (Bertoncini, et al., 1995; Fassbender, 1996; Hargreaves, 1986; Papoušek, 1996). It has been implicated in the coordination of motor activity and locomotion (Iverson & Thelen, 1999). Rhythmic processing is a late deteriorating function in neurodegenerative diseases, such as Alzheimer’s (Beatty, et al., 1999). Rhythm appears to be a basic element in the construction of more complex human behaviors and interactions, such as music and language (Iverson & Thelen, 1999; Patel, et al., 1998), and has been implicated in aspects of memory (Brower, 1993; Payne & Holzman, 1986; Patel, et al., 1998).

A greater understanding of rhythm processing will therefore benefit from joint explorations across these domains of human behavior, in particular in music and language because of their universal presence across cultures and throughout the lifespan. Both music and speech share the same acoustic medium. Both are processed by the same perceptual apparatus. I find it reasonable to assume that the cognitive heuristics used for making sense of music and speech are at least similar, because we lack sufficient evidence to suggest that humans have evolved two entirely different mental modules for music and for language. To the contrary, there is great evidence to suggest that the distinction between music and speech is only achieved at higher levels of processing (Patel, et al., 1998).

There are many aspects of temporal processing that are relevant for this examination, and which necessarily impact an understanding of the subject. Unfortunately, well-formed and agreed upon definitions are in short supply. Paul Fraisse (1982), for instance, has written: “The task of those who study rhythm is a difficult one, because a precise, generally accepted definition of rhythm does not exist.” (p. 149) What’s more, the definitions that occasionally arise lack consistency in what they describe. In an attempt to clarify and tease apart the various aspects of temporal organization, I provide my own definitions of certain aspects, which I trust are neither less nor more arbitrary than most. I make no attempt however to be exhaustive in these definitions, in part because there appear to be many equally valid ways to divide up the temporal domain. I merely seek a first approximation of terms to address those aspects which will most facilitate questions dealing jointly with music and speech.

Denoting the Voice: Text and Context in Music and Language

Jonathan G. Secora Pearl
Fellowship proposal, submitted to the NEH

The Problem

Charles Darwin was wrong, at least about music. In “The Descent of Man,” he wrote: “As neither the enjoyment nor the capacity of producing musical notes are faculties of the least use to man in reference to his daily habits of life, they must be ranked amongst the most mysterious with which he is endowed.” (Darwin, C. 2004 [1879]: 636) One might have expected more, knowing his wife Emma was a fine pianist, who in her youth had studied in Paris with Frédéric Chopin. Generations of scholars, from outside the field of music, have compared it to other human behaviors, and found it lacking, a mere artifice, insubstantial, ornamental, irrelevant. Some have dismissed it as a byproduct of something ostensibly more useful to the species, like language. (Pinker, S. 1997: 528) To hold that music is useless, but that language is not, one must understand how they differ. It is a simple thing to claim they are not alike, but far harder in practice to define the ways. Music and language remain twin aspects of civilization, found in all known human cultures, across time and place, embracing us from our earliest days until the ends of our lives. Speaking and singing are found everywhere and everywhen. Wherein lies the distinction?

The greatest difficulty in answering this foundational question is that we are often deceived by written forms of music and language into believing our object dwells within them, rather than in the sounds that inspire them. On the page, they appear far more distinct than they do in sound. Text without context is a world without air; yet context alone remains the unanalyzable chaos of everyday experience. The trick is to find the balance between too much detail, and too little. Most important is a self-reflective understanding of the specifics regarding what each system captures and what it leaves out. Standard Western music notation gives preference to pitch classes and length, dealing more with intention than with execution. Written language may highlight phonetic details and word order at the expense of intonation and timing. Comparing music and language in these forms is speaking at cross-purposes.

Foreign accent syndrome

[Update pending. Look for review of Kurowski, Blumstein, and Alexander (1996).]

What has been dubbed foreign accent syndrome was first described by Monrad-Krohn in 1947, in a report presenting the case of a woman who suffered a shrapnel wound in WWII that damaged portions of the left hemisphere of her brain. Her ability to produce and comprehend language was mostly spared, except for an odd effect on her speech prosody that others perceived as a foreign accent. In that particular case, sounding German in Oslo just following WWII was not an easy thing.

What must be pointed out however is that no one has ever been reported in the neurological literature to have begun speaking a foreign tongue, whether spontaneously or as a result of head injury. The term foreign accent syndrome, as well as some of the descriptions that have accompanied the term, is a bit of a misnomer, in that it implies that patients with FAS somehow acquire the accent of a particular foreign language. Rather, the perception of hearers is that the prosody is somehow off, leading them to entertain the theory that the speaker is non-native in the language.

Terminology page updated

Polysemy

Polysemy is a term referring to items with multiple meanings, and the ambiguity these multiple meanings create. In language, this describes words that might be interpreted variably in different contexts.

There is a bug crawling on the screen.

The FBI placed a bug in the ambassador’s office.

Caution: these olives have been mechanically processed. The occasional pit may be present.

The trap consisted of branches and leaves, covering a large pit.

I love the colors of the trees in fall.

Be careful not to fall in.

The term could easily be applied as well to the nature of pivot chords in musical modulations. For instance, modulating from the key of c-minor to E-flat-major, one might find a g-minor triad serving as a pivot. The three notes g/b-flat/d are diatonic pitches in both keys. The chord is therefore ambiguous (or polysemous) since it could resolve in various ways. This musical resolution can be considered as similar to the semantic resolution or parsing that takes place in the mind of a language user.
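The ambiguity of a pivot chord can be checked mechanically. Here is a minimal sketch that builds the diatonic triads of two keys and intersects them. The pitch-class numbering (C = 0), the function names, and the use of the natural minor scale are my own assumptions for illustration; with the harmonic minor, the raised seventh degree (B natural) would turn the chord on g into a major triad, and it would no longer be shared with E-flat major.

```python
# Scale patterns in semitones above the tonic.
MAJOR = (0, 2, 4, 5, 7, 9, 11)
NATURAL_MINOR = (0, 2, 3, 5, 7, 8, 10)

def diatonic_triads(tonic, scale_steps):
    """Return the triads built on each scale degree, as sets of pitch classes."""
    scale = [(tonic + step) % 12 for step in scale_steps]
    # Stack every other scale note (root, third, fifth) on each degree.
    return {frozenset(scale[(degree + k) % 7] for k in (0, 2, 4))
            for degree in range(7)}

def pivot_chords(key_a, key_b):
    """Triads that are diatonic in both keys, and so could serve as pivots."""
    return diatonic_triads(*key_a) & diatonic_triads(*key_b)

c_minor = (0, NATURAL_MINOR)         # C = pitch class 0
e_flat_major = (3, MAJOR)            # E-flat = pitch class 3
g_minor_triad = frozenset({7, 10, 2})  # g, b-flat, d

print(g_minor_triad in pivot_chords(c_minor, e_flat_major))  # prints True
```

Because c-minor (in its natural form) and E-flat-major are relative keys with the same seven pitches, every diatonic triad is shared between them. More distant keys share fewer: C-major and G-major, for instance, have four triads in common under this definition.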

Child-directed speech

Child-directed speech (CDS) is the speech of caregivers to infants and children. It is a particular speech register which is characterized by modifications to prosody, and simplifications in lexical and syntactic choices. It is unknown how widespread each of these features is, or to what extent this register is a universal feature of languages. Some cultures have been reported to lack such a register. However, it is likely that aspects of prosodic modification and the simplification of lexical and syntactic choices can be found, even in cultures where these features are subtle.

This register has been known by several names: motherese, infant-directed speech, baby talk.

Anacrusis

Anacrusis is one of those few terms that are shared between musicology and linguistics. The word derives from the Greek meaning an “up-stroke” (Oxford English Dictionary, 1989, s.v. anacrusis). Musically, an anacrusis has been described as “[o]ne or more notes preceding the first metrically strong beat of a phrase; upbeat, pickup” (Randel, 1986, s.v. Anacrusis). Linguistically, the term has similarly been used to refer to a series of unaccented syllables at the beginning of a stretch of speech, normally uttered with a quickened tempo (Chafe, 1994, p. 59; Cruttenden, 1996, p. 21; Du Bois, Cumming, Schuetze-Coburn, & Paolino, 1992, p. 100). It is an open but empirically testable question whether or not anacrusis in music is normally characterized by this same quickening of tempo.

References

Chafe, W. (1994). Discourse, Consciousness, and Time: The flow and displacement of conscious experience in speaking and writing. Chicago & London: The University of Chicago Press.

Cruttenden, A. (1996). Intonation, 2nd Ed. Cambridge, UK: Cambridge University Press.

Du Bois, J., Cumming, S., Schuetze-Coburn, S. & Paolino, D., Eds. (1992). Santa Barbara Papers in Linguistics, v. 4: Discourse Transcription. Santa Barbara: University of California, Santa Barbara, Department of Linguistics.

Randel, D. (1986). New Harvard Dictionary of Music. Cambridge, MA & London: Belknap Press of Harvard University Press.
