The reliability of pause as a cue in speech

A question has recently come up regarding the reliability of pause as a cue for segmenting speech into intonational or semantic meaning groups [1]. A few years ago, I prepared a paper with a colleague, Pentti Haddington, that addressed the unreliability of pause in this context (available as a PowerPoint slide show). In our findings, pause by itself was neither sufficient nor necessary as a cue. Rather, pause sometimes co-occurred with other cues, and it was the conjunction of these cues that served to demarcate segment boundaries.

There is an important distinction to be made between two types of pause: the silent pause, which is perhaps what the term most commonly refers to, and the filled pause. The filled pause can be seen as a lengthening or as a hesitation (each likely with its own causes and meaning). I believe that the filled pause and hesitation are likely more reliable cues. The question then is: how might one automatically extract the acoustic signatures of these cues in order to use them for parsing in speech recognition?
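As a starting point for discussion, here is a minimal sketch of the easier half of the problem: finding silent pauses by thresholding short-time energy. It is not the method from our paper, and every name and default in it (find_silent_pauses, the 10 ms frame, the 0.01 RMS threshold, the 200 ms minimum duration) is an assumption chosen for illustration. It also assumes the samples are floats normalized to the range -1 to 1.

```python
import math

def frame_rms(samples, frame_len):
    """Return the RMS energy of each complete, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    rms = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def find_silent_pauses(samples, sample_rate, frame_ms=10,
                       rms_threshold=0.01, min_pause_ms=200):
    """Return (start_s, end_s) spans where frame RMS stays below the threshold.

    All defaults are illustrative assumptions, not empirically tuned values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    min_frames = math.ceil(min_pause_ms / frame_ms)
    energies = frame_rms(samples, frame_len)
    pauses = []
    run_start = None
    # The infinite sentinel closes a quiet run that reaches the end of the signal.
    for i, energy in enumerate(energies + [float("inf")]):
        if energy < rms_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                pauses.append((run_start * frame_ms / 1000, i * frame_ms / 1000))
            run_start = None
    return pauses

# Synthetic check: 0.5 s of tone, 0.3 s of silence, 0.5 s of tone at 1 kHz sampling.
tone = [0.5 * math.sin(2 * math.pi * 100 * t / 1000) for t in range(500)]
signal = tone + [0.0] * 300 + tone
print(find_silent_pauses(signal, 1000))
```

A detector like this finds only silent pauses. A filled pause or a lengthening carries energy, so it would need other features, such as segment duration or pitch movement, and a silent span found this way would still have to be weighed together with those cues rather than treated as a boundary on its own.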

Is anyone working on these issues?

[1] See, for example, Seligman, M. “Nine Issues in Speech Translation,” Machine Translation, vol. 15, no. 1/2, June 2000, pp. 149-186. This specific issue is discussed in section 5.
