Revising the definition of timbre to make it useful for speech and musical sounds

From CNBH Acoustic Scale Wiki

Category:Perception of Communication Sounds

Roy Patterson, Etienne Gaudrain, Tom Walters, Jessica Monaghan

The text and figures that appear on this page of the wiki were prepared in support of a paper presented at the BSA meeting in York (Sept 2008).

Figure 1: The waveform and spectrum of a child's /a/ vowel

The purpose of this paper is to draw attention to the definition of timbre as it pertains to the vowels of speech and the sustained notes of music, and to the contrast in the way the definition treats the two different forms of acoustic scale information in communication sounds. The two forms are illustrated using a vowel sound, /a/, in Figure 1. The shape of the envelope of the magnitude spectrum (blue line) determines vowel type in speech perception (e.g. /a/ vs /i/), and instrument family in music perception (e.g. strings vs brass). The fine structure of the magnitude spectrum is the set of green lines, which are collectively heard as the pitch of the sound; the position of the fine structure on a log-frequency axis is the acoustic scale of the excitation component of the sound. The position of the spectral envelope on a log-frequency axis is the acoustic scale of the filter component of the sound. It affects the apparent size of a singer (woman vs child), or the apparent size of an instrument within a family of instruments (e.g. trumpet vs trombone, or violin vs cello). The two acoustic scale variables jointly determine our perception of the overall size of the source, and the register of the singer or instrument. The variables are similar in form and have the same units (frequency or wavelength). However, they are treated differently by the definition of timbre: according to the current definition, the position of the fine structure (pitch) is considered to be separate from timbre, whereas the position of the envelope (filter size) is considered to be an aspect of timbre.

In this paper, we review the effects of acoustic scale on the perception of communication sounds and argue that it is important to note the role of the position of the envelope explicitly in our discussions of timbre.

Figure 2: Smith and Patterson (2005)

Timbre

“Timbre is that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar.” [“American standard acoustical terminology” (1960). American National Standards Institute]

Informally, everyone laughs at this definition because it is a hollow shell. Although the authors have provided what purports to be a definition of timbre, they do not seem to know what timbre is; they just know that there are a couple of things that it is not. It is not pitch, it is not loudness, and it is not duration. It is everything else.

Despite the poverty of the ‘definition’, it appears in most popular introductory books on hearing and auditory perception, and aside from the definition, these books actually have rather little to say on the topic of timbre. This is somewhat surprising given how important the concept is in music and speech. Timbre is what distinguishes a trumpet and a violin playing the same note at the same loudness and for the same duration, and timbre is what distinguishes vowels sung by the same person on the same note at the same loudness and for the same duration. It is a very important concept in hearing and it is, perhaps, time to consider revising the definition of timbre, at least as it pertains to communication sounds.

The waveform and spectrum of a child’s /a/

The aspects of timbre that are important in this paper are readily illustrated with the sustained sounds of speech, that is, sustained vowels. The main effects of interest are

The principles are the same for musical instruments that produce sustained notes (see The perception of musical notes and instruments).

The waveform and spectrum of a short segment of a synthetic /a/ vowel, as spoken by a child, are presented in the upper and lower panels of Figure 1 above. The waveform shows that a vowel is a stream of glottal pulses, each of which is accompanied by a decaying resonance that reflects the filtering of the vocal tract above the larynx. The glottal pulse rate (GPR) is 200 pulses per second (pps) in this case, so the time between glottal pulses, which is the ‘period’ of the wave, is 5 ms.
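The period quoted above is simply the reciprocal of the pulse rate. A minimal sketch of that arithmetic follows; the function name `period_ms` is our own label for illustration and does not come from the paper.

```python
def period_ms(gpr_pps: float) -> float:
    """Return the time between glottal pulses, in milliseconds.

    gpr_pps is the glottal pulse rate in pulses per second.
    """
    if gpr_pps <= 0:
        raise ValueError("glottal pulse rate must be positive")
    return 1000.0 / gpr_pps


print(period_ms(200.0))  # the child's vowel in Figure 1: 5.0 ms
print(period_ms(100.0))  # an octave lower: the period doubles to 10.0 ms
```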

The set of green vertical lines in the lower panel of Figure 1 shows the long-term magnitude spectrum of the sound, and the bold blue line connecting the tops of the vertical green lines shows the spectral envelope of the vowel. The peaky structures in the spectral envelope are the formants of the vowel; the shape of the envelope in the spectral domain corresponds to the shape of the damped resonance in the time domain. In many cases, the peak occurs at a position in the spectrum where there is no energy but this does not seem to be a problem for the auditory system (although it is for automatic speech recognition machines).

The changes imposed by development on the waveform and spectrum of a vowel

When children begin to speak, they are about 0.85 m tall, and as they mature their height increases by about a factor of two. Vocal tract length increases with height and so the formant frequencies of the child's vowels decrease by about an octave as they mature. It is also the case that their GPR decreases by about an octave as their vocal cords become longer and more massive. These effects are straightforward examples of the physical fact that large objects vibrate more slowly than small objects; see Patterson et al. (2008) for examples from speech and music.

Waveforms

Figure 3: Waveforms illustrating the effect of acoustic scale on the pitch period and resonance rate of vowel sounds

In speech, the pattern of formants that defines a given vowel type is effectively unchanged as people grow up (Peterson and Barney, 1952; Turner et al., 2009). The effects of growth on the waveform and the spectrum of the /a/ vowel are illustrated in Figure 3 and Figure 4. The vowel of the child is in panel (b) of both figures and that of a large adult is in panel (c). Comparison of panels (b) and (c) in Figure 3 shows that the GPR of the adult is an octave lower than that of the child (the period between pulses is double in the case of the adult). The comparison also shows that the resonance rate of the adult is slower, that is, the intervals between the zero-crossings of the resonance are longer. These are the main effects of the increase in body size as observed in the waveform. Recent research suggests that other factors like the oral-pharyngeal length ratio have little effect on the pattern of formants for a specific vowel (Turner et al., 2009).

Spectra

Figure 4: Magnitude spectra illustrating the effect of acoustic scale on the fine structure and envelope of vowel sounds

The spectra of the vowels of the child and the adult are presented in panels (b) and (c) of Figure 4. The spectra are plotted on a logarithmic frequency scale, and on this scale, the set of harmonics that define the fine structure of the spectrum simply moves, as a unit, towards the origin as the child matures into an adult (by one octave in this example). That is, the dilation of the spectral fine structure that a change in voice pitch produces on a linear frequency axis becomes a simple shift of the pattern on a logarithmic frequency axis. The comparison shows that the spectral envelope also shifts towards the origin without changing shape, and it shifts by the same amount as the fine structure in this example.
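The difference between dilation on a linear axis and a shift on a logarithmic axis can be checked numerically. The sketch below uses the 200 pps and 100 pps pulse rates from the text; the helper names and the choice of five harmonics are illustrative assumptions.

```python
import math


def harmonics(f0_hz, n=5):
    """Frequencies of the first n harmonics of a fundamental f0_hz."""
    return [k * f0_hz for k in range(1, n + 1)]


def log2_positions(freqs_hz):
    """Positions of the components on a log2 (octave) frequency axis."""
    return [math.log2(f) for f in freqs_hz]


child = harmonics(200.0)  # GPR of the child's vowel
adult = harmonics(100.0)  # GPR one octave lower

# On a linear axis each harmonic moves by a different amount (dilation).
linear_steps = [a - c for a, c in zip(adult, child)]

# On a log2 axis every harmonic moves by the same amount (a shift).
log_steps = [a - c for a, c in zip(log2_positions(adult), log2_positions(child))]

print(linear_steps)
print(log_steps)
```

Every entry of `log_steps` is one octave downwards, whereas the entries of `linear_steps` grow with harmonic number.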

The other two panels show that the fine structure and envelope variables are largely independent. Comparison of panel (c) with panel (a) shows that the fine structure can shift down towards the origin under the envelope without affecting the envelope. Comparison of panel (c) with panel (d) shows that the envelope can shift down towards the origin over the fine structure without affecting the fine structure.

In summary, there are two elements to the size information in vowels and other communication sounds, one associated with the fine structure of the spectrum and one with the envelope. They both increase as a child grows up, but they are theoretically independent and they play different roles in the timbre of vowel sounds and musical notes.

The perception of a vowel over the course of development

Now consider the definition of timbre and the question of how we perceive the physical changes that take place in a vowel, like /a/, as a child matures. The logic of the definition of timbre involves specifying the variables of auditory perception that do not affect timbre – chiefly duration, loudness and pitch – and the associated physical variables: time, intensity and frequency. Listen to the first demonstration sound. It is a sequence of eight notes in which the size of the vocal cords and the length of the vocal tract change by up to 25% between notes -- sometimes one changes, sometimes the other, sometimes both. The question is whether there are timbre changes between some of the notes. It is clear that the syllable is the same for all of the notes, so the gross timbre properties have not changed.

Demo 1: a sequence of small changes in the sound of the syllable 'la'.

Sometimes the pitch changes between notes; this is clearly not a timbre change. Sometimes the singer seems to change between notes; the question is whether this is a timbre change.

Duration

Duration is the variable that is most obviously separate from timbre, and it illustrates the logic underlying the definition of timbre. If a singer holds a note for a longer rather than a shorter period, it produces a discriminable change in the sound but it is not a timbre change. Duration has no effect on the magnitude spectrum of a sound (once the duration is well beyond that of the temporal window used to produce the magnitude spectrum). Since the current example involves sustained vowels and the window used to produce the magnitude spectrum is on the order of 50 ms, duration has no effect on timbre in this example. In general, the perceptual change associated with a change in the duration of a note is separable from changes in the timbre of the note.

Intensity and Loudness

If the child or adult puts more effort into their vocalization, the sound becomes louder, and if the intensity of all of the components increases by the same relative amount (the same number of decibels), then the change will be perceived as an increase in loudness. The pitch of the vowel and the timbre of the vowel will be largely unaffected by the manipulation. The increase in intensity produces a change in the magnitude spectrum of the vowel – both the fine structure and the envelope shift vertically upwards – but there is no change in the frequencies of the components of the fine structure and there is no change in the relative amplitudes of the harmonics, or the shape of the spectral envelope. So, loudness is not a part of timbre, or loudness is separable from timbre. To wit, if you change the volume control when playing the audio demos on this page, it does not change the timbre of the individual sounds.
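The claim that a uniform decibel increase leaves the relative amplitudes of the harmonics unchanged can be illustrated with a short sketch. The harmonic amplitudes and the 6 dB gain below are hypothetical values chosen for illustration, not measurements from the vowels in the figures.

```python
import math


def apply_gain_db(amplitudes, gain_db):
    """Scale every component by the same number of decibels."""
    factor = 10.0 ** (gain_db / 20.0)  # amplitude ratio for a dB gain
    return [a * factor for a in amplitudes]


def level_db(amplitude, reference):
    """Level of a component relative to a reference amplitude, in dB."""
    return 20.0 * math.log10(amplitude / reference)


harmonic_amps = [1.0, 0.5, 0.8, 0.2]  # hypothetical harmonic amplitudes
louder = apply_gain_db(harmonic_amps, 6.0)

# Levels relative to the first harmonic, before and after the gain.
before = [level_db(a, harmonic_amps[0]) for a in harmonic_amps]
after = [level_db(a, louder[0]) for a in louder]

print(before)
print(after)
```

The two lists agree to within rounding error: the whole spectrum has moved up, but its shape is the same.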

Repetition Rate and Pitch

The pitch of a vowel or a musical note is the psychological correlate of the repetition rate of the waveform (Helmholtz, 1875), or the frequency spacing of the harmonics, or the fundamental of the harmonic sieve that best fits the magnitude spectrum, or the acoustic scale of the fine structure of the spectrum. If a person sings a syllable twice and changes the tension of their vocal folds between the first and second syllable, the repetition rate changes, and we hear a change in the voice pitch. But there is no change in syllable type. The definition of timbre indicates that the shifting of the spectral fine structure associated with the change in pitch does not produce a change in the timbre of the sound, which seems reasonable since there is no change in syllable type. So, just as repetition rate, or harmonic spacing, is separable from spectral envelope shape, so pitch is largely separable from timbre. Note, however, that the harmonics change their relative amplitude as they shift along under the envelope, and the changes in relative amplitude can be quite substantial as the harmonics pass through the region of a formant.
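The caveat at the end of the paragraph above, that harmonic amplitudes change as the harmonics slide under a fixed envelope, can be sketched as follows. The envelope here is a hypothetical single Gaussian peak on a log-frequency axis (centre 700 Hz, width 0.3 octaves); these values are assumptions for illustration and are not the formants of the vowels in the figures.

```python
import math

FORMANT_HZ = 700.0   # hypothetical single formant peak
WIDTH_OCT = 0.3      # hypothetical width of the peak, in octaves


def envelope(freq_hz):
    """Hypothetical spectral envelope: one Gaussian peak on a log2 axis."""
    distance_oct = math.log2(freq_hz / FORMANT_HZ)
    return math.exp(-0.5 * (distance_oct / WIDTH_OCT) ** 2)


def harmonic_amplitudes(f0_hz, n=8):
    """Amplitudes of the first n harmonics, sampled from the fixed envelope."""
    return [envelope(k * f0_hz) for k in range(1, n + 1)]


high = harmonic_amplitudes(200.0)  # higher-pitched version of the vowel
low = harmonic_amplitudes(100.0)   # same envelope, pitch an octave lower

# The third harmonic sits at 600 Hz in one case and 300 Hz in the other,
# so its amplitude differs even though the envelope has not moved.
print(high[2], low[2])

# The envelope value at 600 Hz is unchanged: it is sampled by harmonic 3
# of the 200 pps sound and by harmonic 6 of the 100 pps sound.
print(high[2], low[5])
```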

The second audio demonstration illustrates the vocal changes that occur as a man with a long vocal tract sings an eight-note melody, during which the pitch drops by an octave from about 200 to 100 pulses per second. This descending melody is within the normal range for a tenor, and the melody sounds reasonably natural.

Demo 2: Melody descending by an octave over the course of eight notes, with a tall person singing

As the melody proceeds, the position of the spectral envelope does not change, as in the bottom row of Figure 4. The fine-structure proceeds to move to the left over the course of the melody, and the shift in position is like the shift that occurs between panels (d) and (c) of Figure 4. The definition of timbre indicates that this relatively large change in pitch does not produce a change in the timbre of the sound, and this seems a reasonable description of our perception of the melody in this case.

The effect is somewhat different if a small child sings the melody with the same pitch values; that is, if we shift the spectral envelope up an octave and excite the corresponding vocal tract transfer function with the same sequence of pulse trains as in the previous example.

Demo 3: Melody descending by an octave over the course of eight notes, with a short person singing

During the first three notes of the melody, we are inclined to hear a simple change in the pitch of the child's voice. The starting pulse rate is low for the voice of a small child but not outside the normal range. As the melody proceeds, however, and the pitch decreases by a full octave, the voice quality seems to change and the child comes to sound more like a dwarf. The definition of timbre indicates that large shifts in pitch do not produce timbre changes, even when they produce changes in the apparent source of the sound. This would appear to say that changes in the apparent source of a sound are not timbre changes, when they do not produce a change in the spectral envelope of the sound. This would be a problem for the current definition of timbre, because changes in who is heard to be singing are normally regarded as timbre changes. If a woman and a child sing a syllable on the same note and at the same loudness, the change in source is readily discriminable and it is considered to be a timbre change.

Spectral envelope shape and timbre

The definition of timbre does not say anything specific about how changes in the spectral envelope affect timbre, but it gives the impression that any change in the spectral envelope that produces an audible change in perception, other than a loudness change or a pitch change, produces a change in timbre. The definition seems perfectly reasonable, but it means that a simple shift in the envelope of the magnitude spectrum will, by definition, produce a change in timbre. This seems at odds with what we hear. As a child grows up, the length of the vocal tract increases linearly with their height and the frequencies of the formants of their vowels decrease in inverse proportion. For a change in height of more than about 5%, we hear a change in the size of the singer, but there is no change in vowel type.
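The inverse proportionality between vocal-tract length and formant frequency can be sketched as a uniform scaling. The formant values below are hypothetical placeholders, and the 10 cm and 20 cm lengths are the approximate ends of the VTL range used for the demonstrations; strictly uniform scaling is an idealisation of the growth described in the text.

```python
import math


def scale_formants(formants_hz, vtl_old_cm, vtl_new_cm):
    """Scale formant frequencies in inverse proportion to vocal-tract length."""
    ratio = vtl_old_cm / vtl_new_cm
    return [f * ratio for f in formants_hz]


def envelope_shift_octaves(vtl_old_cm, vtl_new_cm):
    """Shift of the spectral envelope on a log2 axis (negative = downwards)."""
    return math.log2(vtl_old_cm / vtl_new_cm)


child_formants = [1000.0, 1800.0, 3600.0]  # hypothetical formant frequencies
adult_formants = scale_formants(child_formants, 10.0, 20.0)

print(adult_formants)                        # every formant is halved
print(envelope_shift_octaves(10.0, 20.0))    # one octave downwards
print(envelope_shift_octaves(10.0, 10.5))    # shift for a 5% increase in length
```

Because every formant is multiplied by the same factor, the ratios between formants, and hence the shape of the envelope on a logarithmic axis, are preserved.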

The fourth demonstration illustrates the change in perception that occurs as the child's vocal tract increases by a factor of two to the length appropriate for an adult, using a sequence of length ratios that have the same numerical values as the sequence of GPR ratios used to produce the melody for demos 2 and 3.

Demo 4: Vocal-tract length increasing by a factor of two over the course of eight notes

As the envelope shifts by an octave, the child seems to get larger and the voice comes to sound something like that of a countertenor, that is, a tall person with a high pitch. The definition of timbre suggests that the change in the spectrum that produces the perceived change in the size of the singer has produced a change in the timbre of the sound, because it is not a change in pitch and it is not a change in loudness. This does not, immediately, seem reasonable inasmuch as there is no change in vowel type. It seems that we may need to add another dimension to our model of auditory perception, namely, the perceived size of the source, and agree that shifts in the spectral envelope that preserve envelope shape (and pitch and loudness) produce a change in perceived size rather than timbre. In other words, source size is perceptually separable from the characteristic sound of the source, and if timbre means the characteristic sound of the source, then timbre is separable from source size in auditory perception.

Note that during this fourth demonstration, the harmonics change their relative amplitude as the envelope shifts past them, and the changes in relative amplitude can be quite substantial as a formant peak passes any given harmonic. It is only the shape of the spectral envelope that remains fixed in this demonstration.

Timbre and Acoustic Scale

The GPR-VTL plane with Musical Notation

A schematic, musical representation of the notes used to create the demonstrations is presented in Figure 5. The abscissa is the GPR of the note, or the position of the spectral fine structure; the ordinate is the position of the spectral envelope, or the acoustic scale of the envelope -- a variable that is closely related to the vocal-tract length (VTL) of the singer. Both dimensions are logarithmic, in the sense that the notes specify the ratios, or interval relationships, between the notes on their respective dimensions. The notes were all synthesized from the voice of an adult male with a VTL of about 16.5 cm and an average GPR of about 120 pps, so the range of the GPR octave is from about 100 to 200 pps, and the range of the VTL change, or 'VTL octave,' is from about 20 to 10 cm. The note corresponding to the original singer is [E,E] on this version of the GPR-VTL plane.

Figure 5: The GPR-VTL plane with musical notation

The second demonstration, where the GPR of the voice drops an octave over the course of the melody and the singer is a tall person, was produced with the notes in the 'E' row of the plane, specifically [{C, A, G, E; G, E, D, C}, E].

The third demonstration, where the GPR of the voice drops an octave over the course of the melody and the singer is a small person, was produced with the notes in the upper 'C' row of the plane, specifically [{C, A, G, E; G, E, D, C}, C].

The fourth demonstration, where the VTL of the voice increases by a factor of two (so the spectral envelope drops an octave) over the course of the melody, was produced with the notes in the right-hand 'C' column of the plane, specifically [C, {C, A, G, E; G, E, D, C}].

The first demonstration was more complicated, involving changes in both GPR and VTL; the sequence of notes was (G,G), (E,G), (E,E), (F,F); (G,G), (G,E), (E,E), (F,F). So the two phrases take two different routes from (G,G) to (E,E), and then both move to (F,F) for the final note of the phrase.
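The note names above can be turned into approximate pulse rates and vocal-tract lengths. The sketch below assumes equal-tempered semitone steps, which is our assumption: the text specifies only the note names and the approximate 100 to 200 pps and 20 to 10 cm ranges, not the tuning used for the demonstrations. The labels `C_top` and `C_low` are ours, to distinguish the two ends of the octave.

```python
# Semitones below the top C for each note in the melody (equal temperament
# is assumed here; the tuning of the demonstrations is not stated).
SEMITONES_BELOW_TOP_C = {"C_top": 0, "A": 3, "G": 5, "E": 8, "D": 10, "C_low": 12}

# Melody used for Demos 2 to 5: {C, A, G, E; G, E, D, C}.
MELODY = ["C_top", "A", "G", "E", "G", "E", "D", "C_low"]


def gpr_pps(note, top_pps=200.0):
    """Approximate glottal pulse rate for a note on the GPR axis."""
    return top_pps * 2.0 ** (-SEMITONES_BELOW_TOP_C[note] / 12.0)


def vtl_cm(note, shortest_cm=10.0):
    """Approximate vocal-tract length for a note on the VTL axis."""
    return shortest_cm * 2.0 ** (SEMITONES_BELOW_TOP_C[note] / 12.0)


for note in MELODY:
    print(f"{note:6s} GPR ~ {gpr_pps(note):6.1f} pps   VTL ~ {vtl_cm(note):5.1f} cm")
```

Under this assumption the 'E' note falls at roughly 126 pps and 15.9 cm, in the neighbourhood of the original singer's values of about 120 pps and 16.5 cm quoted above.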

During development, there is a strong correlation between the growth of the vocal cords and the growth of vocal tract length, and the voices along the diagonal in the GPR-VTL plane sound more natural than those in any one row or column. The effect is illustrated in Demo 5 in which the melody is played along the diagonal with the two variables changing in step.

Demo 5: Melody descending by an octave along both dimensions over the course of eight notes

The manipulations are heard to reinforce each other. The sequence has a melody that descends an octave, and there is a progressive increase in the perceived size of the singer, with one momentary reversal at the start of the second phrase.

Acoustic scale and the position of the fine structure and envelope

In acoustic terms, 'the position of the fine-structure of the spectrum on a logarithmic frequency scale' is the acoustic scale of the sound produced by the source of excitation, which is the vocal folds in the case of humans. For brevity, it will be referred to as ‘the scale (S) of the source (s)’ and designated Ss. Similarly, in acoustic terms, 'the position of the envelope on a logarithmic frequency scale' is the acoustic scale of the filter in the vocal tract above the larynx that produces the vocal tract resonance. For brevity, it will be referred to as ‘the scale (S) of the filter (f)’ and designated Sf. Acoustic scale is a physical property of a sound as it occurs in the air between the singer and the listener. In the case of the source of excitation, it is the wavelength corresponding to the period of the glottal oscillation. This acoustic variable stands in contrast to the physiological variables of the vocal folds that cause them to vibrate at a specific rate, like their mass, length and tension. It also stands in contrast to psychological variables like the pitch that we perceive in response to a wave with a given glottal period. It is important to identify this variable when modelling the perception of communication sounds because Ss is the information that the auditory system extracts from the sound wave, and the information on which the pitch of the perception is based.
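The wavelength corresponding to the glottal period follows from the speed of sound. The sketch below assumes a speed of sound of 343 m/s (air at roughly 20 °C); that value and the function name are our assumptions, not figures from the paper.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at about 20 degrees C


def source_wavelength_m(gpr_pps, c=SPEED_OF_SOUND_M_PER_S):
    """Wavelength in air corresponding to one glottal period, in metres."""
    period_s = 1.0 / gpr_pps
    return c * period_s


print(source_wavelength_m(200.0))  # upper end of the GPR octave
print(source_wavelength_m(100.0))  # lower end: the wavelength doubles
```

An octave drop in pulse rate doubles the wavelength, which is the sense in which Ss grows with the size of the source.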

The plane of notes in Figure 5 was introduced as the GPR-VTL plane, and this is a useful description when thinking about the source of the notes. The labels on the axes, however, are musical notation for the GPR ratios of the notes of the diatonic musical scale, which is a psychological description of the meaning of the intervals between notes to the cognitive processing of music. The appropriate description of the acoustic variables in the sounds that represent the notes is 'the scale of the source' rather than GPR or 'E' on the abscissa, and 'the scale of the filter' rather than VTL on the ordinate. Thus, in acoustic terms, the GPR-VTL plane is the Ss-Sf plane.

Acoustic scale is useful in explaining the perception of source size and the interaction of GPR and VTL in the generation of the perceived size of the singer, which is why it is introduced at this point in the discussion. The scale of the source and the scale of the filter both increase with body size and they interact to produce our perception of the size of the singer (Smith and Patterson, 2005). Broadly speaking, perceived size is largest in the region of the diagonal in the GPR-VTL space of acoustic scale; from any point along the diagonal, perceived size gets smaller as you go up or to the right, and it gets larger as you go down or to the left.

Neither of these manipulations produces a change in the message of the sound – it is a ‘la’ sung by a human no matter what the combination of GPR and VTL. Accordingly, it is argued that timbre would better be defined as the shape of the spectral envelope on a logarithmic axis independent of its position on the axis, with both of the acoustic scale dimensions, Ss and Sf, being regarded as separate from timbre.

The fact that the two aspects of acoustic scale interact in determining source size, with an increase in one being counteracted by a decrease in the other, is another reason for arguing that the scale of the filter should be treated like the scale of the source and excluded from the definition of timbre.

Whispered vowels and the Dual Profile

There are several more stimuli that assist in delineating the relationship between the acoustical properties of sound and the perception of its timbre -- specifically, whispered vowels and sinusoidal vowels.

Whispered speech and acoustic scale

Consider what is meant by the timbre of whispered speech and the effects of acoustic scale on the perception of whispered speech. Listen to Demo 6. It contains a sequence of eight whispered 'notes' with the same temporal structure as in the previous demonstrations. In this case, the word 'note' is being used as it would for the beats of a drum. We can distinguish something about the size of the drum from the sound but the sequence of notes does not convey a melody.

Demo 6: A sequence of whispered 'notes'

There is a series of perceptual changes over the course of the sequence, and the voice at the end is distinctly different from the voice at the start, as in previous demonstrations. It is also the case that all of the 'notes' are perceived to be versions of the same syllable as in the previous demonstrations. So something about the timbre of these whispered notes is the same as it was for the voiced notes. But the whispered and voiced notes sound decidedly different and the difference seems to be one of timbre.

In whispered speech, the vocal tract filter is excited by a noise source rather than a stream of glottal pulses. The turbulent noise that is the source of the energy in the sound has an acoustic scale but the value is not well marked the way it is when the syllable is voiced. Moreover, the scale of the source does not change from note to note. So, the notes in the whispered 'la' sequence do not have pitch in the sense of an identifiable pulse rate, and the sequence of notes does not form a melody in the sense of an identifiable sequence of GPR ratios, or Ss values.

The noise produced by the excitation source in the larynx is filtered as it passes through the vocal tract and so the spectral envelope exhibits resonant peaks similar to those for the corresponding segment of voiced speech, and this is, arguably, why we hear the syllable 'la' in response to each of the notes in the sequence. In Demo 6, Sf, the acoustic scale of the filter, shifts by an octave over the course of the sequence, and it is these changes in Sf that produce the sequence of changes we hear in the perception of the sequence. This suggests that what we are perceiving is a sequence of changes in singer size, rather than a musical melody, and this hypothesis seems compatible with the perception of the sequence.

Listen to Demo 6 again. Compare it with Demo 7 and think about the following questions:

  1. What is the direction of the perceptual progression in Demo 7?
  2. Is the direction the same as in Demo 6?
  3. What is the difference between the two demonstrations?

Demo 6: A sequence of whispered 'notes'

Demo 7: A second sequence of whispered 'notes'

The answers to the questions would appear to be:

  1. The source gets larger as the progression proceeds in Demo 7, and it is obvious.
  2. The direction is the same in the two demonstrations and it is obvious.
  3. One of the notes is different in Demo 7, but it is not obvious which one or what the difference is.

Although the answers to the first two questions are simple, they show, nevertheless, that the acoustic scale of the filter is perceived as a dimension of sound, like the acoustic scale of the source, and the direction of the dimension is not arbitrary. This is compatible with the argument that we perceive Sf in terms of the size of the singer and there is something basic about the perception. All people hear it the same way, and a change of an octave in Sf is perceived as a large change in singer size. No practice is required to understand the direction of the dimension. The scale of the filter seems to be a primitive aspect of auditory perception, like the scale of the source.

The answer to the third question is not at all obvious. It would be if the stimulus were voiced speech and Ss were being manipulated instead of Sf. Then you would have heard that the last note of the first phrase in Demo 7 is F, whereas in all of the previous demonstrations it was E. In a musical melody this would be a substantial change. The fact that it is not obvious in the Sf sequence indicates that the perception of Sf is not one of musical pitch.

The examples involving whispered speech add support to the hypothesis that the shape of the spectral envelope determines the timbre of vowel sounds, in the sense of determining vowel type, and the acoustic scale of the vocal tract filter affects the perceived size of the speaker but not the timbre of the vowel. At the same time, the examples indicate that there is another dimension to timbre that involves the temporal regularity of the excitation source, and the properties of timbre associated with the excitation source are perceptually distinct from those associated with the resonant filter. Moreover, in both cases, the acoustic property of scale, be it Ss or Sf, is perceived to be separate from the timbre of the sound.


The dual profile representation of timbre and acoustic scale

Ives et al. (2005) have described a ‘dual profile’ representation of the information in the neural activity pattern (NAP) produced by a sound in the auditory nerve. The distribution of activity along the tonotopic dimension of the cochlea is summarized in a spectral profile and the distribution of time intervals in the neural activity is summarized in a temporal profile. The dual profile is a combination of the two in which both are plotted on the same logarithmic frequency axis and adjusted to have roughly the same size. The concept of the dual profile was originally described by Bleeck and Patterson (2002). Both Ives et al. (2005) and Bleeck and Patterson (2002) used the dual profile to compare the relative value of spectral and temporal information as bases for the calculation of the pitch of the sound. The dual profile can assist in explaining the perception of whispered vowels and in developing a more useful model of timbre, in which acoustic scale is segregated from timbre itself.

Demo 2: Changing Ss

Figure 6: The Dual Profile: variation in Ss

The top panel of Figure 6 presents the dual profile for the first note in Demo 2 (the melody descending by an octave over the course of eight notes, with a tall person singing). The spectral profile is the solid blue line; the temporal profile is the green line. The abscissa is the tonotopic axis of auditory perception. The remaining three panels of the figure show the dual profiles for the fourth, fifth and eighth notes of the melody. So the figure shows the dual profiles for the first and last notes of the first phrase of the melody (upper two panels), and the dual profiles for the first and last notes of the second phrase of the melody (lower two panels). Remember that, while the pitch of the notes descends monotonically in both phrases, the first note of the second phrase is higher than the last note of the first phrase.

Demo 2: Melody descending by an octave over the course of eight notes, with a tall person singing

It is argued that the envelope of the spectral profile determines those aspects of the timbre associated with the resonance filter, for example, the species of the sender for an animal call, vowel type for speech, and instrument family for music. The acoustic scale of the filter, Sf, determines the position of the activity in the spectral profile along the tonotopic axis. The temporal profile determines those aspects of the timbre associated with the excitation source, for example, whether a vowel is voiced or whispered. When the temporal profile exhibits a series of distinct peaks decreasing towards lower frequencies, the sound has a strong pitch. The acoustic scale of the source, Ss, determines the position of the activity in the temporal profile along the tonotopic axis. The position of the largest peak specifies the pitch we hear.

In Demo 2 (the melody sung by a tall person), the resonance filter is fixed, and so the distribution of activity in the spectral profile is broadly similar in the four panels of Figure 6. It is the GPR that changes between notes of the melody, and it is the series of peaks in the temporal profile that shift from panel to panel in Figure 6. The position of the largest peak shows the pitch of the melody at the start and end of each phrase.

Demo 4: Changing Sf

Figure 7: The Dual Profile: variation in Sf

In Demo 4, the pitch is fixed and the size of the singer is perceived to increase over the course of the melody. The acoustic scale of the filter, Sf, changes according to the sequence of interval ratios in the melody. In Figure 7, the distribution of activity in the spectral profile is observed to shift from panel to panel, in much the same way as the temporal profile shifted across the panels of Figure 6. So, the position of the spectral envelope, Sf, mirrors the increase in singer size over the course of each phrase of the melody and the fact that the singer at the start of the second phrase is smaller than the singer at the end of the first phrase.

Demo 4: Vocal-tract length increasing by a factor of two over the course of eight notes

Demo 5: Coordinated manipulation of Ss and Sf

Figure 8: The Dual Profile: variation in Ss and Sf

In Demo 5, the pitch is perceived to descend by an octave while the singer increases substantially in size over the course of the melody. In Figure 8, the dual profiles of the notes show that the temporal profile shifts from panel to panel, in the same way as it does in Figure 6, and the spectral profile shifts from panel to panel, in the same way as it does in Figure 7. That is, the shapes of the temporal and spectral profiles describe the properties of the timbre of the sound associated with the excitation source and resonance filter, respectively, while the position of the activity in the temporal and spectral profiles reflects the acoustic scale of the excitation source and resonance filter, respectively.

Demo 5: Melody descending by an octave along both dimensions over the course of eight notes

The dual profile representation of whispered vowels

Figure 9: The Dual Profile: variation in Sf with whispered speech

The dual profiles for the initial and final notes of the two phrases of the whispered-speech demonstration, Demo 6, are presented in Figure 9. The whispered speech demonstrations (6 and 7) showed that the shape of the spectral envelope determines the timbre of whispered vowels, and the acoustic scale of the vocal tract filter affects the perceived size of the speaker but not vowel type. The spectral profiles in the four panels of Figure 9 have the same shape and position as their counterparts in Figure 6, which shows the dual profiles for the voiced version of the same demonstration. Thus, the dual profile can explain the perception of vowel type and the changes in source size associated with the scale of the filter, Sf. The whispered speech demonstrations also revealed that there is another dimension to timbre which involves the temporal regularity of the excitation source, and that the properties of timbre associated with the excitation source are perceptually distinct from those associated with the resonant filter. The temporal profiles in Figure 9 do not have the series of peaks characteristic of voice pitch; the shape of the profile is essentially random and it changes from panel to panel. Thus, the temporal profile would appear to provide a useful description of the pitch of whispered vowels, inasmuch as it provides a clear indication that there is no pitch and that the sound produced by the source is constantly changing, as is the case with noise.

In summary, the dual profile would appear to provide the basis of a much more useful description of the pitch and timbre of communication sounds than the ANSI definitions of pitch and timbre. Specifically, the dual profile provides an integrated representation of the information in the sound, in which the components associated with distinct aspects of the perception are represented separately, as properties of one of the two component profiles. Vowel type appears in the shape of the envelope of the spectral profile; the acoustic scale of the filter appears as the position of the activity in the spectral profile. When the sound has a pitch, it appears as a distinct pattern of peaks in the temporal profile. The acoustic scale of the source appears as the position of the pattern in the temporal profile; the relative height of the peaks specifies the pitch strength, or the degree of voicing. Moreover, when one of these basic properties changes in the sound and in the perception, the change in the dual profile is largely confined to the corresponding property of the appropriate profile.

Sinusoidal vowels and acoustic scale

If the frequencies and amplitudes of three sinusoids are set to those of the formants of a vowel, the resulting synthetic sound can, under certain circumstances, convey vowel-type information, despite the fact that the sound quality is distinctly different from that of a natural vowel. These synthetic vowels are referred to as 'sinusoidal vowels' (Remez et al., 1981; Remez et al., 2001) and they can assist in illuminating the relationship between the physical and acoustical properties of sounds, on the one hand, and their timbre, on the other hand.

Listen to Demo 8, in which a sinusoidal vowel is used to play through the sequence of ratios used to construct Demos 2-6. In the starting note, the three sinusoids have the same frequencies as the formants of the first vowel in Demo 5. Then, between the first note, C, and the second note, A, the frequencies of the 'formants' in the vowel are all reduced by three semitones, and so on for each step in the rest of the sequence.

Demo 8: A sequence of sinusoidal vowels with the same sequence of ratios as in Demos 2-6

The sequence produces the perception of a strange instrument with a rather shrill timbre playing a version of the melody heard in Demos 2-6, but in a different key. The fact that the three sinusoids come on and go off together repeatedly causes them to be grouped together in the perception. This, combined with the fact that we hear a tonal melody, is probably what gives us the impression that we are listening to a mechanical source that is intended to be some kind of musical instrument. The sinusoids do not form a harmonic series, and it is this which imparts the shrill component to the timbre.

Now, compare the timbre of the sinusoidal vowels to the natural vowels in Demo 5, and compare the pitch ranges of the two sequences in the sense of the octave in which they occur.

Demo 5: A melody descending by an octave along both scale dimensions over the course of eight notes

It is not immediately clear what the vowel type is, despite the fact that it is presented eight times in the sequence, and it is not immediately clear whether the source is increasing in size as the sequence proceeds. In short, our model of the source of this sound is vague. The vowel /a/ was immediately clear in both the voiced vowels and the whispered vowels of the previous demonstrations, and it was clear that the last 'la' was sung by a much bigger person than the first 'la'.

Figure 10: Acoustic scale in sinusoidal vowels

The dual profile representation of pitch and timbre can be used to relate the properties of the perception produced by this 'instrument' to the acoustical properties of the sound. There is no excitation source associated with the production of these sinusoidal vowels, and so the amplitudes of the sinusoids are not modulated and the vowels do not have the voice pitch characteristic of natural vowels. This is reflected in the temporal profiles of the sinusoidal vowels (Figure 10); the distinctive pattern of peaks produced by communication sounds is absent from the temporal profiles. The melodic component of the perception appears to be produced by the sinusoid that marks the position of the lowest formant of the vowel. Its frequency in the starting note of Demo 8 is 1192 Hz (the second and third sinusoids start at 1780 Hz and 3985 Hz, respectively). The GPR of the starting note in Demo 5 is 196 pps, about two and a half octaves below that of the sinusoidal pitch. The fact that the three 'formants' of the sinusoidal vowel come on and go off together causes them to be grouped together in the perception. It is not clear, however, from auditory theory why the pitch component of the perception is dominated by the lowest sinusoid, nor why the upper formants, which are entirely resolved, are not heard as separate from the first formant. The upper two sinusoids are not harmonics of the lowest sinusoid, so when they are combined with the lowest sinusoid to produce the timbre of the sound, they give the timbre an 'edge' or shrillness.
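The two numerical claims in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the values quoted in the text (196 pps, and 1192, 1780 and 3985 Hz); the variable names are illustrative and not part of the original analysis.

```python
import math

GPR_DEMO5_START_PPS = 196.0                         # starting note of Demo 5
SINUSOIDS_DEMO8_START_HZ = [1192.0, 1780.0, 3985.0]  # starting note of Demo 8

lowest = SINUSOIDS_DEMO8_START_HZ[0]

# Interval between the lowest sinusoid and the voiced GPR, in octaves.
octaves_above_gpr = math.log2(lowest / GPR_DEMO5_START_PPS)

# If the upper sinusoids were harmonics of the lowest one, these ratios
# would be (close to) integers.
ratios_to_lowest = [f / lowest for f in SINUSOIDS_DEMO8_START_HZ[1:]]
distance_from_harmonic = [abs(r - round(r)) for r in ratios_to_lowest]

print(round(octaves_above_gpr, 2))              # roughly 2.6 octaves
print([round(r, 3) for r in ratios_to_lowest])  # roughly 1.493 and 3.343
```

The interval comes out at a little over two and a half octaves, and neither ratio is close to an integer, which is consistent with the statement that the upper sinusoids are not harmonics of the lowest one.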

The manipulation that produces the sequence of notes causes the three spectral lines to shift, as a unit on a logarithmic frequency scale, in one direction or the other as dictated by the sequence of ratios that defines the melody. So, the manipulation is similar to the acoustic scale shifts described above involving the spectral fine structure and spectral envelope. The manipulation is, in fact, an acoustic scale shift, in the sense that the manipulation is like changing the sampling rate of a recording of three sinusoids; their frequencies are scaled by the ratio of the sampling rates. However, there is no excitation mechanism associated with the production of the vowels (as noted above), and so the vowels do not have an Ss value of the form defined above for the excitation source of a communication sound. Similarly, there is no spectral envelope in the usual sense of the word, and no Sf value of the form defined above for the resonant filter of a communication sound. The magnitude spectrum is composed of three spectral lines with no information to indicate how the envelope should be filled in between the line components. The spectral profile is continuous between the sinusoids, but this function simply reflects the shape of the auditory filters used to create the profile, and the two-tone suppression that sharpens the filter output.
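The sampling-rate analogy can be sketched numerically. The example below synthesises the three sinusoids quoted above for the starting note of Demo 8, then treats the same samples as if they were played back at a rate lowered by three semitones. The recording rate of 16 kHz, the equal-tempered ratio 2^(-3/12) and the function names are assumptions made for illustration; the demos themselves may have been produced differently.

```python
import math

def synthesize(freqs_hz, fs_hz, n_samples):
    """Sum of unit-amplitude sinusoids sampled at fs_hz."""
    return [sum(math.sin(2 * math.pi * f * n / fs_hz) for f in freqs_hz)
            for n in range(n_samples)]

def magnitude_at(samples, fs_hz, freq_hz):
    """Amplitude of the component at freq_hz when samples are read at fs_hz."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / fs_hz)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / fs_hz)
             for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)

FS_RECORD = 16000.0            # hypothetical recording rate (Hz)
RATIO = 2 ** (-3 / 12)         # three semitones down, equal temperament assumed
FS_PLAY = FS_RECORD * RATIO    # the same samples, read at a lower rate

original = [1192.0, 1780.0, 3985.0]
samples = synthesize(original, FS_RECORD, 1600)

# Every frequency is scaled by the ratio of the two sampling rates.
scaled = [f * RATIO for f in original]
at_scaled = [magnitude_at(samples, FS_PLAY, f) for f in scaled]
at_original = magnitude_at(samples, FS_PLAY, original[0])

# Spacing of the lines on a log-frequency axis, before and after the shift.
spacing_before = [math.log2(f / original[0]) for f in original]
spacing_after = [math.log2(f / scaled[0]) for f in scaled]
```

Under these assumptions the energy is found at the scaled frequencies rather than the original ones, and the spacing of the three lines on a log-frequency axis is unchanged, so the pattern moves as a unit.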

The perception of the vowel-type information in sinusoidal vowels can be brought to the fore by contrasting vowel type within a sequence, while maintaining a fixed VTL. For example, listen to the randomly ordered set of sinusoidal vowels in Demo 9.

Demo 9: A sequence of different sinusoidal vowels with fixed VTL

After a few notes, the brain focuses on the contrast between notes and the vowel-type component of the perception comes to the fore. However, the source does not sound like a human with a specific size, and it does not sound like the source is changing size in the way it does in the voiced and whispered demonstrations. Note also that the perception includes an odd melodic component, which seems to mean that we are hearing the sinusoids jump around between vowels.

The effect of tension on the scale of the source

The acoustic scale of the fine structure, Ss, is also affected by the tension of the vocal folds, and it is, of course, the tension that singers vary to change the pitch of their voice and produce a melody. So it is the average pitch of a singer or speaker over the course of a song or a paragraph that provides information about the size of the person; it is a global property of the individual's voice. Local changes in pitch within the song or the paragraph are heard as melody or prosody.
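The distinction between the global and local components of pitch can be sketched by separating a GPR contour into its average level on a log-frequency axis and the note-by-note deviations from that level. The contour values and the function name below are hypothetical, chosen only to illustrate the separation; they are not taken from the demonstrations.

```python
import math

def split_global_local(gpr_contour_pps):
    """Return (average level in pps, per-note deviations in semitones)."""
    logs = [math.log2(f) for f in gpr_contour_pps]
    mean_log = sum(logs) / len(logs)
    global_level_pps = 2 ** mean_log          # geometric mean of the contour
    local_semitones = [12 * (v - mean_log) for v in logs]
    return global_level_pps, local_semitones

# A hypothetical four-note phrase, and the same phrase an octave lower.
phrase = [196.0, 165.0, 147.0, 131.0]
level_a, melody_a = split_global_local(phrase)
level_b, melody_b = split_global_local([f / 2 for f in phrase])
```

In this sketch the two voices share the same pattern of local deviations, which would be heard as the same melody, while the average level differs by an octave, which is the component that would carry information about the size of the singer.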


References
