The Perception of Acoustic Scale and Source Size

From CNBH Acoustic Scale Wiki

Jump to: navigation, search
An Introduction to Auditory Perception

This Part of the book describes a series of perceptual experiments with communication sounds which show what everyone intuitively knows; auditory perception is singularly robust to changes in both the resonance rate and the pulse rate of communication sounds [2,3,4,5]. The robustness of auditory perception stands in contrast to the lack of robustness of automatic speech recognition (ASR) systems; a speech recognizer trained on the speech of a man is typically not able to recognize the speech of a woman, let alone the speech of a child. Robustness to changes in GPR and VTL, which we take for granted and think of as trivial, far from trivial in ASR where the pre-processor begins by converting the speech wave into a time-frequency representation similar to the spectrogram. In this case, the only way that the recognizer can deal with scale variability is to learn the distributions from speech data. In effect, the system has to learn the form of each vowel for each value of GPR and each value of VTL. This means that the training sets have to be enormous, which in turn means that the training times become very long and development of new algorithms becomes prohibitive.

The introduction to Part 1 illustrated the changes that occur in the waveforms and magnitude spectra of vowels when there are changes in GPR and/or VTL (Figs 1.1xxx and 1.2xxx). Figure 3.1 presents a similar set of vowels that vary in GPR and VTL. In this case, they are synthetic, "two-formant" vowels designed to illustrate the principles of auditory processing that are the focus of this Part of the book. The message of these vowels is that the vocal tract is in the /a/ shape currently, and this message is contained in the shape of the resonance which is the same in each cycle of each vowel. The bandwidths of the formants are proportional to their center frequencies in this example, and this means that the upper formants decay faster than the lower ones in each case. The two formants can be observed interacting in the first couple of cycles at the start of each resonance; thereafter, the resonance of the lower formant dominates the waveform.

Figure 3.1. Four versions of a synthetic ‘a’ vowel illustrating 1) the effect on the wave of a change in vocal tract length from shorter (upper row) to longer (lower row), and 2) the effect of a change in pitch (glottal pulse rate) from lower (left column, 31 Hz) to higher (right column, 62 Hz). In each case, the wave is composed of three glottal cycles.

The vowels in Fig. 3.1 are readily discriminable but they all convey the same message, /a/. Similarly, the experiments show that we have no difficulty whatsoever understanding when a child and an adult have spoken the same the word [5], despite substantial differences in their pitches and vocal-tract lengths. We also know which speaker has the higher pitch and which speaker is bigger (i.e., which speaker has the longer vocal tract). Perceptual experiments have been performed with vowels [3], syllables [4], musical notes [2] and animal calls [6]; they all lead to the conclusion that auditory perception is singularly robust to the scale variability in communication sounds. It is also the case that the robustness of human perception extends to speech sounds and musical sounds where the pulse rate and resonance rate are scaled well beyond the range of normal experience [3], and scaled independently. The results suggest that the auditory system contains a mechanism that automatically adapts to the GPR and VTL of the sound, and produces some form of invariant representation of the message [7].

The form of communication sounds and the robustness of auditory perception suggest that the auditory representation of communication sounds is somehow `shift covariant' both to changes in resonance scale (scale-shift covariance), and to changes in glottal-pulse period (period-shift covariance). If the space of auditory perception already has these properties, then more central mechanisms do not need to learn about the variability associated with GPR and VTL differences. This would help explain our facility with speech and our ability, as children, to learn speech from people of widely differing sizes.

This Part of the book begins with a simple illustration of how the auditory image can be made shift covariant, that is, how the dimensions of the auditory image can be transformed so that the distribution of information that constitutes the message of a communication tone can be rendered in a form that does not change with changes in GPR and VTL. If the space of auditory perception is itself shift covariant, then more central mechanisms that deal with the learning, memory and recognition of communication sounds will not need to develop complex statistical models concerning the variability associated with GPR and VTL that arises in {time, frequency} space. This would go some way to explaining our facility with speech and language, and our ability as children to learn language from people of widely differing sizes speaking with ever changing pitch.


Contents

Shift covariance: The key to perceptual robustness

The concept of shift covariance and its value in modelling auditory perception can be illustrated using simulated auditory images of the four synthetic vowels of Fig. 3.1; the auditory images are shown in Fig. 3.2. The dimensions of the plane are scale and cycles in {log2(cycles), log2(scale)} form. The important points to note at this juncture are the following: The pattern of activation that represents the message – the auditory figure – has a fixed form in all four panels; it does not vary in shape with changes in pulse rate or resonance rate. When there is a change in resonance rate (vocal tract length) from small to large, the activity just moves vertically down as a unit without deformation; compare the auditory figures in the panels of the upper row to those in the panels of the lower row. The extent of the shift is the logarithm of the ratio of the resonance rates of the sources. When there is a shift in pulse rate from longer to shorter, the auditory figure does not change shape and it does not move vertically; rather, the diagonal which marks the extent of the pulse period, moves vertically without changing either its shape or its angle. This changes the size of the area available for auditory figures, but does not change the shape of the figure, other than to cut off the tails of the resonances when the pulse period is short relative to the resonance duration. The extent of the vertical shift of the diagonal is the logarithm of the ratio of the pulse rates of the sources. Since the pulse rate and the resonance rate were both scaled by the same amount in this example, the vertical shift (between rows) of the auditory figure and the vertical shift of the period-limiting diagonal (between columns) are the same size.

Figure 3.2: Scale-shift covariant auditory images of the four synthetic vowels in Fig. 1 Note that the shape of the distribution is the same all four main panels, despite the differences in VTL and GPR. The representation is scale-shift covariant for both VTL and GPR.Figure 4: Scale-shift covariant auditory images of the four synthetic vowels in Fig. 1 Note that the shape of the distribution is the same all four main panels, despite the differences in VTL and GPR. The representation is scale-shift covariant for both VTL and GPR.

Comparison of panel (a) with panels (b) and (c), and comparison of panel (d) with panels (b) and (c), show that the shift covariance property applies separately to the pulse rate and resonance rate, albeit in a somewhat different form. The independence is important because, although pulse rate and resonance rate typically covary with size as members of a population of animals grow, the correlation in growth rate is far from perfect. For example, in humans, both pulse rate and resonance rate decrease as we grow up; however, whereas VTL is closely correlated with size throughout life, in males, pulse rate takes a sudden drop at puberty.

The figure also illustrates one of the constraints on pulse-resonance communication; although the auditory image itself is rectangular, the auditory figures of pulse-resonance sounds are limited to the upper triangular half of the plane. The lowest component of a sound cannot have a resonance rate below the pulse rate of the sound.

Intuitively, this auditory plane and the auditory figures that arise in it provide a better basis for a model of auditory perception than the traditional {time, frequency} space and the traditional spectrographic representation of speech, inasmuch as the auditory space appears to be able to explain why it is that we can hear the message of a communication sound independent of the size of the sender. If this is the space of auditory perception rather than a {time, frequency} space, and if the mechanisms are inherited and develop as a part of the auditory system, this would help to explain how children manage to learn speech in a world where the samples they experience come from people of such different sizes (parents and siblings). It would also help explain how animals with much smaller brains and shorter life spans than those of humans manage to cope with the size variability in the communication sounds or their species and other species.

The figure also suggests that it may well be possible to produce a speech pre-processor for ASR systems that automatically performs GPR and VTL normalization at the syllable level which should improve ASR performance considerably.

So, how might the auditory system construct a {cycles, scale} space? And how might the auditory system have evolved this form of signal processing?


Work mark noon 17 Apr 2010

In AIM, the operation of the cochlea is simulated with a compressive, gammachirp, auditory filterbank [9], in which filter centre frequency is distributed logarithmically with frequency and the bandwidth of the auditory filter is proportional to filter centre frequency. The envelope of the impulse response is a gamma function whose duration decreases as frequency increases. This means that, mathematically, the operation of the cochlea is more like a wavelet transform [10,11] than a windowed Fourier transform, where the window duration is the same in all channels. This suggests that the auditory system is actually transforming the time waveform of sound into a {time, scale} representation [7,12], rather than a {time, frequency} representation. The {time, scale} versions of the four vowels in Fig. 1 are presented in Fig. 2. The activity in the four panels includes the operation of the hair cells and so it is intended to be a record of the neural activity pattern (NAP) flowing from the cochlea up the auditory nerve to the brain stem in response to each of these sounds. By its nature, the time dimension in this representation is linear and the NAP exists in a {time, scale} space [7].

Figure 2: Neural activity patterns (NAPs) for the four synthetic vowels in Fig. 1. This version of the NAP has dimensions {linear-time, log-scale}. As a result, formant duration decreases as scale increases. Thus, the representation is not scale-shift covariant.

The right-hand ordinate of each panel is acoustic scale, and it is reciprocally related to acoustic frequency through the speed of sound; scale = (1/frequency)*velocity. So {time, scale} space is similar to {time, frequency} space; the right-hand ordinate is the more familiar variable, frequency, in log format. As a variable, however, scale has the advantage that it is more directly related to the wavelength of the sound, and thus, to the size of the resonators in the source. The unit of frequency is cycles per second and the unit of the velocity of sound in air is meters per second, so the unit of scale is meters per cycle. Thus, the NAP pattern in each row of one of these panels is showing the individual cycles of the mechanical vibration produced by a resonator in the source, and the ordinate gives the scale of the cycle, that is, the number of meters per wavelength. This is directly related to the physical size of the resonator that produced this component of the sound, and since resonator size is typically correlated to body size [13], the {time, scale} NAP of a sound contains valuable information about the size of the source.

Although the {time, scale} representation of communication sounds is scale covariant [14], it is not scale-shift covariant. This is illustrated in Fig. 2 by the fact that a change in resonance size produces a change in the shape of the distribution, as well as its position in this time-scale space. The upper formant in each sound is a scaled version of the lower formant, but the representation of the upper formant is compressed in time with respect to the lower formant. Similarly, the distribution of activation expands in time, as a unit, when we switch from the smaller sources in the upper panels to the larger sources in the lower panels. This is the same effect in a different form. The scale information is preserved and it covaries with the dilation of the auditory figure in this {time, scale} representation. But the changes are not orthogonal; a change of acoustic scale in the sound produces a change along the time dimension of the figure as well as in the scale dimension. So, to repeat, the {time, scale} representation of communication sounds is scale covariant [14], but it is not scale-shift covariant.

To achieve scale-shift covariance, we need to expand time as scale decreases so that the unit of scale (one cycle) becomes the same size in each channel. This is accomplished simply by multiplying time as it exists in each channel of the NAP of a sound (e.g., the individual rows of the panels of Fig. 2) by the centre frequency of the channel in question [7]. The unit of this new dimension is [cycles/second] X time, which reduces simply to cycles (of the resonance). This transformation is motivated by a consideration of the operators that can transform {time, scale} space into a scale-shift covariant space – operators which are, at the same time, unitary. Such operators effect coordinate transformations that preserve physical properties like energy, and they have an inverse that enables transformation in the reverse direction. The expressions for the operator, S and its inverse, S-1 are

(Sf)(\gamma,\sigma) :=  (ln2)2^{\frac{\gamma}{2}}f(2^{\gamma-\sigma},2^{\sigma})

and

(S^{-1}f)(t,s)=(ln2)^{-1}(ts)^{-\frac{1}{2}}f(log_2{ts},log_2{s}),

where γ is log2(cycles), or log2c and σ is log2(scale), or log2s. The operator indicates that if we use logarithmic scale units, then the shift of the neural pattern with a change in scale will be restricted to the vertical dimension [7]; that is, it will be orthogonal to the cycles dimension. Similarly, if we use logarithmic cycle units, then the shift of the period-terminating diagonal with a change in pulse rate will also be restricted to the vertical dimension; that is, it too will be orthogonal to the cycles dimension. The expression for the operator says that the function, f, in time-scale space (on the right-hand side), is transformed into the function ‘Sf’ in {γ,σ} space by a particular exponentiation of time and scale. The normalization constant is required to make the transformation unitary. The important point for the current discussion is that this operator fixes the shape of the distribution of activation associated with the message of the sound, so that it no longer changes shape when either the pulse rate or the resonance rate change. When the {time, scale} NAPs in Fig. 2 are transformed to {γ,σ} space, they appear as in Fig. 3.

Figure 3: Neural activity patterns (NAPs) for the four synthetic vowels in Fig. 1. This version of the NAP has dimensions {log-cycles, log-scale}.The shape of the distribution within a cycle does not change with vocal-tract length or pitch, but successive cycles have different shapes.

There are two striking differences between the NAPs as they appear in this new space and the more familiar {time, scale} space. Firstly, there is scale-shift covariance – the property that motivates the change of space. The activity in the first cycle of each NAP is essentially the same in all four panels. Specifically: (1) The pattern of activation that represents the message – the auditory figure – has a fixed form in all four panels; it does not vary in shape with changes in pulse rate or resonance rate. (2) When there is a change in resonance rate, the auditory figure just moves vertically, as a unit without deformation. (3) When there is a change in pulse rate, the auditory figure does not change shape and it does not move vertically; rather, the diagonal which marks the extent of the pulse period, moves vertically without changing either its shape or its angle. The other difference is that the cycles of the sound beyond the first are rotated by the time-to-cycles transformation, and compressed by the transformation from cycles to log2(cycles). The onset of the second cycle of the pattern defines the boundary of the auditory image; it is a positive diagonal (45 degree angle) and its position is defined by the scale of the pulse period, which is indicated by the point where the diagonal intersects the ordinate. The period in panel (c) is 32 ms, so σ is 5 The units are positive, because the coordinate transformation is applied to a time-scale space, where ‘scale’ is the inverse of scale as normally used in wavelet transforms. The start of the third cycle of activity is a parallel, positive diagonal that is shifted down by an octave, and so it would intersect the ordinate at 4. The rotation and progressive compression of the auditory figure in the second and third periods of the sound indicate that {γ,σ} space is not time-shift covariant. That is, successive copies of a sound have different forms. So the progression of auditory figures across this {γ,σ} NAP does not represent time as we perceive it in auditory perception. Auditory perception is time-shift covariant in the sense that we hear the same perception when a sound is played at two different times (separated by a reasonable gap). This suggests that the γ dimension is an extra dimension of auditory space, separate from time. This extra dimension is the time-interval dimension of the original auditory image [8] in the new scale-shift covariant form.

SCALE-SHIFT COVARIANT AUDITORY IMAGES

We assume that the cycles dimension combines with the scale dimension of the NAP to create a {γ,σ} plane of auditory image space, in which the resonance information attached to the latest pulse of a communication sound appears as a scale-shift-covariant auditory figure. The auditory images of the four synthetic vowels of Fig. 1 are shown in Fig. 4. The dimensions of the plane are scale and cycles in {log2(cycles), log2(scale)} form. The pattern of activation that represents the message – the auditory figure – has a fixed form in all four panels; it does not vary in shape with changes in pulse rate or resonance rate. When there is a change in resonance rate (i.e., VTL) from small to large, the activity just moves vertically down as a unit without deformation; compare the auditory figures in the panels of the upper row to those in the panels of the lower row. The extent of the shift is the logarithm of the ratio of the resonance rates of the sources. When there is a shift in pulse rate from longer to shorter, the auditory figure does not change shape and it does not move vertically; rather, the diagonal which marks the extent of the pulse period, moves vertically without changing either its shape or its angle. This changes the size of the area available for auditory figures, but does not change the shape of the figure, other than to cut off the tails of the resonances when the pulse period is short relative to the resonance duration. The extent of the vertical shift of the diagonal is the logarithm of the ratio of the pulse rates of the sources. Since the pulse rate and the resonance rate were both scaled by the same amount in this example, the vertical shift (between rows) of the auditory figure and the vertical shift of the period-limiting diagonal (between columns) are the same size.

Figure 4: Scale-shift covariant auditory images of the four synthetic vowels in Fig. 1 Note that the shape of the distribution is the same all four main panels, despite the differences in VTL and GPR. The representation is scale-shift covariant for both VTL and GPR.Figure 4: Scale-shift covariant auditory images of the four synthetic vowels in Fig. 1 Note that the shape of the distribution is the same all four main panels, despite the differences in VTL and GPR. The representation is scale-shift covariant for both VTL and GPR.

Comparison of the relative positions of the distribution of activation and the period-limiting diagonal show that the shift-covariance property applies separately to the pulse rate and resonance rate, albeit in a somewhat different form. The independence is important because, although pulse rate and resonance rate typically covary with size as members of a population of animals grow, the correlation in growth rate is far from perfect. For example, in humans, both pulse rate and resonance rate decrease as we grow up; however, whereas VTL is closely correlated with size throughout life, in males, pulse rate takes a sudden drop at puberty. The figure also illustrates one of the constraints on pulse-resonance communication; although the auditory image itself is rectangular, the auditory figures of pulse-resonance sounds are limited to the upper triangular half of the plane. The lowest component of a sound cannot have a resonance rate below the pulse rate of the sound.

SUMMARY AND CONCLUSIONS

The scale-shift covariant, auditory image, and the auditory figures produced by communication sounds in this image, would appear to provide a better basis for a model of auditory perception than the traditional {time, frequency} space and the traditional spectrographic representation of speech, inasmuch as the auditory space appears to be able to explain why it is that we can hear the message of a communication sound independent of the size of the sender. If this is the space of auditory perception rather than a {time, frequency} space, and if the mechanisms are inherited and develop as a part of the auditory system, this would help to explain how children manage to learn speech in a world where the samples they experience come from people of such different sizes (parents and siblings). It would also help explain how animals with much smaller brains and shorter life spans than those of humans manage to cope with the size variability in the communication sounds or their species and other species.

The space of auditory perception and the information in the auditory image

In AIM, it is argued that the cochlea and midbrain together construct the auditory image in a {linear-time, log2(cycles), log2(scale)} space. Briefly, the basilar membrane performs a wavelet transform which creates the scale dimension of the auditory image (the vertical dimension). The scale dimension is critical to auditory perception, but the basis of perception is not a time-scale recording of basilar membrane motion (BMM). The inner hair cells and primary nerve fibres simplify the representation of the BMM since their response is limited to the amplitude peaks of the BMM, and it is the resultant neural activity pattern (NAP) that flows from the cochlea up the auditory nerve rather than the BMM. The NAP is a time-scale representation of the information in the sound that can be observed physiologically in the auditory nerve and cochlear nucleus, but as with BMM, the basis of perception cannot be the NAP itself. Firstly, the duration of the NAP record in the auditory system is limited to about 5 ms, which is the time taken for neural signals to go from the cochlea to the inferior colliculus where the code changes and temporal integration occurs. Secondly, from the perspective of perception, there is information in the NAP that we do not hear, indicating that the basis of perception is a representation that is either in, or beyond, the inferior colliculus. Thirdly, to produce scale-shift covariance, a segment of time long enough to contain the resonances of communication sounds has to be converted to a cycles/scale representation. The auditory system exhibits scale-shift covariance for resonances up to 30 ms in durations, or more. The addition of the extra dimension and the form of the {cycles, scale} plane are described in Section 2.4.2 below. The addition of this third dimension provides a space that would appear to have the properties necessary to explain the robustness of auditory perception – a simulation of the auditory images we perceive in response to communication sounds.


Acknowledgements

Research supported by UK MRC (G9900369, G0500221) and EOARD (FA8655-05-1-3043) and JSPS, Grant-in-Aid for Scientific Research (B), 18300060.

References

[1] R.D. Patterson, D.R.R. Smith, R. van Dinther, T.C. Walters: Size Information in the Production and Perception of Communication Sounds. In Auditory perception of sound sources. W.A. Yost, A.N. Popper, R.R. Fay (Eds), Springer Science+Business Media, LLC, New York (2007) in press

[2] R. van Dinther, R.D. Patterson: Perception of acoustic scale and size in musical instrument sounds. Journal of the Acoustical Society of America 120 (2006) 2158-2176.

[3] D.R.R. Smith, R.D. Patterson, R. Turner, H. Kawahara, T. Irino: The processing and perception of size information in speech sounds. Journal of the Acoustical Society of America 117 (2005) 305-318

[4] D. T. Ives, D. R. R. Smith, R. D. Patterson: Discrimination of speaker size from syllable phrases. Journal of the Acoustical Society of America 118 (2005) 3816-3822

[5] D.R.R. Smith, R.D. Patterson: The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age. Journal of the Acoustical Society of America 118 (2005) 3177-3186

[6] A.A. Ghazanfar, H.K. Turesson, J.X. Maier, R. van Dinther, R.D. Patterson, N.K. Logothetis: Vocal tract resonances as indexical cues in rhesus monkeys. Current Biology 17 (2007) 425-430

[7] T. Irino, R.D. Patterson: Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform. Speech Communication 36 (2002)181-203

[8] R.D. Patterson, M. Allerhand, C. Giguère: Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform. J. Acoust. Soc. Am. 98 (1995) 1890-1894

[9] M. Unoki, T. Irino, B. Glasberg, B.C.J. Moore, R.D. Patterson: Comparison of the roex and gammachirp filters as representations of the auditory filter. J. Acoust. Soc. Am. 120 (2006) 1474-1492

[10] I. Daubechies: Ten Lectures on Wavelets. Conf. Series in Applied Math. SIAM, Philadelphia (1992)

[11] H.M. Reimann: Invariance principles for cochlear mechanics: Hearing phases. Journal of the Acoustical Society of America 119 (2006) 997–1004

[12] L. Cohen: The scale transform. IEEE Trans. ASSP 41 (1993) 3275-3292

[13] W. T. Fitch, J. Giedd: Morphology and development of the human vocal tract: A study using magnetic resonance imaging. Journal of the Acoustical Society of America 106 (1999) 1511-1522

[14] R.G. Baraniuk, D.L. Jones: Unitary equivalence: A new twist on signal processing. IEEE Trans. ASSP 43 (1995) 2269–2282

An Introduction to Auditory Perception
Personal tools
Namespaces
Variants
Views
Actions
Navigation