Compression and Adaptation
From CNBH Acoustic Scale Wiki
Roy Patterson
Compression and Adaptation:
Stabilising level and enhancing features in the auditory representation of sound.
In the auditory system, a separate row of hair cells along the inner edge of the basilar partition converts the motion of the partition and outer hair cells into neural transmitter, and the concentration of the neural transmitter determines the probability that sensory neurons attached to the inner hair cells will fire and carry the information about partition motion to the cochlear nucleus. Data from individual inner haircells and primary nerve fibers indicate that during the transduction process, the system compresses and rectifies the partition motion. Furthermore, it adapts to level changes rapidly, and frequency regions with relatively more activity suppress regions with relatively less activity. The presence of these processes indicates that the inner haircells and primary nerve fibers are more than just a transduction mechanism. Rather, these structures should be regarded as a sophisticated signal processing unit designed to stabilise the level of the sound and to remove the smearing introduced by the filtering and stabilising processes.
In AICAP, operations similar to those observed in the cochlea are performed by a functional model of this 'enlightened' transduction. The processing begins with a bank of units that compress and rectify the filtered waves from the gammatone filterbank. Then, a second bank of units applies two-dimensional adaptation to the compressed partition motion; that is, it applies adaptation simultaneously in time and in frequency. Adaptation in the frequency domain appears as suppression. There is one compression and rectification unit and one two-dimensional adaptation unit for each channel of the filterbank. There is also a lowpass filter at the output of each channel to limit temporal accuracy at higher frequencies and so simulate the loss of phase locking at high frequencies. The complete module converts the surface that represents our approximation to basilar partition motion into another surface that represents our approximation to the pattern of neural activity that flows from the cochlea up the auditory nerve to the cochlear nucleus.
The chapter begins with an example intended to illustrate the role of compression in stabilising the pattern of information in a sound across the enormous dynamic range of human hearing. At the same time, the example shows that compression unfortunately reduces contrast around the features that appear in the basilar partition motion. The discussion then proceeds to the adaptation process and a sequence of examples to illustrate how two-dimensional adaptation can restore contrast in the compressed partition motion, and how it can enhance contrast in the simulated neural activity pattern that is the output of this module.
Compression:
Stabilising level in the auditory representation of sound.
The auditory system has an enormous dynamic range that enables us to detect very weak signals one instant and to process sounds with 10 billion times the power a moment later (more than 10 orders of magnitude in energy terms or five orders of magnitude in amplitude terms). It is often noted, without explanation, that the auditory system, and any model of the auditory system, must introduce some form of compression to handle this enormous dynamic range. To understand the need for compression, consider what we hear as a sound changes level over a wide range, and the problem that arises if we try to produce a display of basilar partition motion for the sound as it changes level. For convenience, assume the sound is the vowel /ae/ and that the level for the sound as it appears in Figure 1.7 is 60 dBA, the mid-range of human hearing. Now imagine what happens to the display when the sound is presented once per second and the level changes from one presentation to the next in 12 dB steps, first downwards and then upwards in level. (A 12 dB change in sound level is a factor of 4 in amplitude; that is, 20 log10(4A/A) = 20 * 0.60206, or about 12.) After one step downwards, the display is as shown in Figure 2.1; the glottal periods and the formants are still obvious, but the details in the middle of the glottal period are much less visible. When the level drops another 12 dB, most of the detail is gone and only the tops of the glottal ridges are clearly visible. Further attenuation makes the sound disappear entirely. Auditorily, as the level drops two steps of 12 dB, the sound becomes noticeably quieter but it is clearly the same sound, indicating that the detail is still available to the listener in a way that it is not available in the visual display. Auditorily, the sound has to go down another two steps before it loses some of its identifying characteristics, and one or two steps further still before it fades away entirely.
When the level is raised one step of 12 dB, the display is as shown in Figure 2.2; the glottal periods and the formants are still obvious but the large features begin to obscure the details of the small features. When the level increases another 12 dB, the display is very difficult to interpret because it is dominated by nearly vertical lines that go to and come from peaks that are not visible and so it is not even clear which lines are associated with peaks and which with valleys. Auditorily, as the level rises two steps of 12 dB, the sound becomes noticeably louder but it is clearly the same sound, and the detail is still available among the enlarged ridges and formants in a way that it is not available in the visual display. Auditorily, the sound has to go up another two steps before the sound changes quality noticeably, and then it is because of the upward spread of masking and the intrusion of harmonic distortion. So the range of a display of basilar partition motion has to be carefully tuned to produce figures like those in Chapter 1 where the amplitude scale is linear. Once this is done, the display can be very useful, but it should be noted that a representation with a linear scale does not provide a good representation of sound as we hear it. Similarly, a mechanism designed to process this representation of sound with a numerical range limited to, say 48 dB, (8-bit digits) would alternately overflow (distort) and underflow (drop out) as the level of the vowel passed into and out of its range. 
Extending the example, briefly, if we were to attempt to provide a real-time display of basilar partition motion for people conversing in a room, using a linear representation of sound level, we would find a) that the level of the activity in the display fluctuates wildly, even for a single speaker at a moderate distance from the microphone, and b) that the speech of those near the microphone is off-scale high, while the speech of those some distance from the microphone is off-scale low. The dominant perception from the display would be one of rapid, large-scale vertical motion that does not correspond to the sounds we hear. The people in the room would experience no difficulties with the level variation; indeed they would be largely unaware of the level fluctuations even though they are vast in energy terms. The visual part of this example indicates that the range of a visual display is about 48 dB (8 bits); if the largest feature has a size of 128, then we will have difficulty seeing the shape of features whose peaks are of size 1 or less on the same scale. The auditory nerve also has a limit on its ability to represent levels, and the limit is probably much closer to 8 bits than the 16 bits that would be required to cope with the 96 dB and more that people hear. So how is it that the auditory system as a whole can handle a much larger dynamic range? Part of the answer is that the cochlea compresses the sound level before transmitting it up the auditory nerve -- a process that stabilises the pattern of information in the partition motion across levels.
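The level arithmetic used in this example can be checked with a few lines of Python. This is only a sketch of the standard decibel conversions; the function names are illustrative and do not come from AICAP.

```python
import math

def amplitude_ratio_to_db(ratio):
    """Level difference in dB for a ratio of amplitudes."""
    return 20.0 * math.log10(ratio)

def power_ratio_to_db(ratio):
    """Level difference in dB for a ratio of powers (energies)."""
    return 10.0 * math.log10(ratio)

def linear_range_db(bits):
    """Dynamic range in dB of a linear scale with 2**bits distinct steps."""
    return amplitude_ratio_to_db(2 ** bits)

step_db = amplitude_ratio_to_db(4)      # one step of the vowel example
hearing_db = power_ratio_to_db(1e10)    # 10 billion times the power
range_8bit = linear_range_db(8)         # linear display or 8-bit mechanism
range_16bit = linear_range_db(16)       # range needed for 96 dB and more

print(f"factor of 4 in amplitude: {step_db:.2f} dB")
print(f"factor of 1e10 in power:  {hearing_db:.0f} dB")
print(f"8-bit linear range:       {range_8bit:.1f} dB")
print(f"16-bit linear range:      {range_16bit:.1f} dB")
```

Each bit of a linear scale contributes about 6 dB, which is why 8 bits cover roughly 48 dB and 16 bits roughly 96 dB.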
Compression in the auditory system and in AICAP: The cochlea applies compression to sound in two stages. First, the peaks in the mechanical motion of the basilar partition increase monotonically in height as the input sound level increases, but they do not increase in direct proportion to the sound level. At high levels, it takes a larger increase in energy to raise a peak than it does at low levels. So the range of peak heights associated with a 10-dB range of sound levels is compressed at high sound levels with respect to that at low sound levels. Second, the amount of neural transmitter deposited in the synapses between an inner haircell and its attendant primary fibers increases monotonically with the force on the hairs on the top of the cell, but the transmitter level does not increase in direct proportion to the force level. The range of transmitter levels is compressed with respect to the range of force levels. We know a considerable amount about the shape of the compression function for an individual hair cell, and something about the compression applied by the outer haircells. However, what we need for a functional model of the cochlea is the form of compression for the system as a whole, as seen, for example, from the perspective of the cochlear nucleus, and this is still not known. We assume that the auditory system includes compression for three related reasons: to increase its dynamic range, to stabilise sound level in the auditory representation, and to preserve pattern over level changes, and it is these general properties that we attempt to preserve in AICAP.
The compressive action of the outer and the inner haircells is combined and simulated with one stage of logarithmic compression in AICAP. The compression is applied to the waveform at the output of the auditory filter. Logarithmic compression has the advantage that it produces output surfaces with identical shapes for sounds that differ only in level. It is also a particularly appropriate compression function for adaptation to the output of gammatone filters, as will become clear in the next Section. The form of the compression function for the auditory system as a whole is not known in detail; it is probably less compressive than a logarithmic function but whether the difference is significant is, as yet, unclear. Thus, the strategy with regard to compression in the transduction module is as follows: We have chosen to mimic auditory physiology to the extent of implementing compression at the point in the processing chain where we see strong compression in the auditory system. At the same time, in the face of incomplete physiological information as to the form of the compression in the auditory system, we have chosen to implement the compression in a form that we understand mathematically. It is assumed that logarithmic compression will lead to a surface of simulated neural activity that is not too different from that in the auditory nerve.
Contrast reduction: When logarithmic compression is applied to the basilar partition motion produced by the vowel /ae/ (Figure 1.7), the result is the compressed vowel surface shown in Figure 2.3. This representation has the distinct advantage that when the level changes the shape of the surface does not change -- the surface rises or falls as a unit by a relatively small amount. With regard to the conversation example, the fluctuations in level would no longer be the dominant perception and speakers near to and far from the microphone would be visible on the same scale. Unfortunately, the compression operation has a troublesome side-effect: it reduces the contrast of the features that appear in the basilar membrane motion. That is, the formants are less well-defined in this representation, and this is true at all levels. The reduction in contrast occurs simply because, in the output representation, small elements are increased in numerical value relative to large elements, and this is true for all compression processes. Adaptation alleviates the contrast reduction and that is the topic of the next Section of the Chapter. Before proceeding, we turn briefly to the topic of rectification.
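Both properties of logarithmic compression described above can be seen in a small numerical sketch. The envelope values below are hypothetical (they are not taken from Figure 1.7 or Figure 2.3), and the contrast measure, the ratio of a small peak to a large peak with levels expressed relative to a reference of 1.0, is an assumption chosen only for illustration.

```python
import numpy as np

# Hypothetical envelope samples: a large peak (100) and a smaller one (10).
envelope = np.array([1.0, 100.0, 2.0, 10.0, 1.0])
louder = 4.0 * envelope                  # the same sound, 12 dB higher

compressed = np.log10(envelope)
compressed_louder = np.log10(louder)

# Shape invariance: a level change adds the same constant to every point,
# because log10(4 * x) = log10(4) + log10(x).
shift = compressed_louder - compressed

# Contrast reduction: the small peak is a tenth of the large peak before
# compression, but half of it afterwards (levels relative to 1.0).
ratio_linear = envelope[3] / envelope[1]
ratio_compressed = compressed[3] / compressed[1]

print("shift per point:", shift)
print("small/large before compression:", ratio_linear)
print("small/large after compression: ", ratio_compressed)
```

The constant shift corresponds to the surface rising or falling as a unit, and the change in the small-to-large ratio corresponds to the loss of contrast that adaptation is later used to restore.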
Rectification: In the auditory transduction process, the response to motion is highly asymmetric; that is, the degree of response is determined by the degree of motion in one direction, but in the opposite direction, there is virtually no response. In the case of the outer hair cells, the asymmetry is caused by the physical arrangement of the hairs on the cells. The hairs are pulled apart by motion in one direction; this opens ion channels and initiates a response whose level increases with the degree to which the hairs are separated. Motion in the opposite direction pushes the hairs together and simply holds the ion channels shut. So we have a monotonic response in one direction and virtually no response in the opposite direction. In the case of the inner haircells and primary nerve fibers, motion in one direction causes neural transmitter to be released into the synapses between the haircell and its primary nerve fibers, and the amount of transmitter is determined by the degree of motion. Motion in the opposite direction simply causes a cessation of transmitter release. The primary fibers fire when the transmitter level is increased above some threshold, and so the response of this subsystem is highly asymmetric. Whether the rectification observed in the cochlea has a purpose in signal processing terms is not clear. In some radio transmission systems, rectification is used in the receiver to help separate a modulation signal from the carrier wave. In speech sounds, the glottal pulse rate appears in the upper channels of the basilar partition motion as a modulation of the sound's harmonics. The glottal rate determines the pitch of the sound, which plays an important role in intonation, but it is not clear whether the rectification we observe plays an important role in extracting the pitch of speech sounds. In any event, in AICAP, rectification is applied to the basilar partition motion at the same time as the compression.
Whenever negative portions are encountered in the surface of partition motion, they are simply set to a floor value just above zero -- a procedure that half-wave rectifies the partition motion. Thus, as in the case of compression, we have chosen to mimic auditory physiology to the extent of implementing rectification at the point in the processing chain where we see rectification in the auditory system. We have implemented the rectification in a simple mathematical form and it is assumed that the pattern of information in the rectified surface of simulated neural activity is not too different from the information in the auditory nerve.
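A minimal sketch of a combined rectification and compression unit for one channel might look like the following. The floor value, the sample rate and the choice to express the output relative to the floor (so that rectified portions map to zero) are assumptions made for illustration; the source does not give AICAP's actual values.

```python
import numpy as np

FLOOR = 1e-3  # assumed floor "just above zero"; not a value from AICAP

def rectify_and_compress(wave, floor=FLOOR):
    """Half-wave rectify a filtered wave, then log-compress it.

    Negative portions (and positive values smaller than the floor) are
    set to the floor, so the logarithm is always defined. The output is
    expressed relative to the floor, so rectified portions come out as 0.
    """
    rectified = np.maximum(wave, floor)
    return np.log10(rectified / floor)

# Two cycles of a unit-amplitude 1-kHz sinusoid standing in for the
# output of one auditory filter (assumed sample rate of 16 kHz).
fs = 16000.0
t = np.arange(32) / fs
wave = np.sin(2.0 * np.pi * 1000.0 * t)
compressed = rectify_and_compress(wave)

print(np.round(compressed, 2))
```

With these assumed values the positive half-cycles peak at about 3 (three decades above the floor) and the negative half-cycles are flat at 0.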
Two-Dimensional Adaptation:
In hearing research, we are accustomed to thinking of adaptation in the time domain and suppression in the frequency domain. In point of fact, adaptation and suppression are just contrasting perspectives on the results of a sharpening process, and both terms can be applied to both domains. A sharpening process elevates peaks and deepens valleys in a given function. Suppression emphasises what is lost in the sharpening process -- the low-level information between peaks. Adaptation emphasises what is preserved -- information in the region of the peaks. When speaking of auditory adaptation, we are usually describing how the system preserves important temporal information as the input level varies over a large range. At the same time, we should note that the process also suppresses low-level activity between temporal peaks. When speaking of auditory suppression, we are usually describing how a small signal at one frequency is driven below threshold by a large masker at a nearby frequency. At the same time, we should note that the process also enhances high-level peaks between spectral valleys. In AICAP, the sharpening is performed simultaneously in time and in frequency and, within a two-dimensional region defined by the bandwidth and the integration time of the auditory filter, the time and frequency information interact. So, in labelling the process 'two-dimensional adaptation', Holdsworth chose to emphasise a) the positive characteristic of the process -- adaptation, rather than suppression, b) the similarity of the sharpening processes in the two dimensions, and c) the coupling of the dimensions. This Section begins with a pair of examples intended to explain adaptation, first in the time domain and then in the frequency domain. Brief versions of these examples appeared originally in Holdsworth (1990).
Then, the complete process is applied to the compressed partition motion of the click train and the vowel /ae/ to illustrate the application of two-dimensional adaptation to broadband sounds, and the effect of the main parameters on the simulated neural activity pattern that is the output of this module.
Adaptation in the Time Domain:
Consider the problem that arises when one attempts to distinguish a click and an idealised formant as they appear at the output of one channel of the auditory filterbank. The sounds are presented as a composite signal in Figure 2.4a with the click at the origin (0 ms) and the formant beginning 20 ms later. The idealised formant is a click that has been passed through a resonance. When this composite sound excites the auditory filter centred at 1.0 kHz, the result is as shown in Figure 2.4b. Whereas the two components of the sound appear very different in Figure 2.4a, they appear rather similar in Figure 2.4b. Basically, this is because they occur in conjunction with the impulse response of the filter in Figure 2.4b; that is, they have been convolved with the impulse response of the filter and, since it is long relative to the click, it extends the click in time and imparts its own shape to the convolution product. This smearing of the temporal structure of a click sound is referred to as 'ringing', from the way a bell rings when struck by a hammer (the input of a large click!). Ringing is an unavoidable by-product of the spectral analysis performed by the auditory filterbank and the basilar partition operates under precisely the same constraint. Thus, much of the activity in the basilar partition response to click trains and streams of glottal pulses in Chapter 1 is due to ringing of the basilar partition rather than the sound, per se.
It is important to be able to distinguish a click from a formant and we would like the auditory representation of sound to present the difference, as much as possible, in its original form. That is, we would like the click to be more constrained in time -- more click-like. In general terms, this suggests that one should measure the output of the auditory filter relative to the filter's impulse response. In practice, this is accomplished using information about the envelope of the impulse response of the auditory filter and an adaptive thresholding mechanism. The argument is most easily understood in terms of compressed partition motion; the compressed version of the filtered composite sound is shown in Figure 2.4c by the solid line. The peaks of the compressed click response show the envelope of the filter's impulse response. Beyond the maximum in the response, the rate of decay is close to linear in this coordinate system because the decay term of the gammatone filter is a negative exponential and the compression is logarithmic. The dots inside the compressed formant in Figure 2.4c show the envelope of the impulse response of the filter. Comparison of the stream of dots and the peaks of the compressed formant shows that the formant causes the filterbank output to decay more slowly than in the case of the click. The difference can be enhanced, as suggested above, by measuring the output of the filter relative to the filter's impulse response. The enhancement is implemented with the aid of a process of adaptive thresholding since this provides control of the enhancement and enables us to determine the relative emphasis of large and small features in the simulated neural activity pattern.
The dotted line in Figure 2.4c shows a typical adaptive-threshold contour. In this particular case, there is a contrast between the rate at which the threshold is permitted to rise and the rate at which it is permitted to decay. There is virtually no restriction on the rate of rise and so the adaptive threshold follows the leading edge of the signal at onset. The rate of decay is limited to a rate slightly less than that of the impulse response of the filter. As a result, cycles of the compressed filterbank output which are simply due to the ringing of the filter fall just below the adaptive threshold as shown in the lefthand portion of Figure 2.4c. In the righthand portion, the compressed formant response repeatedly exceeds the adaptive threshold showing that, despite its similarity to the compressed click response, the compressed formant represents a different sound source. Thus, adaptive thresholding helps discriminate between aspects of partition motion that represent characteristics of the source and aspects that just reflect the constraints for the spectral analysis process.
When the compressed filter response in Figure 2.4c is measured relative to the adaptive threshold in that sub-figure, the result is as shown in Figure 2.4d. Here the click response is much shorter than the formant response, and more like the original click. In AICAP, it is this representation of sound that forms the basis of the surface that is the output of the compression and adaptation module. The surface is intended to simulate the neural activity pattern flowing from the cochlea to the cochlear nucleus and so it needs to include the effects of adaptation in the frequency domain before it is complete.
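The time-domain argument can be sketched in a few lines of Python. This is not Holdsworth's circuit, which is described in Holdsworth (1990); it is a simplified illustration under several stated assumptions. The filter ringing is modelled as a single exponentially damped cosine rather than a gammatone response, the formant is approximated by the same oscillation with half the decay rate, and the bandwidth, floor and sample rate are hypothetical values.

```python
import numpy as np

FS = 20000.0   # assumed sample rate (Hz)
FC = 1000.0    # channel centre frequency (Hz), as in the example
B = 135.0      # assumed decay parameter of the filter (Hz)
FLOOR = 1e-4   # assumed rectification floor

def ringing(t, decay_hz):
    """Exponentially damped cosine: a simplified stand-in for ringing."""
    return np.exp(-2.0 * np.pi * decay_hz * t) * np.cos(2.0 * np.pi * FC * t)

def compress(wave):
    """Half-wave rectify and log-compress, relative to the floor."""
    return np.log10(np.maximum(wave, FLOOR) / FLOOR)

def adapt_in_time(x, decay_per_sample):
    """Measure x relative to a threshold with a limited rate of decay.

    The threshold may rise without restriction but can only fall by
    decay_per_sample at each step. The output is the amount by which
    the input exceeds the decayed threshold, or zero.
    """
    threshold = 0.0
    out = np.zeros_like(x)
    for n, value in enumerate(x):
        threshold = max(threshold - decay_per_sample, 0.0)
        out[n] = max(value - threshold, 0.0)
        threshold = max(threshold, value)
    return out

t = np.arange(int(0.05 * FS)) / FS
onset = int(0.02 * FS)                      # formant starts 20 ms later
wave = ringing(t, B)                        # click response
wave[onset:] += ringing(t[: len(t) - onset], 0.5 * B)  # slower decay

# Decay of the filter envelope in log10 units per sample; the threshold
# is allowed to fall slightly more slowly than this.
filter_decay = 2.0 * np.pi * B * np.log10(np.e) / FS
nap = adapt_in_time(compress(wave), 0.9 * filter_decay)

click_hits = int(np.count_nonzero(nap[:onset]))
formant_hits = int(np.count_nonzero(nap[onset:]))
print("samples above threshold, click segment:  ", click_hits)
print("samples above threshold, formant segment:", formant_hits)
```

In this sketch only the onset of the click exceeds the threshold, because every later cycle of ringing decays faster than the threshold is allowed to fall, whereas the slower decay of the formant carries it above the threshold once per cycle.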
Adaptation in the Frequency Domain:
The procedure for implementing adaptive thresholding in the frequency domain is analogous to, although not strictly the same as, the procedure for implementing the process in the time domain. In the frequency domain, the results are often interpreted as suppression rather than enhancement. Once again we use a composite sound to demonstrate the operation of the mechanism. In this case it is a pair of concurrent sinusoids, one at 1000 Hz and the other at 2300 Hz, and the latter component is 24 dB weaker than the former. The composite waveform is shown in Figure 2.3a; it looks like a 1-kHz sine wave because the high frequency component is so much lower in level (a factor of 16 in amplitude). The long-term power spectrum of the composite sound is shown in Figure 2.3b.
The auditory spectral analysis of this composite signal is shown in Figure 2.3c as a plot of signal power versus auditory filter centre-frequency. In point of fact, the solid line in Figure 2.3c shows the instantaneous value of the sustaining envelope of the filterbank output; that is, the solid line shows the contour produced by joining the maxima of all of the individual filter envelopes at this instant in time. Since the skirts of the individual filters are not infinitely steep, all of the filters in the region of the 1-kHz tone respond to the tone. The degree of response decreases as the difference between the filter centre frequency and the tone frequency increases, but the spread of activity is significant about both tones. This spreading is an unavoidable property of any spectral analysis that has a finite temporal response and it is analogous to ringing in the time domain. The analogy suggests that we could enhance differences between spectral components of sounds, if we measured activity in the region of a peak relative to the smearing function imposed by the filters' finite bandwidths. That is, the output representation should take note of the fact that activity in the region of a spectral peak can only fall away at a certain rate, as determined by the auditory filter shape.
Once again we use the adaptive thresholding technique to prevent the mechanism from producing output that is simply a by-product of the operation of the filterbank itself. We construct a threshold that is suspended by the peak of a local maximum and which can only drop away from that local maximum at a limited rate as shown by the dotted line in Figure 2.3c. The output of the device is measured relative to the adaptive threshold and it is shown for the composite sound in Figure 2.3d. Note that the features in Figure 2.3d associated with the two frequency components are considerably sharpened relative to their representation in Figure 2.3c, and the activity in the region between them is driven below the floor of the adaptive threshold.
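The frequency-domain threshold can be sketched in the same spirit. Again, this is an illustration rather than the AICAP implementation: the channel axis, the skirt slope of 3 dB per channel and the threshold slope of 2.5 dB per channel are hypothetical values, and the excitation profile is an idealised triangle around each component.

```python
import numpy as np

def adapt_in_frequency(profile, slope):
    """Measure a spectral profile relative to a threshold with limited slope.

    For each channel, the threshold is the highest level that any other
    channel can impose on it, given that the threshold can fall away from
    a channel by at most `slope` per channel of separation. The output is
    the amount by which each channel exceeds its threshold, or zero.
    """
    idx = np.arange(len(profile))
    distance = np.abs(idx[:, None] - idx[None, :]).astype(float)
    cones = profile[None, :] - slope * distance   # cones[c, other]
    np.fill_diagonal(cones, -np.inf)              # ignore a channel's own level
    threshold = np.maximum(cones.max(axis=1), 0.0)
    return threshold, np.maximum(profile - threshold, 0.0)

# Idealised compressed profile (in dB above the floor) for two components:
# a strong one and one that is 24 dB weaker, each smeared by filter skirts.
channels = np.arange(100)
strong = 60.0 - 3.0 * np.abs(channels - 30)
weak = 36.0 - 3.0 * np.abs(channels - 70)
profile = np.maximum(np.maximum(strong, weak), 0.0)

threshold, excess = adapt_in_frequency(profile, slope=2.5)
surviving = np.flatnonzero(excess).tolist()
print("channels above threshold:", surviving)
```

Because the threshold falls away from each peak a little less steeply than the filter skirts, the skirt channels and the valley between the components fall below it, and only the two peak channels remain. In this simplified form the size of the excess depends on the two slopes rather than on the level of the component, so the sketch shows the sharpening but not the scaling of the output.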
There are two aspects to the adaptation process: on the one hand it enables us to restrict the amount of ringing and spectral smearing that appears in the representation of sound; on the other hand, it enables us to go beyond simple adaptive filtering and determine the emphasis between large and small features in the output representation. Note that, in both the time domain and the frequency domain, the adaptive threshold falls away from the local maximum at a rate that is not quite as fast as the filterbank would permit. This reduction of slope means that small features will be suppressed in the region of a large feature both in time and in frequency. In general, this is a useful property when attempting to separate signals from noise, and it is a property observed in auditory nerve fibers. In AICAP, the surface that is produced by the application of two-dimensional adaptation to the compressed and rectified surface of basilar partition motion is the model's simulation of the neural activity pattern (NAP) that flows from the cochlea to the cochlear nucleus. The control and effects of the adaptation are the topic of the next sub-Section. The details of adaptive thresholding and a circuit diagram for the process are presented in Holdsworth (1990).
The compression/adaptation module is a functional model of the transduction process observed in the cochlea, rather than a collection of units, each of which is intended to simulate the activity of a single outer or inner hair cell. There is only one cascade of compressor, rectifier, adaptation unit and lowpass filter for each auditory filter, and that single cascade is intended to represent the activity of all of the outer and inner hair cells and all the primary fibers associated with that auditory filter, or that 0.9 mm of basilar partition. So the individual pulses that make up the fine structure of the NAP are not individual nerve spikes; nor are they the probability functions for individual nerve spikes. Rather, it is better to think of an individual pulse in the NAP as the aggregate activity of all the primary fibers associated with the region of one auditory filter, as they converge on the dendrites of a primary reception cell in the cochlear nucleus. Together, the gammatone auditory filterbank and the compression/adaptation transduction constitute a cochlea simulation that can process broadband, everyday sounds with a wide range of levels. It converts a sound wave into a representation of the complex neural activity pattern that arrives at the cochlear nucleus in response to the sound.
Feature enhancement in the neural activity pattern (NAP):
The simulated neural activity pattern (NAP) produced by the full cochlea simulation in response to the 8-ms click train is shown in Figure 2.6. In the high-frequency channels, each click has a sharper onset, a higher peak and a shorter duration than it has in the basilar partition motion (Figure 1.xxx), indicating that much of the temporal smearing of the gammatone filters has been removed. In the low-frequency channels, the spectral sharpening has enhanced the separation of the resolved harmonics partly by elevating the peaks and partly by driving activity in the valleys below threshold. There is also temporal sharpening; the individual half-cycles of the rectified filtered waves have been sharpened in time so that their bases occupy about one quarter, rather than one half, of the period of the harmonic. Thus, two-dimensional adaptation enhances both the spectral and the temporal information in the representation of neural activity and it does it at the level of the fine structure of the pattern as well as at the level of the envelope.
The feature enhancement process is much more effective than the click train example might at first indicate, since the enhancement is actually performed on the compressed filterbank output, rather than on the basilar partition motion. The NAP produced by the vowel /ae/ is shown in Figure 2.7 and the compressed partition motion from which it was derived is that shown in Figure 2.3. In this earlier figure, the ringing in the region of the formants carries on throughout the glottal cycle and the formants are so smeared in the spectral dimension that they would be hard to distinguish if we were not already familiar with the partition motion for the vowel. Following adaptation (Figure 2.7), the formants have been sharpened, even when compared with the partition motion (Figure 1.xxx), and much of the filterbank ringing has been removed. The three features with triangular footprints in the upper half of the figure are formants 2, 3 and 4. The formants get shorter in duration as formant number increases because auditory filters get wider and ring less as centre frequency increases. At onset, each glottal pulse produces a transient that joins the upper formants momentarily -- a characteristic that would assist in recognising that these features come from one source. At the top of the figure, in the channels associated with the fourth formant, there is a patch of activity in each cycle of the vowel just after the formant proper dies away. This is a consistent feature of this speaker's voice that is not readily observed in the basilar partition motion.
The parameters that control adaptation were set to moderate values when generating the NAP in Figure 2.7. The effect of adaptation in the time domain on a natural, broadband sound is illustrated in the next pair of figures, where the temporal adaptation parameter has been increased to a relatively high level (Figure 2.8) and decreased to a relatively low level (Figure 2.9). A comparison of the upper parts of the two figures shows that a) the glottal pulse ridge is enhanced by adaptation, b) the formants are enhanced inasmuch as detail in the region around the formants is suppressed, and c) the formant duration is curtailed by adaptation, although the shape of the triangular footprint of formants 2, 3 and 4 is largely unchanged. The level of adaptation in Figure 2.8 is almost undoubtedly too high to represent the hearing of young normal listeners, since virtually all of the activity in the valleys between glottal pulses is removed and these listeners hear at least some of this information. The level of adaptation in Figure 2.9 is almost undoubtedly too low, as will become clear later when the speech is presented in noise. A comparison of the lower halves of the two figures shows how adaptation increases the resolution of the sound's lower harmonics. When the level of adaptation is low (Figure 2.9), the glottal ridges run virtually unbroken down to the lowest harmonics; the patterns of the second and third harmonics are connected once per cycle. As the adaptation increases (Figures 2.7 and 2.8), the harmonics become separated except in the region of the first formant where the patterns are still connected vertically. So in the low frequency region of broadband sounds, adaptation does not suppress activity midway through the glottal cycle; rather it suppresses activity along the edges of partially resolved harmonics and so increases their isolation.
The effect of adaptation in the frequency domain on a natural, broadband sound is illustrated in Figures 2.10 and 2.11, where the frequency adaptation parameter has been increased to a relatively high level (Figure 2.10) and decreased to a relatively low level (Figure 2.11). A comparison of the upper parts of the two figures shows that a) the glottal pulse ridge is enhanced a little by frequency adaptation, b) the formants are enhanced as detail in the surrounding region is suppressed, but the effect is not as large as that induced by temporal adaptation, and c) the formant duration is extended by adaptation; the shape of the triangular footprint of the formants is elongated. The appropriate level for frequency adaptation is discussed in the next Section. A comparison of the lower halves of the two figures shows that frequency adaptation has relatively stronger effects at lower center frequencies. It increases the resolution of the lower harmonics. When the level of adaptation is low (Figure 2.11), the pattern is very similar to that when temporal adaptation is low (Figure 2.9). When the adaptation level is high (Figure 2.10), harmonics 2 and 3 are completely isolated. At the same time, the harmonics in the first formant are still connected vertically.
Finally, note that the low-frequency channels at the bottom of the figure are still delayed relative to higher-frequency channels. This is important because we do not hear changes in a sound (monaurally) when the degree of low-frequency delay is manipulated by as much as five milliseconds. This is one demonstration of the fact that there is information in the neural pattern travelling up the auditory nerve that we do not hear, which, in turn, indicates that there must be at least one more stage of processing before our initial image of the sound is formed.
Excitation patterns: the development of frequency selectivity and adaptation over time.
The discussion of adaptation in the previous Section was primarily concerned with the sharpening of features in that part of the neural activity pattern associated with the steady-state part of the sound. The topic of this Section is the onset of adaptation and sharpening. The discussion begins with a review of the development of frequency selectivity at the onset of a sound for the click train and /ae/, so that the development of frequency selectivity can be distinguished from sharpening due to adaptation.
The development of frequency selectivity in the auditory filterbank is illustrated in Figure 2.12, which shows a sequence of spectra calculated at 8-ms intervals over the first 64 ms of the 8-ms click train. Specifically, the spectra show the outputs of 256 auditory filters, spanning the region 100 to 5000 Hz, after compression, rectification and integration by a two-stage lowpass filter with an integration time of 16 ms. The abscissa is filter center frequency on an ERB scale; the ordinate is log-compressed magnitude so it is xxx decibel scale. The first spectrum, at the bottom of the figure, shows no selectivity in any frequency region since the sound to this point is just a single isolated click. The fact that it is the beginning of a periodic sound is not known to the system at this time. In the upper half of the spectrum, where no resolved harmonics develop, the spectra rise monotonically to their steady-state level. The rolloff in frequency above 4 kHz is due to the audiogram and the fact that the clicks have a finite duration. In the lower half of the spectrum, the second spectrum shows the beginnings of selectivity for the lower harmonics, and by the third spectrum the frequency bands of the resolved harmonics are apparent. The selectivity continues to develop over the next four spectra as the peaks of the resolved harmonics grow faster than the floors of the valleys between them. The rolloff in frequency below the third harmonic (375 Hz) is due to the audiogram; in the absence of the audiogram weighting function the resolved harmonics all have the same level. The functions are referred to as 'auditory spectra' to distinguish them from the traditional linear spectra derived with the Fourier transform.
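The channel layout and the harmonic positions described above can be sketched numerically. The following is a minimal illustration, not the code used to generate Figure 2.12. It assumes the ERB-rate formula of Glasberg and Moore (1990), which the text does not name, and all function names are hypothetical. It places 256 centre frequencies at equal ERB-rate steps between 100 and 5000 Hz and locates the harmonics of the 8-ms click train (fundamental 1/0.008 s = 125 Hz) on that scale.

```python
import math

def hz_to_erb_rate(f_hz):
    """ERB-rate for f_hz; Glasberg & Moore (1990) form, an assumption here."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437

def centre_frequencies(f_low=100.0, f_high=5000.0, n_channels=256):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    e_low, e_high = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (e_high - e_low) / (n_channels - 1)
    return [erb_rate_to_hz(e_low + i * step) for i in range(n_channels)]

period_s = 0.008          # click period of the 8-ms click train
f0 = 1.0 / period_s       # fundamental, 125 Hz
cfs = centre_frequencies()

# (harmonic number, frequency in Hz, position on the ERB-rate scale)
harmonics = [(n, n * f0, hz_to_erb_rate(n * f0)) for n in range(1, 13)]

print(f"{len(cfs)} channels from {cfs[0]:.1f} to {cfs[-1]:.1f} Hz")
for n, f, e in harmonics:
    print(f"harmonic {n:2d}: {f:6.1f} Hz at {e:5.2f} ERB")
```

On this scale the spacing between successive harmonics shrinks as harmonic number increases, which is consistent with only the lower harmonics appearing as separate peaks.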
A similar sequence of spectra for a 64-ms segment of the steady-state portion of the vowel /ae/ is shown in Figure 2.13. It is not vowel onset that is portrayed, but the development of selectivity for a steady-state vowel segment. The lower half of the spectrum shows the development of the resolved harmonics of the vowel; the first formant is indicated by the pair of large harmonics at ERBs 11.5 and 13 -- the fourth and fifth harmonics, respectively. The upper half of the spectrum shows the development of the second, third and fourth formants rather than high harmonics. The development of selectivity is similar to that for the resolved harmonics in the sense that the level rises monotonically at all frequencies and the peaks grow faster than the valleys. The harmonics themselves are not resolved in this region of the spectrum. The formant peaks show that selectivity develops over the first 64 ms of a sound in this region of the spectrum in the same way as it does at lower frequencies when there are spectral contrasts on the appropriate scale.
When two-dimensional adaptation is applied to the compressed partition motion before temporal integration, the auditory spectra of the click train are sharpened as shown in Figure 2.14. There are now eleven resolved harmonics in contrast to the six in the auditory spectra. The peak-to-valley ratios around the resolved harmonics are much larger and the width of the resolved components is much narrower. Furthermore, in the valleys between peaks, level does not grow monotonically with time in the region of the first six harmonics; the second excitation pattern is entirely above the first in this region, but thereafter, the valleys of successive excitation patterns are progressively deeper than their predecessors. The attenuation of the audiogram reduces the size of the effects in the region of the first harmonic. For harmonics seven through eleven, the valley levels rise to an asymptotic value in the third pattern, while the peaks go on rising for another 25 ms or so. Finally, in the region of unresolved harmonics, the excitation patterns simply rise monotonically to their asymptotic level. Similar effects are observed for the steady-state vowel /ae/, as shown in Figure 2.15. The resolved harmonics in the lower portion of the patterns are more resolved than in the corresponding auditory spectra because the valleys are suppressed to lower and lower values from the third excitation pattern onwards. In the upper portion of the patterns, the formants are sharpened in two ways: the peak-to-valley ratios are greater even though the valleys between formants are not progressively suppressed to lower and lower values, and the harmonics in the formant peaks become partially resolved. Finally, note that the sixth and eighth harmonics become clearly resolved in the excitation pattern in the region between the first and second formants.
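One way to make 'resolved' and 'peak-to-valley ratio' concrete is to measure them on a sampled level curve. The sketch below is an illustration only: the 6-dB criterion, the function names and the toy level values are all assumptions introduced here, not values read from the figures. It counts a local maximum as a resolved peak when it stands above the valleys on both sides by at least the criterion.

```python
def local_extrema(levels_db):
    """Return indices of strict interior local maxima and minima."""
    peaks, valleys = [], []
    for i in range(1, len(levels_db) - 1):
        left, mid, right = levels_db[i - 1], levels_db[i], levels_db[i + 1]
        if mid > left and mid > right:
            peaks.append(i)
        elif mid < left and mid < right:
            valleys.append(i)
    return peaks, valleys

def resolved_peaks(levels_db, criterion_db=6.0):
    """List (index, depth in dB) for peaks that meet the criterion.

    Depth is the peak height above the higher of its two neighbouring
    valleys. Peaks without a valley on both sides are skipped.
    """
    peaks, valleys = local_extrema(levels_db)
    result = []
    for p in peaks:
        below = [v for v in valleys if v < p]
        above = [v for v in valleys if v > p]
        if not below or not above:
            continue
        depth = levels_db[p] - max(levels_db[below[-1]], levels_db[above[0]])
        if depth >= criterion_db:
            result.append((p, depth))
    return result

# Toy curves (hypothetical dB values): same peak positions, but the
# second curve has deeper valleys, as adaptation is said to produce.
auditory_spectrum = [30, 40, 36, 42, 39, 43, 41, 44, 40]
excitation_pattern = [30, 40, 28, 42, 30, 43, 38, 44, 40]

print(resolved_peaks(auditory_spectrum))
print(resolved_peaks(excitation_pattern))
```

With a fixed criterion of this kind, deepening the valleys is enough to raise the count of resolved components even when the peak positions are unchanged.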
The excitation patterns for the click train and vowel in Figures 2.14 and 2.15 were produced with moderate adaptation levels, both in time and in frequency -- the same levels as were used to produce the neural activity patterns with moderate adaptation in Figures 2.6 and 2.7. When the level of frequency adaptation is increased to the higher level illustrated in the NAP of Figure 2.10, or to the lower level illustrated in the NAP of Figure 2.11, the sequence of excitation patterns (EPs) is as shown in Figures 2.16 and 2.17, respectively. The phenomena described above for the condition of moderate frequency adaptation still apply when the level of adaptation is increased or decreased; there are more resolved harmonics, the peak-to-valley ratios are greater because the valleys are suppressed, and level rises monotonically at high frequencies where there are no spectral contrasts in the sound. However, the difference in the range of the effects is striking and there is a pronounced frequency interaction. As the level of adaptation increases, not only are the valleys suppressed, but the peaks are raised, and the effect is much greater for the low harmonics where the filters are narrow. The combined effects of narrowband filtering and sharpening emphasise the importance of accurate bandwidth estimates at low centre frequencies. If the intercept of the ERB function is doubled from 25 to 50 Hz, the asymptotic EP for moderate frequency adaptation (Figure ) is essentially flat from the second harmonic up to 4 kHz where the spectrum rolls off. A critical band function of this form is still more selective at low frequencies than the Bark scale.
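The effect of doubling the intercept on filter bandwidth can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes a linear ERB function of the form ERB(f) = intercept + 0.108 f, where the slope is taken from Glasberg and Moore (1990) and is not stated in the text. For the Bark comparison it assumes the critical-bandwidth approximation of Zwicker and Terhardt (1980). The function names are hypothetical.

```python
def erb_hz(f_hz, intercept_hz=25.0, slope=0.108):
    """Equivalent rectangular bandwidth in Hz for a linear ERB function (assumed form)."""
    return intercept_hz + slope * f_hz

def bark_bandwidth_hz(f_hz):
    """Critical bandwidth in Hz; Zwicker & Terhardt (1980) approximation (assumed)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

f0 = 125.0  # fundamental of the 8-ms click train, which is also the harmonic spacing

# One row per harmonic: number, frequency, ERB with each intercept, Bark bandwidth
rows = []
for n in range(1, 9):
    f = n * f0
    rows.append((n, f, erb_hz(f, 25.0), erb_hz(f, 50.0), bark_bandwidth_hz(f)))

print("harm   freq  ERB(25)  ERB(50)   Bark  spacing/ERB(25)  spacing/ERB(50)")
for n, f, e25, e50, cb in rows:
    print(f"{n:4d} {f:6.0f} {e25:8.1f} {e50:8.1f} {cb:6.1f} {f0 / e25:16.2f} {f0 / e50:16.2f}")
```

Under these assumptions, the ratio of harmonic spacing to filter bandwidth drops most sharply for the lowest harmonics when the intercept is doubled, yet the widened filters remain narrower than the assumed Bark bandwidths at each of the first eight harmonics.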
Figures and legends for Chapter 2
Figure 2.1 Basilar partition motion for the vowel /ae/ after attenuation by 12 dB.
Figure 2.2 Basilar partition motion for the vowel /ae/ after amplification by 12 dB.
Figure 2.3 Basilar partition motion in response to the vowel /ae/ after logarithmic compression and half-wave rectification.
Figure 2.4 Schematic representation of the operation of adaptation in the time domain: a) test waves -- a click and a resonant impulse; b) gammatone filtered test waves; c) compressed version of the filtered test waves; d) adapted version of the compressed test waves.
Figure 2.5 Schematic representation of the operation of adaptation in the frequency domain: a) test wave -- two tones 24 dB apart in level; b) idealised spectrum of test wave; c) auditory spectrum of the test wave; d) excitation pattern for the test wave.
Figure 2.6 Neural activity pattern produced by the transduction module for the 8-ms click train when the level of adaptation is moderate in both the time and frequency dimensions.
Figure 2.7 Neural activity pattern produced by four cycles of the vowel /ae/ with moderate temporal adaptation and moderate frequency adaptation.
Figure 2.8 Neural activity pattern produced by four cycles of the vowel /ae/ with strong temporal adaptation and moderate frequency adaptation.
Figure 2.9 Neural activity pattern produced by four cycles of the vowel /ae/ with weak temporal adaptation and moderate frequency adaptation.
Figure 2.10 Neural activity pattern produced by four cycles of the vowel /ae/ with moderate temporal adaptation and strong frequency adaptation.
Figure 2.11 Neural activity pattern produced by four cycles of the vowel /ae/ with moderate temporal adaptation and weak frequency adaptation.
Figure 2.12 A sequence of Auditory Spectra for the first 64 ms of the 8-ms click train. Level rises monotonically with time at all frequencies above 2 ERB. The low harmonics become progressively more resolved because their peaks grow faster than the valleys between them.
Figure 2.13 A sequence of Auditory Spectra for a 64-ms segment of the vowel /ae/. Level rises monotonically with time at all frequencies, and the low harmonics become progressively more resolved as their peaks grow faster than the valleys between them. A similar effect results in the development of selectivity around the formants in the upper halves of the spectra.
Figure 2.14 A sequence of Excitation Patterns for the first 64 ms of the 8-ms click train. There are more resolved harmonics than in the corresponding auditory spectra and the peak-to-valley ratios are much greater.
Figure 2.15 A sequence of Excitation Patterns for a 64-ms segment of the vowel /ae/. Both the resolved harmonics in the lower portion of the patterns and the formants in the upper portion of the patterns have been sharpened by the adaptation process, but the degree of sharpening is greater at lower frequencies.