The Time-Interval Dimension of the Auditory Image
From CNBH Acoustic Scale Wiki
Roy Patterson
TEMPORAL INTEGRATION AND THE TIME-INTERVAL DIMENSION OF AUDITORY SPACE
When the input to the cochlea is a periodic sound, like a stationary vowel or musical note, the neural activity pattern (NAP) of the sound is a regular shape, or figure, that repeats across the NAP at intervals corresponding to the period of the sound. A brief segment of the NAP of the vowel in 'mat' is presented as an example in Figure 3.1. In general terms, the rate at which the figure repeats corresponds to the pitch of the sound, the volume of the figure corresponds to the loudness of the sound, and the shape of the figure corresponds to the timbre of the sound, or in this case, the vowel quality. The upper half of each vowel figure is composed of three right-pointing triangles which are the second, third and fourth formants of the vowel as they appear in the NAP. The formants reveal the form of resonances that existed in the vocal tract of the speaker at the moment the word was spoken. Thus, the shapes of the figures that appear in the NAP provide detailed information about the source of a sound and the state of the source when the sound was produced.
The NAP is a reasonably good representation of sounds as they occur at an early stage in the auditory system. But the NAP is not a good representation of our perceptions. The primary problem is that periodic sounds produce stable auditory images, whereas the NAP is anything but stable. Consider what the NAP represents and what we hear: firstly, with regard to loudness and activity levels in the NAP, and secondly, with regard to auditory discrimination and the rate of flow of the NAP.
With regard to neural activity and loudness, the individual pulses of the NAP are intended to represent the level of activity at a particular moment in a particular channel of the auditory system -- something like the aggregate firing rate of the neurons associated with one auditory filter as a function of time. The NAP of the vowel in 'mat' shows that the level of activity in the region of the third and fourth formants is close to its maximum when a glottal pulse occurs and close to zero a few milliseconds later in the latter half of the glottal cycle. Furthermore, there is a reliable dead period between adjacent NAP pulses even when the stimulus intensity is greatest. Despite these rapid variations in neural activity level, vowels do not give rise to the sensation of rapid loudness fluctuations. Periodic and quasi-periodic sounds give rise to our most stable auditory images; indeed, they have stable pitch and stable timbre as well as stable loudness.
With regard to rate of flow and auditory discrimination, the NAP is like a multi-channel chart recording of the information generated in the cochlea; in this analogy, the cochlea is on the right of the figure and the chart recording flows from right to left. Consider the contrast between the rate at which information flows from the cochlea and our ability to hear small features in the valleys between the large ridges. In Figure 3.1, four and a half cycles of the vowel sound occupy 100 mm of recorder paper. If one attempted to simulate our auditory image of the vowel using a NAP with the resolution shown in Figure 3.1 and moving the NAP at the rate it is produced, the chart recording would have to pass through Figure 3.1 at a rate of 3.2 meters per second! At this rate, the shape information in the NAP would be blurred beyond recognition and small features in the valleys between the ridges would be completely obscured. We know that listeners can process information in the valleys and that it is perceived as stable despite the rate of flow of the NAP. For example, when the odd harmonics of a click train are phase shifted as a group, small secondary peaks appear in the NAP half way through the period of the click train. These secondary peaks are audible as an increase in the tone-height of the sound (Patterson, 1987; 1990) but they would not be discriminable in a real-time display of the NAP. So there is a conflict between the rate at which the NAP flows and our ability to hear features in the sound associated with small details in the NAP. Thus, the NAP has information corresponding to some of the things we hear in our auditory images but the information has a different form than our perceptions indicating that it undergoes more processing before our initial experience of the sound.
The NAP also contains information that is not preserved in the auditory image and cannot be heard with any amount of practice. For example, the rightwards skew in the low-frequency channels of the NAP is a natural form of global phase shift introduced by the auditory filtering process. The degree of skew can be reduced or accentuated in specially constructed sounds. These phase modifications have a marked effect on the NAP but, over a relatively large range, they do not produce a discernible change in our perception of the sound (Patterson, 1987xxx).
This chapter describes how the NAP of a sound can be transformed into a better representation of our auditory image of that sound through a process of quantised temporal integration that is synchronised to the sound. The resulting simulation of the auditory image is stable when the sound is periodic despite the fact that the NAP information is entering the process at the rate it is being produced and despite the rapid level fluctuations in the NAP. The image changes when and if we hear a change in the sound and it tracks changes as they occur. Small periodic features in the NAP appear with the same resolution in the auditory image and in the same relative position. Global phase differences that we do not hear are removed as the simulated auditory image is assembled. This, it is argued, is a better representation of our auditory images.
The temporal integration problem: The fact that we hear a stable sound in response to oscillating neural activity shows that there is some form of temporal integration in the auditory system after the formation of the NAP and prior to our initial perception of a sound. Until recently, it was assumed that auditory temporal integration was a simple averaging process -- that each channel has a leaky integrator in the form of a one-stage lowpass filter (Jeffress, xxx; Viemeister and Green, xxx). If the decay constant of the leaky integrator is long with respect to the period of the sound, the average at the output of the integrator will be stable even though the input is fluctuating. The model is just like that used in vision to explain the stable visual perception produced by a fluorescent light. The light source flickers on and off at the rate of the power supply (50 or 60 Hz), but we perceive the light as a stable source because the visual system integrates light over a 70 ms period. If this simple temporal averaging were the basis of temporal integration in the auditory system, the set of long term averages from all of the leaky integrators could be assembled into a 'central spectrum' and this could form the basis of the stable perceptions associated with periodic sounds. Unfortunately, this simple model of temporal integration does not work for the auditory system. If the integration time of the temporal window is long enough to stabilise the output, it is so long as to smear out details of the NAP that we hear. This is the computational version of the problem of perceiving small features in a rapidly flowing NAP.
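The trade-off can be illustrated with a minimal sketch, assuming a one-pole lowpass filter as the leaky integrator and a toy single-channel NAP (a decaying burst of half-wave rectified 1-kHz ringing, repeated every 8 ms). The function names, the toy signal and the two time constants are illustrative assumptions and are not taken from the model described in this chapter.

```python
import math

def leaky_integrate(signal, dt_ms, tau_ms):
    """One-stage lowpass (leaky integrator) with time constant tau_ms."""
    alpha = 1.0 - math.exp(-dt_ms / tau_ms)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)          # move a fraction of the way toward the input
        out.append(y)
    return out

def toy_nap(period_ms=8.0, n_periods=40, dt_ms=0.1):
    """Toy one-channel NAP (an assumption): rectified 1 kHz ringing with a
    decaying envelope, restarted at the beginning of every period."""
    n_per = int(round(period_ms / dt_ms))
    one_period = []
    for i in range(n_per):
        t = i * dt_ms                                  # time in ms
        carrier = math.sin(2.0 * math.pi * 1.0 * t)    # 1 cycle per ms = 1 kHz
        one_period.append(max(0.0, carrier) * math.exp(-t / 2.0))
    return one_period * n_periods

def ripple(values):
    """Peak-to-peak fluctuation relative to the mean level."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean

dt = 0.1
nap = toy_nap(dt_ms=dt)
last_period = int(round(8.0 / dt))
for tau in (1.0, 50.0):   # short and long integration times, in ms
    out = leaky_integrate(nap, dt, tau)
    print("tau = %4.1f ms, ripple over last period = %.3f"
          % (tau, ripple(out[-last_period:])))
```

With the short time constant the output still follows the pulses and fluctuates strongly within each period; with the long time constant the output is nearly constant, but the within-period detail has been averaged away, which is the problem described above.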
This contrast between the speed of the NAP and auditory resolution suggests that the auditory system has a means of integrating one cycle of a periodic NAP with the ones before it and after it, while avoiding integrating within the individual cycles which would smear the details we appear to hear. That is, it appears to have a means of combining the ridge part of one cycle of the NAP with the ridge parts of preceding and succeeding cycles, and combining features in the valley region of one cycle of the NAP with their counterparts in preceding and succeeding valleys of the NAP. At first glance, this does not appear possible, and indeed, with the traditional temporal integration mechanism it is not possible. There is, however, a solution to the problem which is to quantise the temporal integration process and synchronise it to the larger pulses in the NAP (Patterson, 1989). This chapter describes Quantised Temporal Integration (QTI) as it applies to periodic sounds -- specifically, click trains and vowel sounds. It explains the boundary between transients and tonal sounds and reveals the range of vocal qualities available to professional actors and singers. When QTI is applied to aperiodic sounds, it distributes the activity across the image and increases the temporal variability relative to that observed in the NAP. The application of QTI to aperiodic sounds is the topic of Chapter 4. It shows that QTI enhances the difference between periodic and aperiodic sounds and suggests that auditory figure/ground separation is based on temporal regularity in the auditory image -- information that is deliberately discarded in traditional models of temporal integration and current speech recognition systems.
Quantised Temporal Integration
In the case of transients, periodic sounds and quasi-periodic sounds, the information about the source of the sound comes in packets. For a transient the packet is isolated in time; for periodic and quasi-periodic sounds, the duration of the packet is the period of the sound at that moment. In the case of vowels and musical notes, the shape information in the packet changes relatively slowly when compared with the rate of packets, and so the auditory figure in one packet is typically rather similar to the one that precedes it and the one that follows it in the NAP. These observations suggest that it would be useful to have a form of temporal integration that is sensitive to the discrete nature of the shape information and which aligns the packets before combining them, so that like parts of the auditory figures in successive packets are combined while the details of the figure are preserved. Examination of the individual channels of the NAPs of periodic sounds reveals that there is almost always one clearly defined maximum per period (Patterson, 1989). This property follows directly from the fact that the filters that define the channels have relatively narrow bandwidths. As a result, they limit the rate at which the envelope of the output of the filter can change and this rate is slow relative to the duration of the period of vowels and musical notes. This in turn suggests that the repeating pattern in the NAP of a periodic sound can be stabilised by strobing copies of the NAP into an image buffer at the instant the NAP passes through a local maximum. The strobe unit quantises the integration process and aligns the maxima of successive packets prior to integration. For periodic sounds, it matches the temporal integration interval to the period of the sound and, much like a stimulus-driven stroboscope, it produces a stable image of the repeating pattern in the NAP.
The details of auditory image construction are presented with the aid of two examples in the next two subsections: The first example illustrates QTI for a single frequency channel over the course of one cycle of a periodic sound; that is, it illustrates the microstructure of auditory image construction. The second example illustrates the construction of the full auditory image for a set of click trains with click rates from 1 - 128 per second. At the lower rates we hear this sound as isolated clicks (transients); at the higher rates we hear it as a musical note. Explaining the change in perception from isolated transients without pitch to a tonal sound with pitch is an important test for any potential auditory temporal integration mechanism. It also reveals the factors that determine the lower limit of pitch in a time-domain model of hearing and the boundary between transient sounds and tones.
QTI for an isolated channel: The construction of one channel of the auditory image over the course of one cycle of a click train is illustrated in Figure 3.2. The period of the sound is 8 ms and the centre frequency of the channel is 1.0 kHz, roughly half way up the auditory figure of the click train shown in Figure xxx. The subfigures on the left side of Figure 3.2 show the NAP at 2 ms intervals as it flows from the 1.0-kHz channel. The flow proceeds from right to left as if produced by a chart recorder. The bottom subfigure shows that the NAP returns to its original form after 8 ms, one complete cycle of the sound. (The right-to-left flow preserves the traditional left-to-right orientation of the impulse response of the system in time; that is, the sharp onset is on the left and the decaying tail is on the right.) It is assumed that there is a short term buffer in the auditory system to store the NAP produced in each frequency channel and that the subfigures show the contents of the buffer for the 1.0-kHz channel. In the computational model, this means one buffer per auditory filter. As the neural activity flows through the buffer its level decays linearly with time, fading away completely in 40-80 ms depending on the initial level.
The strobe unit is assumed to operate at a point several milliseconds into the buffer from the right hand edge; that is, from the time when the NAP is generated and enters the NAP buffer. For convenience, time is measured with respect to time at the strobe unit. When the unit identifies a local maximum, a copy of the NAP buffer is transferred to a static image buffer and summed point for point with any activity that is already there. The auditory image is the complete set of static image buffers for all of the frequency channels. The subfigures on the right side of Figure 3.2 show the state of this channel of the auditory image at 2 ms intervals. It is assumed that the auditory image decays exponentially in time with a half-life of about 15 ms. After the integration of the NAP on the left in the top row into the auditory image, the state is as shown on the right in the top row. As time progresses from this moment on to 6 ms (the next three subfigures) the auditory image simply decays to about 3/4 of its initial level. There is no horizontal motion in the auditory image due to the passage of time; that is what is meant by a static image buffer. After another two milliseconds the strobe unit encounters another local maximum and temporal integration occurs again. At this point the form of the function in the image buffer could change, as it does when the sound is not periodic. However, the NAP of the click train is periodic and its maxima are periodic, so the new NAP function has activity in the same places as its predecessor and all that changes in the auditory image is the overall level of the function.
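The integration step for a single channel can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the figures: the strobe times are supplied by hand rather than detected, the linear decay of the NAP buffer and the strobe lag are omitted, and the names `decay_factor` and `strobed_image` are hypothetical. Only the 15-ms half-life and the 8-ms period come from the text.

```python
HALF_LIFE_MS = 15.0   # half-life of the auditory image quoted in the text

def decay_factor(elapsed_ms, half_life_ms=HALF_LIFE_MS):
    """Fraction of the image level that remains after elapsed_ms."""
    return 0.5 ** (elapsed_ms / half_life_ms)

def strobed_image(nap, strobe_idx, dt_ms, width):
    """Build one channel of the image (illustrative sketch).

    nap        : NAP samples for one channel, oldest first
    strobe_idx : sample indices at which a local maximum was identified
    width      : number of points in the image buffer
    image[k] holds activity that occurred k samples before the strobe point,
    so the strobed maximum always lands at k = 0.
    """
    image = [0.0] * width
    last = None
    for s in strobe_idx:
        if last is not None:
            # the image only decays in place between strobes; it never shifts
            f = decay_factor((s - last) * dt_ms)
            image = [v * f for v in image]
        for k in range(width):
            if s - k >= 0:
                image[k] += nap[s - k]   # point-for-point summation
        last = s
    return image

# Toy NAP: a unit pulse every 8 ms (80 samples at 0.1 ms per sample).
dt = 0.1
nap = [1.0 if i % 80 == 0 else 0.0 for i in range(320)]
image = strobed_image(nap, [0, 80, 160, 240], dt, width=200)
print([k for k, v in enumerate(image) if v > 0.0])
print(round(decay_factor(6.0), 3))
```

Because every copy is aligned on its maximum, successive copies of a periodic NAP fall on the same image positions and only the level changes. The value of `decay_factor(6.0)` corresponds to the decay to about 3/4 of the initial level over 6 ms described above.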
When the sound first comes on, successive copies of the first impulse response in the NAP appear in successive positions in the auditory image as the first impulse response proceeds across the NAP; but whereas the activity is continuous in the NAP, it is discrete in the auditory image, with each successive copy of the impulse response appearing in place as each successive local maximum is identified. It is important to realise that the rate of flow in the NAP is very high; indeed, if Figure xxx were a dynamic window and the NAP were a multi-channel chart recording, the paper would flow past the window at the rate of xxx meters/sec. Thus, a dynamic display of the NAP would never show anything but a grey blur if the time scale is sufficiently expanded to reveal the temporal fine structure of the NAP. With regard to the auditory image, this means that, although the image fills from right to left, it happens so quickly that sounds with abrupt onsets appear out of the floor, in place, without apparent lateral motion. Similarly, when a sound ends abruptly, the auditory image decays away in place without lateral motion.
QTI and the boundary between transients and tones: The next example illustrates the construction of the full auditory image and explains the reasons why a click train is sometimes heard as a stream of isolated transients and sometimes as a tone. It is an extended example involving the auditory images we hear as the rate of a click train is increased, by doublings, from 1 click per second up to 256 clicks per second (cps). At lower click rates (below about 8 cps) we hear the sound as a sequence of isolated clicks; at high rates (above about 32 cps) we hear it as a musical note; in between there is a transition region where flutter and flicker dominate the perception. The example shows that the QTI process generates appropriate auditory images of isolated transients, fluttering tones and stable tones as the click rate increases, provided the memory limit of the NAP buffer is in the range 50-100 ms. This same parameter largely determines the maximum rate for perceiving isolated clicks, the width of the auditory image, and the lower limit of pitch in this model.
At low click rates, where the clicks in the train are separated by more than 100 ms, each click causes the cochlea simulation to produce an isolated 'impulse response' in every channel of the system and the response decays away before the next click occurs. The complete set of impulse responses is the impulse response of the system. As the component of the impulse response in each channel passes its strobe unit, the largest NAP pulse initiates temporal integration and the components of the multi-channel impulse response are transferred one after another from the NAP to the auditory image. A comparison of the NAP and image buffers at a moment during this process is shown in Figure 3.3. The time-interval scale has been expanded to emphasise the order of events. Maxima occur a few milliseconds earlier in the high frequency channels of the NAP and at the moment shown some of the lower channels have yet to reach their maxima and initiate integration. The auditory image of the click will be completed over the next 2 milliseconds. Time intervals in the auditory image are measured from maxima in the NAP and so the component impulse responses are aligned above the zero millisecond point on the time-interval axis. Presenting the process at this instant emphasises the point that QTI operates on a channel-by-channel basis and that it is this aspect of the transformation that aligns the channels in the auditory image and renders us insensitive to global phase changes (see Patterson, 1987xxx for a review). The complete auditory image of the impulse response of the system is the auditory figure of a click. Thus, in the case of isolated transients, the auditory image of the sound and the auditory figure of the sound are the same.
The auditory image decays exponentially in the auditory model and the half-life is about 15 ms, so the auditory figure of the click fades away very rapidly. Note, however, that it does not move laterally as it fades away; it simply recedes into the floor in position. Thus, for isolated transients, and trains of transients with low repetition rates, the activity in the auditory image is limited to the region of the strobe-point vertical, and the perception is one of an isolated transient or a regular stream of temporally isolated transients. The description applies to click trains with rates up to 8 cps; beyond about 4 cps, it becomes difficult to count the individual clicks but the perception is still one of isolated clicks and nothing else.
When the rate of the click train doubles from 8 to 16 cps, the clicks begin to fuse and a new, low, fluttering component is heard in the sound. In the model, this change occurs because the time between clicks decreases to the point where the next impulse response appears in the NAP buffer before its predecessor has completely faded away. As the components of the new impulse response flow past their respective strobe units and initiate integration, copies of the components of the preceding impulse response are integrated into the auditory image along with the copies of the components of the new impulse response. The NAP and auditory image of a 16 cps click train are shown in Figure 3.4 at the moment when temporal integration has just been completed in the lowest channels. In the NAP, the previous impulse response has moved across by 62.5 ms and has largely faded away. But enough remains to form a small click figure in the auditory image. The time between clicks is about four times the half-life of the auditory image and so there is very little trace of this click figure in the auditory image when the next click occurs. But when it occurs, another copy of the click figure appears at the same position in the auditory image. It is the appearance of this extra click figure at a fixed position in the image that corresponds to the new component in the perception. The fact that the time between clicks is long with respect to the decay rate of the auditory image is what causes the strong fluttering perception associated with click rates in the region around 16 cps; the system notes that the level of the extra click figure is not fixed.
With regard to the boundary of transient sounds, the auditory image of an isolated transient is one solitary auditory figure at the strobe point; the rest of the image is empty. In the current model this occurs when the time between clicks is more than 80 ms because this is the limit of the NAP-buffer memory. This same value sets the limit on the width of the auditory image since a figure has to be in the NAP at strobe time to enter the auditory image. Soft sounds fall out of the NAP slightly earlier than loud sounds, and so level has an effect on the transient boundary and on auditory image width, but the effect is small because of the compression of level at the output of the filterbank.
As the pulse rate doubles from 16 to 32 cps, the distance between impulse responses in the NAP halves, an additional click figure appears in the auditory image, and the rate of integration from the NAP into the image doubles. At this point most listeners hear a low pitch in the sound if it is presented at a reasonably high level. (For reference, the lowest note on the piano keyboard is intended to be 27.5 cps.) In general, for each successive doubling of the click rate beyond 16 cps, the number of click figures in the region away from the strobe point doubles, and the integration rate doubles until the rate of integration reaches a limit probably in the range 100 - 200 cps. The auditory image for the 64 cps click train is shown in Figure 3.5 and it has the same level scale as Figure 3.4. The leftmost click figure in Figure 3.5 occupies the same position as the one on the left in Figure 3.4. It is a little larger than its counterpart in Figure 3.4 because it contains a remnant of activity from previous clicks. xxx check this argument xxx The main difference, however, is the shape of the lower half of the click figure. In these channels the ringing of the auditory filters is so long that one click has not completely died away when the next arrives. The result is a small interaction which modifies the impulse responses in the individual channels slightly such that the peak level moves to the second pulse in the set that defines the impulse response. The QTI mechanism strobes on the largest peak in the set and positions that peak at the zero point in the auditory image, so the auditory figure develops a ridge of pulses to the left of the main vertical and loses a ridge of pulses on the right of the auditory figure. xxx check this argument xxx
With regard to the boundary of tonal sounds, the auditory image of a tonal sound is a set of regularly spaced auditory figures that decrease in size from right to left across the image. If the maximum width of the auditory image is about 80 ms, and if tonal sounds are characterised by the presence of two or more evenly spaced auditory figures in the region away from the strobe point, then QTI predicts the lower limit for pitch will be around 25 cps (two periods of 25 cps is 80 ms). This is in good agreement with the data of xxx and xxx (xxxx). The limit will show some sensitivity to the level of the sound, but here, as with the bound on isolated transients, the memory limit of the NAP will be the dominant factor.
In the region between the upper bound on the perception of isolated transients, say 12 cps, and the lower bound on the perception of pitch, say 25 cps, the perception of the click train is dominated by flutter. As the click rate rises, the flutter component of the perception subsides, presumably because the image is being refreshed more frequently so the level of the image rises and the relative size of the increase at the time of integration decreases.
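The three perceptual regions can be summarised with a small worked sketch. It assumes, as a simplification, that the only thing that matters is how many whole periods fit inside the 80-ms image width; the effect of level on the boundaries is ignored, and the function names and the category labels are hypothetical.

```python
def figures_away_from_strobe(rate_cps, width_ms=80.0):
    """Number of earlier click figures that still fit in the image,
    assuming one figure per whole period within the image width."""
    period_ms = 1000.0 / rate_cps
    return int(width_ms // period_ms)

def predicted_percept(rate_cps, width_ms=80.0):
    """Rough category implied by the figure count (illustrative only)."""
    n = figures_away_from_strobe(rate_cps, width_ms)
    if n == 0:
        return "transient"   # solitary figure at the strobe point
    if n == 1:
        return "flutter"     # one extra figure, level not fixed
    return "tone"            # two or more evenly spaced figures

for rate in (1, 2, 4, 8, 16, 32, 64, 128):
    print(rate, figures_away_from_strobe(rate), predicted_percept(rate))
```

With an 80-ms width the count first reaches one at 12.5 cps and first reaches two at 25 cps, which are the two bounds quoted above.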
In summary, by quantising the temporal integration process and synchronising it to the NAP, it is possible to stabilise the pitch and timbre information of periodic sounds; that is, to fix the position of the shape information and combine it with successive versions in a representation that preserves the resolution of the NAP. When the period of the sound is short, say less than 8 ms, the strobe rate is high, and in this case, the loudness information is stabilised as well as the pitch and timbre information. When the period of the sound is long, say more than 16 ms, the level fluctuates because the image has time to decay significantly between strobe pulses, and in this case the sound is heard to flutter. It is interesting to note that flutter is perceived in hearing across a region of repetition rates similar to those where flicker is perceived in vision (10 xxx - 50 xxx cps).
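The dependence of level stability on period can be made concrete with a short calculation. It is a sketch under two assumptions that are not stated in the text: each strobe adds the same unit contribution at the strobe point, and the image has reached a steady state. If f is the fraction of the image that survives one period, the level just after a strobe is 1/(1 - f) and the level just before it is f/(1 - f), so the relative jump at each integration is 1 - f. The function name is hypothetical; the 15-ms half-life is the value quoted earlier.

```python
def relative_jump(period_ms, half_life_ms=15.0):
    """Fractional rise in the strobe-point level at each integration,
    at steady state, assuming a unit contribution per strobe."""
    f = 0.5 ** (period_ms / half_life_ms)   # fraction surviving one period
    return 1.0 - f

for period in (4.0, 8.0, 16.0, 31.25, 62.5):
    print("period %6.2f ms -> relative jump %.2f" % (period, relative_jump(period)))
```

The jump is small when the period is short relative to the half-life and approaches the full level when the period is several half-lives long, which is the sense in which the level is stabilised at high rates and fluctuates at low rates.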
Auditory figure components: Implets and Sinlets
The components of auditory figures that arise in individual channels of the auditory image are restricted in the forms they can take. This is a direct consequence of the narrowband filtering imposed by the cochlea and the fact that sounds which give rise to auditory figures are either transients or tones. There are two figure components that are especially common and they are the response of an individual channel to the sound that occupies the minimum space on the time dimension, an acoustic click, and the sound that occupies the minimum space on the frequency dimension, a sinusoidal tone. The figure components are introduced here because they are particularly well exemplified in the auditory figure of the click train; indeed, it is composed of nothing other than these two figure components and blends thereof.
Examples of the two kinds of figure component are presented in Figure 3.6. The upper panel (Figure 3.6a) shows the sequences of figure components that arise in channels centred at 2000 and 4000 Hz. The auditory filter responses for the same stimulus were presented in Figure 1.4. The figure components are obviously sets of filter impulse responses as they appear in the auditory image after feature enhancement, alignment and temporal integration. Each of the figure components in Figure 3.6a is a set of regularly spaced pulses; the time between pulses is the period of the centre frequency of the channel; the time between the peaks of adjacent figure components is the period of the sound. Across channels, figure components of this form differ chiefly in the pulse spacing and the width of the figure component, or the rate at which the pulses drop off after the peak. These properties are determined by the action of the auditory filters and we could expect a central recognition system to anticipate these progressive differences. Since a vertically-aligned set of these figure components indicates that the system has encountered an acoustic impulse, the figure components will be referred to as 'implets', meaning components of a response to an impulse.
The lower panel (Figure 3.6b) shows the figure components that arise in channels centred at 250 and 500 Hz. They are obviously the remnants of a sinusoidal tone and a slightly modulated sinusoidal tone, respectively, as they appear in the auditory image after feature enhancement, alignment and temporal integration. The auditory filter responses were presented in Figure 1.5. Each of the channels in Figure 3.6b is a set of regularly spaced pulses, but in this case, the time between pulses is the period of an isolated harmonic of the click train (the second and fourth harmonics, respectively), rather than the centre frequency of the channel. The harmonic frequency dominates because the low-frequency filters are narrow and so their impulse response is long with respect to the period of the click train (8 ms). Across channels, in the region dominated by a single harmonic, the figure components differ chiefly in the overall level of the activity in the channels, and between harmonics there are empty channels. The pulse spacing is fixed both within and across all the channels associated with one harmonic. Since a vertically-aligned set of these figure components indicates that the system is being driven, locally, by a sinusoid, these figure components will be referred to as 'sinlets', meaning components of the response to a sinusoid. The sinlets differ conceptually from the implets in that they are continuous across the auditory figures in the image. However, these properties are also determined by the action of the auditory filters and we could expect a central recognition system to anticipate these figure properties as well.
As the centre frequency rises above 500 Hz and the auditory filter broadens, the modulation on the sinlet increases and becomes progressively more asymmetric. As the centre frequency decreases below 2000 Hz and the auditory filter narrows, the tail of the implet grows and the slope becomes shallower. Eventually, the tail duration exceeds the period of the sound and there is a region where the figure component could be considered either an implet with a supra-period tail or a highly modulated sinlet. The auditory figures in the remainder of the monograph are largely composed of implets and sinlets. For example, the upper formants of the vowels in Figure 0.2a are sets of implets which have steep slopes on the flanks of the formant and shallower slopes in the centre. The first formant is composed of sinlets; the temporal harmonic relationships reveal which sinlets come from the same source.
The strobe mechanism
The strobe mechanism is relatively simple. There is a strobe unit assigned to each channel and it monitors the NAP as it flows past looking for local maxima in the stream of NAP pulses. When a local maximum is encountered the entire contents of the NAP of that channel are transferred to the auditory image and added point for point with the current contents of the image. The peak of the NAP pulse at the local maximum is assigned a time-interval value of zero ms; that is, times in the NAP function are converted to time intervals in the image by subtracting the time of the local maximum. Thus, the problem of when to initiate temporal integration reduces to one of determining the positions of the local maxima in the individual NAP functions. This is accomplished as follows: Each strobe unit maintains an adaptive threshold for each channel and it is initially set to zero. When a NAP pulse exceeds threshold the time and height of the peak of the pulse are noted, a strobe-lag clock is set to n milliseconds, and the threshold is set to the peak height plus a proportion of the peak height whose value is described in the next paragraph. Monitoring continues for a further n milliseconds to determine whether another pulse exceeds threshold during this lag, and if so, whether the new pulse is larger than its predecessor. If such a pulse is encountered, its time and height replace those of the previous candidate for the local maximum and the strobe-lag clock is reset to n milliseconds. When the clock times out, integration is initiated with the current peak time and height as the local maximum.
The strobe threshold decays linearly with time and is expressed as a percentage decay per millisecond. The linear decay provides the best match to the curvature of the decay of the filter's impulse response as it appears in the NAP after logarithmic compression. An exponential decay gives undue emphasis to pulses in the middle of a decaying impulse response relative to those later on. (Why is there a bunt at all? How does this affect the Figure discontinuity seen in Figure 3.5?)
The duration of the strobe-lag is shown in the auditory image as the largest negative time-interval in the image. In effect, this means that the strobe mechanism monitors the NAP at a point n milliseconds inside the NAP buffer. In the monograph the strobe lag is typically 5 ms. ( xxx why? xxx) In the auditory image, longer time intervals are plotted on the left of the auditory image and shorter time intervals on the right to ensure that the left-to-right temporal orientation of figure information in the basilar membrane motion and the NAP is preserved in the auditory image.
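The mapping from NAP time to time interval can be illustrated with a short Python sketch. It is an illustration only: the function names and the list-based image are hypothetical, and the sign convention (time interval = time of the local maximum minus NAP time, so that samples arriving during the strobe lag receive negative intervals) is an assumption inferred from the description of the negative time intervals above.

```python
def integrate_at_strobe(image, nap, peak_index, lag_samples):
    """Add the NAP to the image point for point, aligned on the strobe peak.

    image[k] holds the time interval (k - lag_samples) samples, so the peak
    lands at k == lag_samples (interval zero) and the samples that arrived
    during the strobe lag occupy the negative intervals below it.
    """
    for k in range(len(image)):
        t = peak_index + lag_samples - k   # NAP sample for this time interval
        if 0 <= t < len(nap):
            image[k] += nap[t]
    return image


def interval_axis_ms(n_points, lag_samples, fs_hz):
    """Time-interval value in ms for each image point (assumed sign convention)."""
    return [(k - lag_samples) * 1000.0 / fs_hz for k in range(n_points)]


# Toy NAP at 1 kHz: a pulse of height 1 at sample 2 and the strobe peak
# (height 2) at sample 7, with a strobe lag of 2 samples.
nap = [0, 0, 1, 0, 0, 0, 0, 2, 0, 0]
image = integrate_at_strobe([0.0] * 8, nap, peak_index=7, lag_samples=2)
axis = interval_axis_ms(8, lag_samples=2, fs_hz=1000.0)
```

Here the strobe peak appears at a time interval of 0 ms and the earlier pulse at 5 ms. To follow the plotting convention described above, the axis would be drawn reversed, with the longest interval on the left and the negative intervals on the right.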
[The question arose at this point as to whether the bunt does something that is important enough to warrant its description and possibly its existence in the routine. xxx]
Whenever a pulse exceeds threshold, the threshold is reset to the height of the peak of the pulse plus a small proportion of that height (the bunt). The threshold then decays linearly with time until another pulse exceeds it. The bunt value is determined by the height of the most recent peak and the decay rate of the strobe threshold; the bunt increases with the decay rate.
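The time course of the threshold after a reset can be written out explicitly. The Python sketch below is a hedged illustration: it assumes that the percentage decay per millisecond is taken relative to the reset value, and it treats the bunt proportion as a free parameter rather than deriving it from the decay rate, since the exact relation is not given here. The function names and the example values are hypothetical.

```python
def strobe_threshold(elapsed_ms, peak_height, bunt, decay_pct_per_ms):
    """Threshold value elapsed_ms after a pulse of peak_height reset it.

    Assumes the threshold restarts at peak_height * (1 + bunt) and falls
    linearly by decay_pct_per_ms percent of that starting value per ms.
    """
    start = peak_height * (1.0 + bunt)
    return max(0.0, start * (1.0 - decay_pct_per_ms / 100.0 * elapsed_ms))


def ms_until_back_at_peak(bunt, decay_pct_per_ms):
    """Time for the threshold to fall from peak-plus-bunt back to peak height."""
    return (bunt / (1.0 + bunt)) * 100.0 / decay_pct_per_ms


# Example with assumed values: bunt of 10 percent, decay of 5 percent per ms.
just_after = strobe_threshold(0.0, 1.0, 0.1, 5.0)    # peak height plus bunt
half_way = strobe_threshold(10.0, 1.0, 0.1, 5.0)     # half of the reset value
guard_ms = ms_until_back_at_peak(0.1, 5.0)           # a little under 2 ms
```

Under these assumptions, a pulse of the same height as its predecessor can exceed threshold only after the guard time has elapsed, and the threshold reaches zero 100 / decay_pct_per_ms milliseconds after the reset.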
General-purpose pitch mechanisms based on peak picking are notoriously difficult to design (Rabiner et al. 1976?) and the strobe mechanism just described would not work especially well on an arbitrary acoustic waveform. The reason that this simple local-maximum finder is sufficient for auditory temporal integration is that NAP functions are highly constrained. The microstructure invariably reveals a pulse that rises smoothly from zero to a peak and returns smoothly to zero, where it stays for more than half of the period of the centre frequency of that channel. On the longer time scale, the amplitude of successive peaks changes only relatively slowly with respect to time. As a result, for periodic sounds there tends to be one clear maximum per period in all but the lowest channels, where there is an integer number of maxima per period. The simplicity of the NAP functions follows from the fact that the acoustic waveform has passed through a narrow-band filter and so has a limited number of degrees of freedom. In all but the highest frequency channels, the output of the auditory filter resembles a modulated sine wave whose frequency is near the centre frequency of the filter. Thus the NAP is largely restricted to a set of peaks which are modified versions of the positive halves of a sine wave, and the remaining degrees of freedom appear as relatively slow changes in peak amplitude and relatively small changes in peak time (or phase).
? Summary ?
The auditory image model is now sufficiently complete to shift the emphasis of the monograph away from image construction per se and towards the analysis of complex sounds like those encountered in speech. There is one further transform that is important for understanding pitch and octave perception, and therefore for understanding harmony, but it will be introduced later with the discussion of musical sounds. The discussion of complex sounds begins in the next Chapter with vowel sounds, since they are familiar and the reader carries the apparatus for generating acoustic examples to accompany the discussion.
Figure Legends for Chapter 3
Figure 3.1. The NAP of the vowel in 'mat'.
Figure 3.2. The auditory image over the course of one cycle of a click train. The centre frequency of the channel is 1.0 kHz.
Figure 3.3. The NAP and auditory image of a click during the moment of temporal integration.
Figure 3.4. The NAP and auditory image of a 16 cps click train just after temporal integration is completed.
Figure 3.5. The NAP and auditory image of a 64 cps click train just after temporal integration is completed.
Figure 3.6. Figure components from the auditory image of the click train: a) Implets from channels centred at 2000 and 4000 Hz and b) sinlets from channels centred at 250 and 500 Hz.