The Auditory Image and Auditory Figures
Auditory perceptions are assembled in the brain from sounds entering the ear canal, in conjunction with current context and information from memory. It is not possible to record or measure perceptions directly, so any description of our perceptions must involve an explicit, or implicit, model of how perceptions are produced by a sensory system in conjunction with the brain. This chapter focuses on the initial auditory images that we hear when presented with sounds, and it presents a computational model of auditory perception that shows how the auditory image might be constructed in the brain. The auditory image can be thought of as a window into the space of auditory perception. The Auditory Image Model (AIM) allows us to illustrate the auditory image concept and the auditory figures and events that we experience in response to communication sounds.
The Auditory Image: a window into the space of auditory perception
To account for everyday experience, it is assumed that sensory organs, and the neural mechanisms that process sensory data, together construct internal, mental models of objects as they occur in the world around us. That is, the visual system constructs the visual part of the object from the light the object reflects and the auditory system constructs the auditory part of the object from the sound the object emits, and these two aspects of the mental object are combined with any tactile and/or olfactory information, to produce our experience of an external object. The task of the auditory neuroscientist is to characterize the auditory part of this object modeling process. This subsection sets out some of the assumptions and terms required to discuss the internal representations of sound. The focus of the perceptual part of the book is one particular internal representation referred to as the auditory image. The model of how these auditory images are constructed by the brain is referred to as AIM (the auditory image model), and the same acronym is used to refer both to the conceptual model of what takes place in the brain and to a computational version of this conceptual model, which is used to illustrate how we think the auditory system might perform the computations required to create our initial auditory images of sounds.
We assume that the sub-cortical auditory system creates a perceptual space, in which an initial auditory image of a sound is assembled by the cochlea and mid-brain using largely data-driven processes. The auditory image and the space it occupies are analogous to the visual image and the space that appears when you open your eyes after being asleep. If the sound arriving at the ears is a noise, the auditory image is filled with activity, but it lacks organization, the fine-structure is continually fluctuating, and the space is not well defined. If the sound has a pulse-resonance form, an auditory figure appears in the auditory image with an elaborate structure that reflects the phase-locked neural firing pattern produced by the sound in the cochlea. Extended segments of sound, like syllables or musical notes, cause auditory figures to emerge, evolve, and decay in what might be referred to as auditory events, and these events characterize the acoustic output of the external source. All of the processing up to the level of auditory figures and events can proceed without the need for the top-down processing associated with context or attention. For example, if we are presented with the call of an animal that we have never encountered before, the early stages of auditory processing will produce an auditory experience with the form of an event, even though we, as listeners, might be puzzled about how to interpret the event. It is also assumed that auditory figures and events are produced in response to sounds when we are asleep, and that it is a subsequent, more central process that evaluates the events and decides whether to wake in response to the sound.
When we are alert, the brain may interpret an auditory event in conjunction with events occurring in other sensory modalities at the same time, and in conjunction with contextual information and memories, and so assign meaning to a sequence of perceptual events. If the experience is largely auditory, as with a melody, then the event with its meaning might be regarded as an auditory object, that is, the auditory part of the perceptual model of the external object that was the source of the sound. A detailed discussion of this view of hearing and perception is presented in An introduction to auditory objects, events, figures, images and scenes. Examples of the main concepts are presented in the remainder of this introduction.
The Auditory Image
The components of everyday sounds fall naturally into three broad categories: tones, like the hoot of an owl or the toot of a horn; noises, like wind in the trees or the roar of a jet engine; and transients, like the crack of a breaking branch, the clip/clop of horse's hooves, or the clunk of a car door closing. The initial perceptions associated with the mental images aroused by these examples are what is meant by the term "auditory image". The auditory image is the first internal representation of a sound of which we can be aware. We cannot direct our brain to determine whether the eardrum is vibrating at 200 Hz, or whether the point on the basilar membrane that responds specifically to acoustic energy at 200 Hz is currently active, or whether neurons in sub-cortical nuclei of the auditory pathway that are normally involved in processing activity from the 200-Hz part of the basilar membrane are currently active. What we experience is the end product of a sequence of physiological processes that deals with all of the acoustic energy arriving at the ears in the recent past. The summary representation of recent acoustic input of which we are first aware is the auditory image. This summary image is also presumed to be the basis of all subsequent perceptual processing. There are no other auditory channels for secondary acoustic information that might be folded into our experience of the sound at a later point. The auditory image is the only auditory input to the central auditory processor that interprets incoming information in the light of the current context and elements of auditory memory.
The three panels of Figure 2.1.1 present simulated auditory images of a regular click train with an 8-ms period (a), a white noise (b), and an isolated acoustic pulse (c), and each is, arguably, the simplest broadband sound in its respective category (tone, noise, or transient). The click train sounds like a buzzy tone with a pitch of 125 Hz. It generates an auditory image with a sharply-defined vertical structure that repeats at regular intervals across the image. Moreover, the temporal fine-structure within the structure is highly regular. These are the characteristics of communication tones as they appear in the simulated auditory image. The white noise produces a "sshhh" perception. It generates activity throughout the auditory image but there are no stable structures and there is no repetition of the occasional transient feature that does arise in the activity. Moreover, the micro-structure of the noise image is highly irregular, except near the 0-ms vertical towards the right-hand edge of the image. These are the characteristics of noisy sounds in the auditory image. The acoustic pulse produces a perceptual "click." It generates an auditory image that is blank except for one well-defined structure attached to the 0-ms vertical at the right-hand side of the image. In summary, the attributes illustrated in the three panels of Figure 2.1.1 are the basic, distinguishing characteristics of the auditory images of tones, noises and transients.
Many of the auditory images we experience in everyday life are fairly simple combinations of tones, noises and transients. For example, from the auditory perspective, the syllables of speech are communication tones (referred to as vowels) that are made more distinctive by attaching speech noises (referred to as fricative consonants), speech transients (referred to as plosive consonants), and mini-tones (referred to as sonorant consonants). The three panels of Figure 2.1.2 present simulated auditory images of the speech tone /ae/, the speech noise /s/ and the speech transient /k/ (letters set off by slashes indicate the sound, or phoneme, associated with the letter, or combination of letters). They are the auditory images for the phonemes of the word "ask". The banding of the activity in the image of the vowel (Figure 2.1.2a) makes it distinguishable from the uniform activity in the image of the click train (Figure 2.1.1a), and these differences play an important part in distinguishing the /ae/ as a speech sound. Nevertheless, the auditory images of the click train and the vowel are similar insofar as both contain a structure that repeats across the width of the auditory image. The structure is distinctive and, in this example, the spacing of the structures is the same for the two images. Similarly, the distribution of activity for the fricative consonant, /s/ (Figure 2.1.2b), is restricted with respect to the activity of the white noise (Figure 2.1.1b), but the images are similar insofar as there are no stable structures in the image, the fine-structure is irregular, and the activity spans the full width of the image. Finally, the distribution of activity for the plosive consonant, /k/ (Figure 2.1.2c), is restricted with respect to the activity of the acoustic pulse (Figure 2.1.1c), but the images are similar insofar as they are largely empty and the activity that does appear is attached to the 0-ms vertical of the image and has regular fine-structure.
Music is also largely composed of tones, noises and transients. Musical instruments from the brass, string and woodwind families produce sustained communication tones, and sequences of these tones are used to convey the melody information of music. The auditory images of the individual notes bear an obvious relationship to the auditory images of vowels and click trains. For example, the auditory image of a note from the mid-range of a bassoon is like a cross between the images produced by the click train and the vowel of “mask” insofar as the vertical distribution of activity in the bassoon image is more extensive than that in the vowel image but less extensive than that of the click train image. A segment of a rapid roll on a snare drum is a musical noise, whose auditory image is similar to that of the white noise. A single tap on the percussionist’s wood block is a good example of a musical transient. It produces a slightly rounded version of the transient structure at the 0-ms time interval.
This Part of the book explains how auditory images might be constructed by the auditory system and why the three different classes of sounds produce such different auditory images. It is argued that auditory image construction evolved to segregate sounds into the three main categories automatically, and to present sounds in a form that is suitable for analysis by the auditory part of the brain.
Auditory Figures
The static structures that communication tones produce in the auditory image are referred to as auditory figures because they are distinctive and they stand out like figures against the fluctuating activity of noise, which appears like an undistinguished background in the auditory image. The click train and the vowel in ‘mask’ are good examples of communication tones that produce prominent perceptions with strong pitches and distinctive timbres. The auditory figures that dominate the simulated auditory images of these sounds (Figures 2.1.1a and 2.1.2a) are elaborate structures, and their similarities and differences provide a basis for understanding the traditional attributes of pitch and sound quality. The auditory images in Figures 2.1.1a and 2.1.2a are similar inasmuch as they are both composed of an auditory figure that repeats at regular intervals across the image, and the interval is roughly the same for the two sounds. The presence of pattern on a large scale in the auditory image is characteristic of tonal sounds. The horizontal spacing of the repeating figure that forms the pattern is the period of the sound and it corresponds (inversely) to the pitch that we perceive; the buzzy tone and the vowel have nearly the same figure spacing and nearly the same pitch. When two or more tonal sounds are played together, the patterns of repeating auditory figures interact to produce compound patterns that explain the basics of harmony in music, that is, the preference for the musical intervals known as octaves, fifths and thirds. In this model, then, musical consonance is determined primarily by properties of peripheral auditory processing and only secondarily by cultural preferences.
The auditory images of the click train and vowel are different inasmuch as the auditory figure of the vowel has a more complex shape and it has a rougher texture. The shape and texture of the simulated auditory image capture much of the character of the auditory figure that we hear; that is, the sound quality or timbre of the sound. The banding of the activity in the vowel figure is largely determined by the centre frequencies and bandwidths of resonances in the vocal tract of the speaker at the moment of speaking. In the absence of these vocal resonances, a stream of glottal pulses generates an auditory figure much like those produced by a click train, only somewhat more rounded at the top and bottom. The resonances are referred to as ‘formants’ in speech research and the shape that they collectively impart to the figure identifies the vowel in large part. The texture of the auditory figure is primarily determined by the degree of periodicity of the source in the individual channels of the image. The texture of the formants in the auditory figure -- the degree of definition in the simulated image -- provides information concerning whether the speaker has a breathy voice, whether the consonants in the syllable are voiced, and whether the syllable is stressed or not.
Transient sounds also produce auditory figures. They do not last long and the auditory system often misses the details of the shape and texture when an isolated transient occurs without warning. But if a transient is repeated, as when a horse walks slowly down a cobbled street, the individual shoe figures become sufficiently distinct to tell us, for example, if one of the shoes is loose or missing.
Noises do not produce auditory figures. The auditory images of noises have a rough texture and there are no stable patterns. A noise may not be entirely random, in either the temporal or the spectral dimension, and still be a noise. In such cases, we hear that the noise is different from a continuous white noise; for example, a noise may have more hiss than 'sssshhh', indicating the presence of relatively more high-frequency energy than low-frequency energy; or it may have temporal instabilities that we hear as rasping, whooshing, or motor-boating. But the noise does not form auditory figures in the sense of structures with a stable, internal fine-structure; the auditory image of a noise is constantly changing and the texture is rough rather than smooth.
The Construction of the Auditory Image
It is assumed that the purpose of the peripheral auditory system, in mammals at least, is to construct a representation of sound in preparation for pattern analysis and source identification by more central auditory processes. Specifically, AIM is intended to explain how sound is converted into our initial auditory image of that sound. For example, when a hummingbird hovers behind a nearby bush or a sackbut plays a note in the next room, we experience an auditory image of the sound even if we have never heard a hummingbird or a sackbut before, and even if we know nothing about hummingbirds or sackbuts. This is the part of the hearing process that AIM is intended to simulate. The pattern analysis and source identification performed by more central auditory processes, in conjunction with contextual and semantic knowledge concerning birds and music, are currently beyond the scope of AIM.
The important processes in AIM are the initial spectral and temporal analyses performed by the auditory preprocessor: The cochlea performs a frequency analysis of incoming sound and in so doing sets up the tonotopic dimension of auditory perception. Then, the mid-brain analyses the time-interval information in the individual frequency channels to extract information about the location of the source and to create the time-interval dimension of auditory perception. Generally speaking, any model where the internal representation of sound has dimensions of tonotopy and time-interval can be considered to be an auditory image model, and the representation it produces is an auditory image. This includes the autocorrelation models of Licklider (1951), Slaney and Lyon (1990), Meddis and Hewitt (1991) and Yost et al. (1996), as well as models that specifically use the phrase "auditory image", for example, those of Patterson et al. (1992, 1995, 1996).
Conceptually, the construction of the initial auditory image occurs in three stages, each of which imparts a dimension to the space of auditory perception. The first stage is frequency analysis, which is applied to the incoming sound in the cochlea by the basilar membrane and outer hair cells; it produces the tonotopic dimension of the auditory image shown vertically in Figures 2.1.1 and 2.1.2. This stage includes neural transduction of the basilar-membrane motion by the inner hair cells along the length of the basilar membrane, and the sharpening of the neural activity pattern (NAP) in the cochlear nucleus (CN). The second stage is laterality analysis performed on the neural patterns in pairs, one frequency-matched channel from each CN; this analysis is performed in the superior olivary complex (SOC) and it determines where the auditory image is located relative to the head of the listener. The third stage is time-interval analysis, which involves the measurement of time intervals within the individual channels of the NAP, and the construction of a form of dynamic interval histogram, one for each channel of the NAP. The array of dynamic interval histograms with its location is the auditory image. It is postulated that the time-interval histograms are assembled in the inferior colliculus (IC), but at this point, this is just a logical assumption.
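The three stages can be pictured as a simple processing chain. The sketch below is a minimal, monaural stand-in for that chain (the laterality stage is omitted): the frequency analysis is approximated with crude FFT-based band-pass filters rather than a cochlear model, neural transduction with rectification and compression, and the time-interval analysis with an autocorrelation-style interval histogram per channel. The function names and parameters are illustrative assumptions, not the AIM software's API.

```python
import numpy as np

# Minimal, monaural sketch of the three conceptual stages (laterality omitted).
# Names and simplifications are illustrative, not the AIM software's API.

def frequency_analysis(sound, fs, n_channels=75):
    """Stand-in for cochlear filtering: crude FFT-based band-pass filters,
    one output waveform per channel, on a logarithmic frequency axis."""
    spectrum = np.fft.rfft(sound)
    freqs = np.fft.rfftfreq(len(sound), 1.0 / fs)
    edges = np.logspace(np.log10(100.0), np.log10(6000.0), n_channels + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.where((freqs >= lo) & (freqs < hi), spectrum, 0.0)
        bands.append(np.fft.irfft(band, len(sound)))
    return np.array(bands)                      # (n_channels, n_samples)

def neural_transduction(bmm):
    """Stand-in for transduction: half-wave rectification plus compression."""
    return np.sqrt(np.maximum(bmm, 0.0))

def interval_analysis(nap, fs, max_interval=0.035):
    """Stand-in for time-interval analysis: an autocorrelation-style
    interval histogram for each channel of the NAP."""
    n_lags = int(max_interval * fs)
    n = nap.shape[1]
    rows = [np.correlate(ch, ch, mode='full')[n - 1:n - 1 + n_lags] for ch in nap]
    return np.array(rows)                       # (n_channels, n_lags)

fs = 16000
n_samples = int(0.08 * fs)
click_train = np.zeros(n_samples)
click_train[::int(0.008 * fs)] = 1.0            # 8-ms period, i.e. 125 Hz
image = interval_analysis(neural_transduction(frequency_analysis(click_train, fs)), fs)
print(image.shape)   # one row per frequency channel, one column per time interval
```

The rest of this chapter, and Chapters 2.2 and 2.4, describe the mechanisms AIM actually uses for these stages: a gammatone or gammachirp filterbank for the cochlear frequency analysis, and strobed temporal integration for the time-interval analysis.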
For purposes of discussion, it is argued that the auditory image produced by the preprocessor is the first internal representation of a sound of which we can be aware, and it is the representation of sound that forms the basis of all subsequent processing in the auditory brain. Auditory models that emphasize this particular internal representation of sound, and its pivotal position in the analysis of sound, are commonly referred to as Auditory Image Models.
The Tonotopic Dimension of the Auditory Image
The first stage of image construction involves the spectral analysis performed on a sound entering the ears at a given point in time. A schematic overview of this analysis, and the arrival-time analysis that follows it, is presented in Figure 2.1.3. The example is intended to illustrate how the auditory system deals with two simultaneous sources, one 40 degrees to the right of the listener and the other 20 degrees to the left of the listener. The source on the right contains only mid-frequency components; the source on the left contains low and high frequency components but no mid-frequencies. The icons (1) at the top of the figure are intended to represent the combined sound waves from the two sources entering the left and right outer ears, or pinnae. The outer ear and the middle ear concentrate sounds and increase our sensitivity to quiet sounds, but they do not analyse the sound or change the dimensionality of the representation as it passes through these structures. In the main panel of the figure, the sound enters the inner ear (2), or cochlea, as shown by the vertical arrows entering the vertical tapered tubes. The tapered tube represents the cochlea as it would be if it were unrolled; the bold, dashed line down the middle of the tapered tube represents the basilar partition which runs down the middle of the cochlea. It contains the basilar membrane, which initiates the spectral analysis, assisted by the outer hair cells mounted along the length of the basilar membrane and the tectorial membrane above them.
In the cochlea, the components of an incoming sound are sorted according to frequency and the results are set out along the membrane to produce a quasi-logarithmic, or "tonotopic", frequency axis, shown by the frequency scale beside the left cochlea. This tonotopic axis is the frequency dimension of hearing, which appears as the vertical dimension of the simulated auditory images shown in the panels of Figures 2.1.1 and 2.1.2.
A variety of computational models of cochlear processing were developed in the 1980s and 1990s, when computers became generally available, to simulate the complex neural activity patterns that arise in the auditory nerve in response to broadband sounds like speech and music (Lyon, 1982, 1984; Seneff, 1988; Shamma, 1988; Deng, Geisler and Greenberg, 1988; Ghitza, 1988; Holdsworth et al., 1987; Patterson et al., 1988; Assmann and Summerfield, 1990; Meddis and Hewitt, 1991a). In each case, the cochlea is simulated with an auditory filterbank, which simulates the motion of the basilar partition, and some form of compressive adaptation mechanism, which simulates neural transduction. In these, and subsequent, auditory models, the bandwidth of the auditory filter increases quasi-logarithmically with the centre frequency of the filter, and the filters are typically distributed across frequency in proportion to the “equivalent rectangular bandwidth” (ERB) of the filter (Patterson, 1976).
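As a concrete illustration of ERB-based filter spacing, the short sketch below computes filter centre frequencies at equal intervals on an ERB-rate scale. It uses the Glasberg and Moore (1990) approximation ERB(f) = 24.7(4.37f/1000 + 1); the particular constants, frequency range and channel count are illustrative assumptions rather than values specified in this chapter.

```python
import numpy as np

# A common approximation (Glasberg & Moore, 1990) to the equivalent
# rectangular bandwidth (ERB) of the auditory filter centred on f Hz.
def erb(f):
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

# ERB-rate: the number of ERBs below f; filters spaced at equal ERB-rate
# intervals lie on a quasi-logarithmic tonotopic axis.
def erb_rate(f):
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def centre_frequencies(f_lo=100.0, f_hi=6000.0, n=75):
    """Place n filter centre frequencies at equal ERB-rate intervals."""
    rates = np.linspace(erb_rate(f_lo), erb_rate(f_hi), n)
    return (10.0 ** (rates / 21.4) - 1.0) * 1000.0 / 4.37

cfs = centre_frequencies()
for f in (cfs[0], cfs[len(cfs) // 2], cfs[-1]):
    print(f"cf = {f:7.1f} Hz   ERB = {erb(f):6.1f} Hz")
# The bandwidth grows roughly in proportion to centre frequency, so the
# high-frequency filters are much wider than the low-frequency filters.
```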
Frequency analysis in the computational version of AIM
In the computational version of AIM, the frequency analysis performed by the basilar partition is simulated by a bank of "gammatone," or "gammachirp," auditory filters (Patterson et al., 1992; Irino and Patterson, 2001). Figure 2.1.4 shows the output of a 75-channel gammatone auditory filterbank in response to four cycles of the vowel /ae/, as in the word 'hat'. Each line in the figure shows the output of an individual auditory filter. The ordinate shows the centre frequency of each filter, that is, the frequency to which it is most responsive. As the energy in the stimulus moves away from the centre frequency, the response of the filter becomes progressively smaller. So each of the lines in the figure summarises the activity in a given frequency region. The surface defined by the full set of lines in Figure 2.1.4 is AIM's simulation of the basilar membrane motion (BMM) produced as a function of time in response to this specific vowel. The basilar membrane is observed to oscillate relatively slowly in low-frequency channels and relatively quickly in high-frequency channels, which is, in general terms, what is meant by "frequency". Note, however, that in most channels the oscillation is relatively complex (i.e. not sinusoidal). The use of auditory filters to simulate BMM is described in Chapter 2.2. At this point, it is sufficient to note how the glottal pulses and vocal resonances of vowels appear in the BMM.
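The gammatone filter has a simple closed form: a gamma-function envelope, t^(n-1)exp(-2πb·ERB(fc)t), modulating a cosine carrier at the centre frequency fc. The sketch below generates impulse responses for a low and a high centre frequency and estimates how long each filter rings; the order and bandwidth constants follow common fourth-order gammatone conventions and are assumptions here, not values taken from this chapter.

```python
import numpy as np

def erb(f):
    # Glasberg & Moore (1990) approximation to the ERB at centre frequency f.
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(cf, fs=16000, order=4, b=1.019, dur=0.06):
    """Gammatone impulse response: a gamma envelope t**(order-1) *
    exp(-2*pi*b*ERB(cf)*t) modulating a cosine carrier at cf."""
    t = np.arange(0.0, dur, 1.0 / fs)
    env = t ** (order - 1) * np.exp(-2.0 * np.pi * b * erb(cf) * t)
    ir = env * np.cos(2.0 * np.pi * cf * t)
    return ir / np.max(np.abs(ir))

fs = 16000
for cf in (200.0, 3200.0):
    ir = np.abs(gammatone_ir(cf, fs))
    # Rough ringing time: the point by which 95% of the rectified
    # impulse-response amplitude has accumulated.
    ring_ms = 1000.0 * np.argmax(np.cumsum(ir) > 0.95 * ir.sum()) / fs
    print(f"cf = {cf:6.0f} Hz  ->  rings for roughly {ring_ms:4.1f} ms")
# The narrow 200-Hz filter rings far longer than the wide 3200-Hz filter.
```

The contrast in ringing time is the effect discussed in the next paragraph: narrow low-frequency filters respond to each glottal pulse slowly and ring on, whereas wide high-frequency filters respond and decay quickly.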
The glottal pulse rate for this /ae/ vowel is close to 125 Hz, so every 8 ms, the BM is hit by another glottal pulse. As a result, there is an abrupt increase in the level of activity in each channel every 8 ms, and the pattern of BMM activity repeats with a period of about 8 ms (1/125-Hz). The bandwidth of the auditory filter increases in proportion to its centre frequency, and the rate at which a filter can respond to a pulse depends on filter width (as explained in Chapter 2.2). This is the reason that the response to the pulse is relatively slow in low-frequency channels, and it is the progressive increase in bandwidth that produces what appears to be a progressive lag in filter response at low frequencies.
There are three concentrations of activity in the response of the BM to this vowel, centred on ERB-rate values of 13-14, 19-20, and 23 in Figure 2.1.4. These concentrations of activity reveal the effect of the resonances in the speaker's vocal tract on the speaker's glottal pulses. In speech research these resonances are referred to as formants, designated F1, F2 and F3. The upper formants (F2 and F3) appear as sequences of impulse responses, with the longest responses occurring in the centre of the formant where the level is greatest, and the shortest occurring between formants, where the activity of adjacent formants often interacts. In the region of the first formant, where the filters are relatively narrow, the formant is seen to ring on into the next glottal cycle, and there is a complicated phase interaction in the waves as the system responds to the next glottal pulse.
The concepts of compression, suppression and filter shape also become relevant at this point; they are considered in Chapter 2.2 and its appendices.
Complex sounds, like the sounds of speech and music, have a broad range of frequencies, and in this case the output of each filter is dominated by energy in the filter's passband, that is, the band of frequencies immediately adjacent to the centre frequency where the response is strongest. As a result, the pattern of BMM produced by pulse-resonance sounds is dominated by passband activity.
Neural transduction in the computational version of AIM
Together, the auditory filterbank and the adaptation mechanism that simulates neural transduction convert the sound wave entering the cochlea into a two-dimensional simulation of the complex neural activity pattern produced by the cochlea in the auditory nerve in response to an input sound. The auditory filterbank, the scale of the frequency dimension, and the properties that the filtering imparts to the auditory image are described in Chapter 2.2; it explains how the cochlea performs the frequency analysis that creates the tonotopic dimension of the auditory image.
In the lowest frequency channels (ERB 8 and below), the auditory filter is relatively narrow and the output of the auditory filter is effectively sinusoidal in shape; that is, the auditory filters isolate the individual harmonics of the repetition rate of the vowel. The first harmonic (the fundamental) is not apparent in the BMM, but close examination of channels with ERBs in the range 6-8 shows activity associated with the second harmonic (250 Hz); the activity is seen to rise and fall twice per cycle of the wave.
The Location of the Auditory Image
In the next stage of processing, the streams of neural impulses flowing from the two cochleas are compared on a channel-by-channel basis to determine whether the components of the sound in a given frequency region arrived at the left or right ear first, and by how much. This information helps to determine the location of the source of the sound. This arrival-time comparison takes place in the brain stem as the neural pulse patterns pass from the cochlear nucleus to the inferior colliculus [through the medial nucleus of the trapezoid body (MNTB), the superior olivary complex (SOC), and the lateral lemniscus (LL)]. The basics of the operation of the arrival-time mechanism are illustrated in the central panel of Figure 2.1.3 (although the details have recently been shown to depend as much on neural inhibition as they do on spatial position). In the example, there is one sound source 40 degrees to the right of the listener and another 20 degrees to the left of the listener. Sound from the source on the right arrives at the right ear a fraction of a millisecond sooner than it does at the left ear, and this arrival-time difference is preserved by the spectral analysis performed in the cochlea. After frequency analysis and neural transduction, activity in channels with the same centre frequency travels across the laterality line specific to that pair of channels, and the information travels at the same, fixed rate. In the current example, for the sound on the right, the mid-frequency components from the right cochlea enter their laterality lines sooner than the mid-frequency components from the left cochlea, and so the streams meet at a point to the left of the mid-point on the laterality line, and it is the same point in all of the channels. Where they meet specifies the angle of the source relative to the head, as indicated by the values across the top of the frequency-laterality plane. Similarly, the sound from the source 20 degrees to the left of the listener arrives at the left ear a fraction sooner than at the right ear. So, both the high-frequency components and the low-frequency components in this sound meet at a point to the right of the mid-point on the laterality line, and it is the same point for all of the channels excited by the source 20 degrees to the left of the listener.
The coincidence detection process is applied on a channel-by-channel basis and the individual laterality estimates are combined to determine the direction of sources in the horizontal plane surrounding the listener. The mechanism does not reveal the elevation of a source relative to the listener, nor the distance of the source from the listener; it just indicates the angle of the source relative to the head in the horizontal plane. Nevertheless, this laterality information is of considerable assistance in locating and segregating sources. The details are presented in Chapter 2.3.
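The arrival-time comparison can be caricatured in code as a per-channel cross-correlation: the lag at which the left-ear and right-ear signals of a given frequency channel coincide best is taken as the interaural time difference for that channel. This is only a stand-in for the coincidence and inhibition mechanisms described above; the signal, sample rate and delay below are illustrative assumptions.

```python
import numpy as np

def itd_estimate(left, right, fs, max_itd=0.0008):
    """Estimate the interaural time difference for one frequency channel by
    finding the lag at which the two ears' signals coincide best -- a
    cross-correlation stand-in for a coincidence-detection array."""
    max_lag = int(max_itd * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    # A positive lag delays the right-ear signal, i.e. it tests the hypothesis
    # that the sound reached the right ear first.
    scores = [np.dot(left, np.roll(right, lag)) for lag in lags]
    return lags[int(np.argmax(scores))] / fs          # seconds

fs = 48000
t = np.arange(0, 0.02, 1.0 / fs)
channel = np.cos(2 * np.pi * 500 * t)                 # one low-frequency channel
delay = int(0.0003 * fs)                              # source off to the right
right = channel                                       # arrives at the right ear first
left = np.roll(channel, delay)                        # ~0.3 ms later at the left ear
print(f"estimated ITD = {1e6 * itd_estimate(left, right, fs):.0f} microseconds")
```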
The view of spectral analysis and laterality analysis presented here is the standard view of auditory preprocessing for frequency and laterality. Together, these two processes produce two of the dimensions of the space of auditory perception as we experience them -- the tonotopic dimension of the auditory image and the laterality dimension of auditory space perception. Up to this level, AIM is a straightforward functional model of the initial stages of auditory processing, and as such it is relatively uncontroversial.
The Time-Interval Dimension of the Auditory Image
Now consider what we do and do not know about a sound as it occurs at a point on this frequency-laterality plane. We know that the sound contains energy at a specific frequency, say 1.0 kHz, and that the source of the sound is at a specific angle relative to the head, say 40 degrees to the right. What we do not know is whether the sound is regular or irregular; whether the source is a normally voiced vowel or a whispered vowel. The system has information about the distribution of activity across frequency and the distribution of laterality values for those channels. But the distributions for voiced and whispered vowels are quite similar, and if the whispered source were closer than the voiced source, they might have very similar levels. In this case, the prominent difference between the vowels (voiced vs whispered) is contained in the degree of temporal regularity, which is not represented in any given channel of the frequency-laterality plane. The crucial information in this case is temporal regularity; the temporal fine-structure of the filtered waves flowing from the cochlea is highly regular in the case of a bassoon note and highly irregular in the case of washing-machine noise. Information about fine-structure regularity is not available in the frequency-laterality plane. The frequency-laterality plane is a representation of sound in the auditory system prior to temporal integration; as such, it represents the instantaneous state of the sound and it does not include any representation of the variability of the sound over time.
The sophisticated processing of sound quality by humans indicates that peripheral auditory processing includes at least one further stage of processing; a stage in which the components of the sound are sorted, or filtered, according to the time intervals in the fine structure of the neural activity pattern flowing from the cochlea. It is as if the system set up a vertical array of dynamic histograms behind the frequency-laterality plane, one histogram for each frequency-laterality combination. When the sound is regular, the histogram contains many instances of a few time intervals; when the sound is irregular, the histogram contains a few instances of many time intervals.
The measurement of time intervals in the NAP and the preservation of time-interval differences in a histogram add the final dimension to the auditory image and our space of auditory perception. This space is illustrated in Figure 2.1.6, where vertical planes are shown fanning out behind the frequency-laterality surface of Figure 2.1.3. Each vertical plane is made up of the time-interval histograms for all the channels associated with a particular laterality. For a point source, all of the information about the temporal regularity of the source appears in one of these planes, and that information is the Auditory Image of the source at that moment. For a tonal sound the time intervals are highly regular and there is an orderly relationship between the time-interval patterns in different channels, as illustrated by the auditory images of the click train and the vowel shown in Figures 2.1.1a and 2.1.2a. For noisy sounds the time intervals are highly irregular and any pattern that does form momentarily in one portion of the image is unrelated to activity in the rest of the image, as illustrated by the auditory images of the white noise and the fricative consonants shown in Figures 2.1.1b and 2.1.2b.
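The notion of a per-channel time-interval histogram can be made concrete with a toy calculation: histogram the intervals between successive large peaks in a channel's activity. For a periodic sound the counts pile up in a few bins; for a noise they scatter across several bins. The threshold and bin width below are arbitrary illustrative choices, not parameters of AIM.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000

def interval_histogram(x, fs, thresh=0.5):
    """Toy per-channel interval histogram: histogram the time intervals
    between successive local peaks that exceed a threshold."""
    peaks = np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]) &
                           (x[1:-1] > thresh * x.max())) + 1
    intervals = np.diff(peaks) / fs                       # seconds
    return np.histogram(intervals, bins=np.arange(0.0, 0.02, 0.001))[0]

t = np.arange(0, 0.2, 1.0 / fs)
tone = np.maximum(np.cos(2 * np.pi * 125 * t), 0.0) ** 4  # regular: 8-ms period
noise = np.abs(rng.standard_normal(len(t)))               # irregular

print("tone :", interval_histogram(tone, fs))   # counts pile up in the 8-ms bin
print("noise:", interval_histogram(noise, fs))  # counts scattered across several bins
```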
Strobed Temporal Integration
In AIM, the time-interval filtering process is performed by a new form of strobed temporal integration, motivated originally by a contrast between the representation of periodic sounds at the output of the cochlea and the perceptions that we hear in response to these sounds. Consider the microstructure of any single channel of the neural activity pattern produced by a periodic sound like a vowel at the output of the cochlea (Figure 2.1.5). Since the neural pattern repeats with each cycle of the sound, the microstructure of the NAP consists of alternating bursts of activity and quiescence. The level of activity at the output of the cochlea must correspond to the loudness of the sound at some level in the auditory system, since an absence of activity corresponds to silence and the average level of activity increases with the intensity of the sound. If the NAP were the internal representation that corresponds to our perception of the sound, we would expect periodic sounds to give rise to loudness fluctuations, since the activity level in the NAP oscillates between the level produced by the glottal pulses and silence. But periodic sounds do not give rise to the sensation of loudness fluctuations. Indeed, quite the opposite, they produce static auditory images with exquisite detail. These observations indicate that some form of temporal integration is applied to the neural activity pattern after it leaves the cochlea and before the system creates the internal representation that forms the basis of our initial perception of the sound. Until recently, it was assumed that simple temporal averaging could be used to perform the temporal integration and, indeed, if one averages over 10-20 cycles of a sound, the output will be relatively constant even when the input level is oscillating. However, the period of male vowels is on the order of 8 ms, and if the temporal duration of the moving average (its integration time) is long enough to produce stable output, it smears out features in the fine-structure of the activity pattern; features that make a voice guttural or twangy and which help us distinguish speakers.
In AIM, a new form of temporal integration has been developed to stabilise the fast-flowing neural activity pattern without smearing the microstructure of activity patterns from tonal sounds. Briefly, a bank of delay lines is used to form a buffer store for the neural activity flowing from the cochlea; the pattern decays away as it flows, out to about 80 ms. Each channel has a strobe unit which monitors the instantaneous activity level and, when it encounters a large peak, it transfers the entire record in that channel of the buffer to the corresponding channel of a static image buffer, where the record is added, point for point, to whatever is already in that channel of the image buffer. Information in the image buffer decays exponentially with a half-life of about 40 ms. The multi-channel result of this strobed temporal integration process is the auditory image. For periodic and quasi-periodic sounds, the strobe mechanism rapidly matches the temporal integration period to the period of the sound and, much like a stroboscope, it produces a stable image of the repeating temporal pattern flowing up the auditory nerve. The Stabilized Auditory Image (SAI) created from the NAP of the vowel /ae/ in "hat" (Figure 2.1.5) is presented in Figure 2.1.7.
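A minimal single-channel sketch of strobed temporal integration is given below, under the assumptions that strobe points are large local maxima in the NAP, that at each strobe the most recent stretch of the NAP is added into an image buffer indexed by time interval (with the strobe point at interval 0), and that the buffer decays with a half-life of about 40 ms. The function and parameter names are illustrative, not the AIM software's API.

```python
import numpy as np

def strobed_integration(nap_channel, fs, image_len=0.035, half_life=0.04,
                        strobe_thresh=0.5):
    """Minimal single-channel sketch of strobed temporal integration.
    nap_channel is a non-negative neural activity pattern; the return value
    is one row of a stabilised auditory image, indexed by time interval."""
    n_int = int(image_len * fs)
    image = np.zeros(n_int)
    decay_per_sample = 0.5 ** (1.0 / (half_life * fs))
    peak = nap_channel.max()
    last_strobe = None
    for n in range(1, len(nap_channel) - 1):
        x = nap_channel[n]
        is_local_max = x > nap_channel[n - 1] and x >= nap_channel[n + 1]
        if is_local_max and x > strobe_thresh * peak:
            # Decay the image for the time elapsed since the previous strobe,
            # then add the most recent stretch of the NAP, reversed so that
            # the strobe point itself lands at time interval 0.
            if last_strobe is not None:
                image *= decay_per_sample ** (n - last_strobe)
            segment = nap_channel[max(0, n - n_int + 1):n + 1][::-1]
            image[:len(segment)] += segment
            last_strobe = n
    return image

fs = 16000
t = np.arange(0, 0.2, 1.0 / fs)
nap = np.maximum(np.cos(2 * np.pi * 125 * t), 0.0) ** 4   # toy NAP: bursts every 8 ms
sai_row = strobed_integration(nap, fs)
ridge = np.argmax(sai_row[int(0.004 * fs):]) + int(0.004 * fs)
print(f"largest ridge beyond 4 ms lies at {1000.0 * ridge / fs:.1f} ms")   # ~8.0 ms
```

Run on a toy NAP channel with an 8-ms period, the resulting image row shows stable ridges at multiples of 8 ms, analogous to the repeating vertical structure in the simulated auditory images of Figures 2.1.1a and 2.1.2a.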
In AIM, it is strobed temporal integration that performs the time-interval filtering and creates the time-interval dimension of the auditory image. It is a time-interval dimension rather than a time dimension; activity at time interval ti means that there was recently activity in the neural pattern ti milliseconds before a large pulse which initiated temporal integration. The pulse that caused the temporal integration is added into the image at the time interval 0 ms. Neither the strobe pulse nor the pulse separated by ti ms is separately observable in the auditory image; they have been combined with previous pulses that bore the same temporal relationship in the neural activity pattern. The activity in the image is the decaying sum of all such recent events. As time passes the pattern does not flow left or right; when the sound comes on, the image builds up rapidly in place, and when the sound goes off, the image fades away rapidly in place.
Strobed temporal integration is described in Chapter 2.4 where it is argued that STI completes the image construction process, and the dynamic version of the auditory image is the basis of our initial perception of sound. STI is also responsible for determining the basic figure-ground relationships in hearing, and the degree to which we perceive and do not perceive phase relationships in a sound.
Since the model emphasizes the primacy of auditory images, and since it identifies this specific representation of sound as playing a crucial role in auditory perception, the model is referred to as AIM -- the Auditory Image Model.