Making the auditory figure scale-shift covariant
This section illustrates how the expression for auditory perception (Eq. 8) can explain several basic properties of auditory perception. From the perspective of communication, the vowel contains three important components of the information in speech. The first component is the “message”: the vocal tract is currently in the shape that the brain associates with the phoneme /a/. This message is contained in the shape of the resonance, which is the same in every cycle of all four waves of Figure 5. The second component is the glottal pulse rate, which determines the pitch of the voice. In the left-hand column of the figure, an adult has spoken the /a/ with a fast glottal pulse rate (a) and then a slow glottal pulse rate (b); the resonances are identical, since it is the same person speaking the same vowel. The third component is the resonance rate. In the right-hand column, the same vowel is spoken by a child with a short vocal tract (c) and an adult with a long vocal tract (d) using the same glottal pulse rate. The glottal pulse rate and the shape of the resonance (the message) are the same, but the rate at which the resonance proceeds within the glottal cycle is faster for the child (c). That is, the resonances of the child ring faster, in terms of both the resonance frequency and the decay rate. In summary, the stationary segments of the voiced parts of speech carry three forms of information about the sender: the shape of the vocal tract, its length, and the rate at which it is being excited by glottal pulses.
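The independence of these three components is easy to demonstrate in synthesis. The sketch below is not from the original text; the function name pulse_resonance and all parameter values are illustrative assumptions. It builds a toy pulse-resonance sound by convolving a glottal pulse train with a single damped resonance, so that pulse rate and resonance rate can be varied separately while the “message” (the shape of the resonance) stays fixed:

```python
import numpy as np

def pulse_resonance(pulse_rate, resonance_freq, decay_rate,
                    duration=0.5, fs=16000):
    """Synthesise a toy pulse-resonance sound: a glottal pulse train
    convolved with a single damped resonance.

    pulse_rate     : glottal pulse rate in Hz (sets the pitch)
    resonance_freq : resonance frequency in Hz (scales inversely with
                     vocal-tract length)
    decay_rate     : exponential decay in 1/s (scales with resonance_freq
                     for a uniformly shortened vocal tract)
    """
    n = int(duration * fs)
    # Pulse train: one unit impulse per glottal cycle.
    pulses = np.zeros(n)
    pulses[::int(fs / pulse_rate)] = 1.0
    # Damped sinusoid standing in for one vocal-tract resonance (50 ms).
    t = np.arange(int(0.05 * fs)) / fs
    resonance = np.exp(-decay_rate * t) * np.sin(2 * np.pi * resonance_freq * t)
    return np.convolve(pulses, resonance)[:n]

# Same "message" (resonance shape), different pulse rates -- cf. panels (a), (b):
adult_fast = pulse_resonance(pulse_rate=160, resonance_freq=700, decay_rate=80)
adult_slow = pulse_resonance(pulse_rate=110, resonance_freq=700, decay_rate=80)

# Same pulse rate, different resonance rates -- cf. panels (c), (d):
child = pulse_resonance(pulse_rate=130, resonance_freq=900, decay_rate=103)
adult = pulse_resonance(pulse_rate=130, resonance_freq=700, decay_rate=80)
```

Note that the child's resonance frequency and decay rate are scaled by the same factor, mimicking a uniformly shortened vocal tract in which the resonances ring faster in both respects.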
Irino and Patterson (2002), Patterson, van Dinther and Irino (2007), and Patterson et al. (2007) have argued that the robustness of auditory perception to variation in the pulse rate and resonance rate of communication sounds indicates that the auditory preprocessor isolates and normalizes the neural patterns produced by individual cycles of communication sounds in the auditory nerve, and they have shown how the original version of AIM can be extended to produce scale-shift invariant, or covariant, representations of pulse-resonance sounds. Essentially, in each channel of the auditory image, the time-interval dimension must be dilated by the centre frequency of the channel, so that time intervals are re-expressed in cycles of that channel's centre frequency. This produces a scale-shift-covariant version of the auditory image (sscAI).
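In concrete terms, the dilation amounts to resampling each channel of the auditory image onto a common axis measured in cycles of that channel's centre frequency. The sketch below illustrates the idea under stated assumptions; the function name scale_shift_covariant_image, the linear interpolation, and the binning of the cycles axis are illustrative choices, not the published implementation:

```python
import numpy as np

def scale_shift_covariant_image(auditory_image, centre_freqs, fs,
                                n_cycles=40, n_bins=200):
    """Dilate the time-interval axis of each channel by its centre
    frequency, re-expressing intervals in cycles of that frequency.

    auditory_image : (n_channels, n_intervals) array; column j of row i
                     holds activity at time interval j / fs in channel i
    centre_freqs   : (n_channels,) centre frequency of each channel in Hz
    Returns an (n_channels, n_bins) array on a shared cycles axis.
    """
    n_channels, n_intervals = auditory_image.shape
    intervals = np.arange(n_intervals) / fs          # intervals in seconds
    cycles_axis = np.linspace(0, n_cycles, n_bins)   # shared cycles axis
    ssc = np.empty((n_channels, n_bins))
    for i, cf in enumerate(centre_freqs):
        # interval * centre frequency = interval measured in cycles of cf,
        # so each channel is dilated in proportion to its centre frequency.
        ssc[i] = np.interp(cycles_axis, intervals * cf, auditory_image[i])
    return ssc
```

On this cycles axis, a uniform change in resonance rate (that is, a change of vocal-tract length) shifts the pattern along the channel dimension without distorting its shape, which is what makes the representation covariant with shifts of acoustic scale.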
The importance of this representation of sound is that it separates the three aspects of vowel information, and shows how each aspect of the information can vary independently of the other two. This is the representation used in the videos of ‘Le mot pipe’ and the communication syllables. The representation is produced with an extended version of the Auditory Image Model (AIM) (Patterson, 1994a, b). The first three stages of AIM are typical of most time-domain models of auditory processing. (1) A band-pass filter simulates the operation of the outer and middle ears. (2) An auditory filterbank simulates the spectral analysis performed in the cochlea by the basilar partition (Unoki et al., 2006). (3) The simulated membrane motion is converted into a simulation of the phase-locked neural activity pattern (NAP) that flows from the cochlea in response to the sound, by compressing, half-wave rectifying and low-pass filtering the membrane motion, separately in each filter channel.

The NAP produced in response to the /i:/ in ‘pipe’ is shown in Figure 5a. The dimensions of the NAP are time (the abscissa) and auditory-filter centre frequency on a quasi-logarithmic axis (the ordinate). The NAP shows how the individual cycles of the vowel produce a pattern of vowel activity in the auditory nerve. In AIM, each of these patterns is isolated and used to produce an auditory figure, and the stream of 2D figures is time-averaged to produce an auditory event, a temporally stable representation of the acoustic event in the auditory image (Patterson et al., 1992; Patterson et al., 1995). In the latest version of AIM (Patterson, van Dinther and Irino, 2007), the time-interval dimension of the auditory image is expanded on a channel-by-channel basis to produce the sscAI.
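A minimal sketch of those first three stages is given below, assuming a gammatone filterbank as the cochlear simulation, a square-root compression, and a 1.2 kHz low-pass cutoff; these particular choices, and the function name neural_activity_pattern, are illustrative assumptions rather than the published AIM implementation:

```python
import numpy as np
from scipy.signal import butter, sosfilt, gammatone, lfilter

def neural_activity_pattern(signal, fs, centre_freqs):
    """Toy version of AIM's first three stages (a sketch, not AIM itself).

    (1) Band-pass pre-filter for the outer/middle ears.
    (2) Gammatone filterbank standing in for the cochlear analysis.
    (3) Compression, half-wave rectification and low-pass filtering to
        approximate the phase-locked neural activity pattern (NAP).
    Returns an (n_channels, n_samples) NAP array.
    """
    # (1) Crude outer/middle-ear band-pass (0.45-7 kHz passband assumed).
    sos = butter(2, [450, 7000], btype='bandpass', fs=fs, output='sos')
    pre = sosfilt(sos, signal)

    nap = np.empty((len(centre_freqs), len(signal)))
    lp = butter(2, 1200, btype='low', fs=fs, output='sos')  # assumed cutoff
    for i, cf in enumerate(centre_freqs):
        b, a = gammatone(cf, 'fir', fs=fs)       # (2) one cochlear channel
        bm = lfilter(b, a, pre)                  # simulated membrane motion
        compressed = np.sign(bm) * np.abs(bm) ** 0.5  # instantaneous compression
        rectified = np.maximum(compressed, 0.0)       # half-wave rectification
        nap[i] = sosfilt(lp, rectified)               # (3) low-pass -> NAP channel
    return nap

# Example: NAP of a stand-in signal on 30 channels, 0.1-4 kHz, log-spaced.
fs = 16000
cfs = np.geomspace(100, 4000, 30)
tone = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
nap = neural_activity_pattern(tone, fs, cfs)
```

The subsequent AIM stages described above (isolating the pattern of each cycle and time-averaging the stream of figures into a stable auditory image) would operate on an array like this one, channel by channel.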