AIM2006ModulesSAI
From CNBH Acoustic Scale Wiki
The SAI module uses the strobe points to convert the NAP into an auditory image, in which the pulse-resonance pattern of a periodic sound is stabilised. The process is referred to as Strobed Temporal Integration (STI), and it converts the time dimension of the NAP into the time-interval dimension of the stabilised auditory image (SAI). The vertical ridge in the auditory image associated with the repetition rate of the source can be used to identify the start point for any resonance in that channel, and this makes it possible to completely segregate the glottal pulse rate from the resonance structure of the vocal tract. To repeat, the abscissa of the SAI is no longer time but time interval, and the structure in each channel is what physiologists would refer to as a post-glottal-pulse, time-interval histogram.
The default SAI module is ti2003. There are two SAI modules available:
- ti2003: this version adapts temporal integration to the strobe rate; that is, the higher the rate, the smaller the proportion of each NAP pulse that is added in to the auditory image.
- ti1992: this version integrates the NAP pulses into an array of time-interval histograms referred to as the auditory image (Patterson, 1994b; Patterson et al., 1992).
Background
Once the strobe points have been located, the NAP can be converted into an auditory image using STI, which is a discrete form of temporal integration. Aim2006 offers two SAI options for performing the conversion: ti1992 and ti2003. Strobed temporal integration converts the time dimension of the neural activity pattern into the time-interval dimension of the stabilised auditory image (SAI), and it preserves the time-interval patterns of repeating sounds (Patterson et al., 1995).
The ti1992 module (Patterson, 1994b; Patterson et al., 1992) works as follows: when a strobe is identified in a given channel, the previous 35 ms of activity is transferred to the corresponding channel of the auditory image and added, point for point, to the contents of that channel of the image. The peak of the NAP pulse that initiates temporal integration is mapped to the 0-ms time interval in the auditory image. Before addition, however, the NAP level is progressively attenuated by 2.5% per ms, and so there is nothing to add in beyond 40 ms. This 'NAP decay' was implemented to ensure that the main pitch ridge in the auditory image would disappear as the period approached that associated with the lower limit of musical pitch, which is about 32 Hz (Pressnitzer, Patterson, & Krumbholz, 2001). The fact that ti1992 integrates information from the NAP to the auditory image in 35-ms chunks leads to short-term instability in the amplitude of the auditory image. This is not a problem when the view is from above, as in the main window of the auditory image display. It is, however, a problem when the tonotopic profile of the auditory image is used as a pre-processor for automatic speech recognition, because some of these devices are particularly sensitive to level variability.
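The ti1992 step described above can be sketched for a single channel as follows. This is a minimal illustration, not the aim2006 code: the function name ti1992_add_strobe and its parameters are hypothetical, the attenuation is assumed to be linear in time interval (which is what makes the contribution vanish at 40 ms), and any decay of the image between strobes is left out.

```python
def ti1992_add_strobe(image, nap, strobe_idx, fs_hz,
                      window_ms=35.0, decay_per_ms=0.025):
    """Add the NAP activity preceding one strobe into one SAI channel.

    image      -- list of floats indexed by time interval (lag in samples)
    nap        -- list of floats, one NAP channel sampled at fs_hz
    strobe_idx -- sample index of the strobe; it maps to lag 0
    The linear 2.5%-per-ms attenuation is an assumption of this sketch.
    """
    n_lags = min(len(image), int(window_ms * fs_hz / 1000.0) + 1)
    for lag in range(n_lags):
        src = strobe_idx - lag          # look backwards from the strobe
        if src < 0:
            break                       # ran out of NAP history
        interval_ms = lag * 1000.0 / fs_hz
        gain = max(0.0, 1.0 - decay_per_ms * interval_ms)
        image[lag] += gain * nap[src]   # point-for-point addition
    return image


# Toy NAP: unit pulses every 10 ms, sampled at 1 kHz (1 sample per ms).
fs_hz = 1000.0
nap = [0.0] * 31
for i in (0, 10, 20, 30):
    nap[i] = 1.0
image = ti1992_add_strobe([0.0] * 36, nap, strobe_idx=30, fs_hz=fs_hz)
ridge_levels = [image[k] for k in (0, 10, 20, 30)]
```

In this toy case the pulses land at time intervals of 0, 10, 20 and 30 ms, and each successive one is attenuated by a further 25% of the original level, which illustrates how the NAP decay suppresses ridges at long time intervals.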
The default SAI module is ti2003, which is causal and eliminates the need for the NAP buffer. It also reduces the level perturbations in the tonotopic profile, and it reduces the relative size of the higher-order time intervals as required by more recent models of pitch and timbre (Kaernbach & Demany, 1998; Krumbholz et al., 2003). It operates as follows: when a strobe occurs, it initiates a temporal integration process during which NAP values are scaled and added into the corresponding channel of the SAI as they are generated; the time interval between the strobe and a given NAP value determines the position where the NAP value is entered in the SAI. In the absence of any succeeding strobes, the process continues for 35 ms and then terminates. If more strobes appear within 35 ms, as they usually do in music and speech, then each strobe initiates a temporal integration process, but the weights on the processes are constantly adjusted so that the level of the auditory image is normalised to that of the NAP. Specifically, when a new strobe appears, the weights of the older processes are reduced so that older processes contribute relatively less to the SAI. The weight of process n is 1/n, where n is the number of the strobe in the set (the most recent strobe being number 1). The strobe weights are normalised so that the total of the weights is 1 at all times. This ties the overall level of the SAI more closely to that of the NAP and ensures that the tonotopic profile of the SAI is like that of the NAP at all times.
Nomenclature: Normally, sf2003 is used with ti2003, and sf1992 is used with ti1992. In such cases, for convenience, the combinations are referred to as sti03 and sti92, respectively.
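The weighting rule for ti2003 can be made concrete with a short sketch. The names strobe_weights and ti2003_channel are hypothetical, and the sketch assumes that "the number of the strobe in the set" means its rank among the processes that are still running, counted from the most recent strobe. It is an illustration of the description above rather than the aim2006 implementation.

```python
def strobe_weights(n_active):
    """Weights 1/n for n = 1..n_active (most recent first), normalised to sum to 1."""
    raw = [1.0 / n for n in range(1, n_active + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def ti2003_channel(nap, strobe_indices, fs_hz, window_ms=35.0):
    """Causal strobed temporal integration for one channel (sketch).

    Each strobe starts a process that runs for window_ms; every incoming
    NAP sample is shared between the running processes according to
    strobe_weights and entered at lag = time since that strobe.
    """
    n_lags = int(window_ms * fs_hz / 1000.0) + 1
    image = [0.0] * n_lags
    strobes = set(strobe_indices)
    active = []                                  # strobe times, most recent first
    for t, value in enumerate(nap):
        if t in strobes:
            active.insert(0, t)
        active = [s for s in active if t - s < n_lags]   # terminate old processes
        if not active:
            continue
        for w, s in zip(strobe_weights(len(active)), active):
            image[t - s] += w * value
    return image


weights3 = strobe_weights(3)                     # 6/11, 3/11, 2/11
flat_nap = [1.0] * 20                            # 20 ms of constant activity at 1 kHz
sai_channel = ti2003_channel(flat_nap, strobe_indices=[0, 10], fs_hz=1000.0)
```

Because the weights always sum to 1, the total activity entered into the image in this example equals the total activity in the NAP, which is the sense in which the level of the SAI is tied to that of the NAP.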
Figure 10a shows the SAIs for the four example vowels using the dcgc/hl/sti03 model. Note the strong vertical ridge at the time interval associated with the pitch of the vowel. The resonance information now appears aligned on the vertical ridge. There are also secondary cycles at lower levels. The tonotopic profiles in the right-hand panels of the subfigures present a clear representation of the formant structure of each vowel, and the tonotopic profiles in the upper row of the figure for the scale value of 122 are shifted up relative to those in the lower row where the scale value is 89.
The time-interval profile in the panel below each auditory image shows the average across the channels of the auditory image. It shows that in periodic sounds, the ridges beside the vertical ridge associated with the pitch of the sound have a slant (the time interval between the vertical ridge and the slanting adjacent ridges is 1/f ms, so it decreases as f increases), and this means that the peaks in the time-interval profile associated with the adjacent ridges are small relative to that of the main vertical ridge. This feature improves pitch detection when it is based on the time-interval profile.
Figure 10b shows the SAIs for the four example vowels using the gt/hcl/sti03 model. They also exhibit strong vertical ridges at the time interval associated with the pitch of the vowel. The resonance structure is anchored to the vertical ridge, but the details of the fine structure of the lower formants are less clear due to the log compression. The tonotopic profiles in the right-hand panels of the subfigures show that the lower formants are less clear but the upper formants are clearer.
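As a rough illustration of why slanted side ridges average out while the vertical ridge does not, the sketch below computes a time-interval profile from a toy two-channel image and reads a pitch estimate from its largest peak. The function names, the toy data and the choice to skip the 0-ms strobe peak via min_interval_ms are all assumptions of this example, not aim2006 behaviour.

```python
def time_interval_profile(sai):
    """Average the SAI (list of channels, each a list over lag) across channels."""
    n_channels = len(sai)
    n_lags = len(sai[0])
    return [sum(ch[k] for ch in sai) / n_channels for k in range(n_lags)]


def pitch_from_profile(profile, fs_hz, min_interval_ms=1.0):
    """Return the pitch (Hz) of the largest profile peak beyond min_interval_ms."""
    first = max(1, int(round(min_interval_ms * fs_hz / 1000.0)))
    best_lag = max(range(first, len(profile)), key=lambda k: profile[k])
    return fs_hz / best_lag


# Toy image at 1 kHz: both channels share a ridge at 8 ms (vertical ridge),
# but their side ridges fall at different lags (6 ms and 7 ms), i.e. slanted.
ch_low = [0.0] * 36
ch_high = [0.0] * 36
for ch in (ch_low, ch_high):
    ch[0] = 1.0
    ch[8] = 1.0
ch_low[6] = 0.6
ch_high[7] = 0.6

profile = time_interval_profile([ch_low, ch_high])
pitch_hz = pitch_from_profile(profile, fs_hz=1000.0)
```

In the averaged profile the aligned ridge at 8 ms keeps its full height, while each side ridge is halved because it appears in only one channel, so the pitch estimate here is 125 Hz.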
Dynamic sounds
For periodic and quasi-periodic sounds, STI rapidly adapts to the period of the sound and strobes roughly once per period. In this way, it matches the temporal integration period to the period of the sound and, much like a stroboscope, it produces a static auditory image of the repeating temporal pattern in the NAP as long as the sound is stationary. If a sound changes abruptly from one form to another, the auditory image of the initial sound collapses and is replaced by the image of the new sound. In speech, however, the rate of glottal cycles is typically large relative to the rate of change in the resonance structure, even in diphthongs, so the auditory image is like a high-quality animated cartoon in which the vowel figure changes smoothly from one form to the next. The dynamics of these processes can be observed using aim2006, and the frame rate can be adjusted to suit the dynamics of the sound and the analysis. It is also possible to generate a QuickTime movie with synchronized sound for reviewing with standard media players.
Time-interval scale (linear or logarithmic): aim2006 offers the option of plotting the SAI on either a linear time-interval scale, as in previous versions of AIM, or on a logarithmic time-interval scale. The latter was implemented for compatibility with the musical scale. Vowels and musical notes produce vertical structures in the SAI around their pitch period, and the peak in the time-interval profile specifies the pitch. If the SAI is plotted on a logarithmic scale, then the pitch peak in the time-interval profile moves equal distances along the pitch axis for equal musical intervals. The time-interval scale can be set to linear in the parameter file for the ti-module; change the value of the variable ti2003.
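A small worked example shows the equal-distance property of the logarithmic scale. The three pitch periods are illustrative values chosen to be an octave apart; they are not taken from the figures.

```python
import math

# Pitch periods (ms) of three notes, each an octave above the previous one:
# 125 Hz, 250 Hz and 500 Hz.
periods_ms = [8.0, 4.0, 2.0]

# Distance the pitch peak moves on a linear time-interval axis (ms).
linear_steps = [a - b for a, b in zip(periods_ms, periods_ms[1:])]

# Distance the pitch peak moves on a base-2 logarithmic axis (octaves).
log_steps = [math.log2(a) - math.log2(b)
             for a, b in zip(periods_ms, periods_ms[1:])]
```

On the linear axis the two octave steps move the peak by 4 ms and then 2 ms, whereas on the logarithmic axis both steps have the same length of one octave.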
Autocorrelation: The time-interval calculations in AIM often provoke comparison with autocorrelation and the autocorrelogram (Meddis & Hewitt, 1991), and indeed, models of pitch perception based on AIM make similar predictions to those based on autocorrelation (Patterson, Yost, Handel, & Datta, 2000). It should be noted, however, that the autocorrelogram is symmetric locally about all vertical pitch ridges, and this limits its utility with regard to aspects of perception other than pitch. For example, it cannot explain the changes in perception that occur when sounds are reversed in time (Akeroyd & Patterson, 1997; Patterson & Irino, 1998) whereas the SAI can.
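The insensitivity of autocorrelation to time reversal can be checked directly. The sketch below compares a decaying pulse with its time-reversed version; the signal values are invented for illustration, and after_peak is only a crude stand-in for anchoring the analysis on a strobe at the largest peak, not the sf2003 strobe criterion.

```python
def autocorrelation(x):
    """Unnormalised autocorrelation of x for non-negative lags."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]


def after_peak(x):
    """Samples from the largest peak onwards (crude strobe-anchored view)."""
    return x[x.index(max(x)):]


damped = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0]   # abrupt onset, gradual decay
ramped = damped[::-1]                         # same samples, reversed in time

ac_damped = autocorrelation(damped)
ac_ramped = autocorrelation(ramped)
same_autocorrelation = all(abs(a - b) < 1e-12
                           for a, b in zip(ac_damped, ac_ramped))

view_damped = after_peak(damped)              # full decay follows the peak
view_ramped = after_peak(ramped)              # nothing follows the peak
```

The two autocorrelation functions are identical, so they cannot distinguish the two sounds, whereas the peak-anchored views differ because they preserve the order of events relative to the strobe.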