Obligatory streaming based on acoustic scale difference

From CNBH Acoustic Scale Wiki


Etienne Gaudrain, Alessandro Binetti, Roy Patterson

This document has been presented at the BSA Short Paper Meeting, 18-19 September 2008, York.



The voice of a speaker contains information that helps to identify the speaker and to segregate their voice from those of others in a multi-speaker environment. Specifically, there is information about the size of the speaker's vocal folds in their mean glottal pulse rate (GPR), and information about their vocal-tract length (VTL) in the formant frequencies of their vowels. It seems likely that this size information is used to identify and track a target individual in a multi-speaker environment.

Darwin et al. (2003) reported that GPR and VTL reinforce each other in concurrent sentence reception, raising performance above what would be expected on the basis of either component alone. This observation suggests that, for normal combinations of GPR and VTL values, these two factors interact to form a reliable speaker-size estimate that is used to segregate concurrent speakers. The question raised in the current study is whether the mechanism is primitive and automatic, or whether it involves higher-level voluntary cognitive processing.

The technique that Darwin et al. used does not enable us to answer this question. There is, however, an alternative technique involving obligatory streaming, i.e. streaming that cannot be suppressed. Obligatory streaming reflects purely automatic processes that seem to take place in the auditory periphery (Pressnitzer et al., 2008). Obligatory streaming has been studied for a difference in GPR only (Gaudrain et al., 2007), and for a difference in VTL only (Tsuzaki et al., 2007), but never for both dimensions together. The aim of the present study is to evaluate whether the perceptual analogues of GPR and VTL interact in the automatic analysis of auditory scenes.

This interaction would have many implications:

  1. GPR is coded in a temporal profile that comes from the temporal analysis of the signal, whereas VTL is coded in a spectral profile that comes from the tonotopic distribution of activity, often called the excitation pattern. An interaction of GPR and VTL at a primitive level would therefore mean an interaction between the result of a temporal analysis and that of a spectral analysis. The dual profile of the stimuli will be used to represent these two profiles in a single graph.
  2. GPR and VTL are used to build an estimate of the size of the speaker. If streaming can be shown to be based on a speaker-size judgement, it would mean that size estimation is fully automatic.
  3. Alternatively, since the physical dimension underlying speaker size is the acoustic scale (of the source), it is possible that, at a primitive level, acoustic scale is more important than speaker size.


Acoustic scale and size judgement

For the purpose of the experiment, four speakers have to be defined in each condition. Two rules have been used to define these speakers: one based on acoustic scale, the other on subjective size judgement. The four speakers define a rectangle in the GPR-VTL plane. In the current experiments, pairs of speakers were chosen only from the ends of the diagonals.

Acoustic Scale condition

Figure 1. Acoustic Scale condition. Left panel: definition of the Matched and Unmatched conditions in the GPR-VTL plane. The blue numbers represent the size judgement obtained by Smith and Patterson (2005). Right panel: dual profiles for a /la/ in the Matched (top) and Unmatched (bottom) conditions.

The differences in GPR and VTL can be made the same when converted into semitones, which results in a consistent change in acoustic scale. This condition is described in Figure 1. The VTL difference can be converted to semitones using Formula 1:

\mathrm{VTL\ in\ semitones} = 12\times\log_2\left(\frac{\mathrm{VTL}_0}{\mathrm{VTL}}\right) \qquad (1)

where VTL0 is the VTL of the original voice (15.54 cm). The GPR in semitones is calculated re 120 Hz, the average GPR of the original voice. The original voice is the voice of the speaker who recorded the syllables.
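For concreteness, Formula 1 and its GPR counterpart can be sketched in Python (a minimal sketch; the function names are ours, and the reference values are the ones given above):

```python
import math

GPR0_HZ = 120.0    # mean GPR of the original voice
VTL0_CM = 15.54    # VTL of the original voice

def gpr_in_semitones(gpr_hz):
    """GPR re the original speaker, in semitones."""
    return 12.0 * math.log2(gpr_hz / GPR0_HZ)

def vtl_in_semitones(vtl_cm):
    """VTL re the original speaker, in semitones (Formula 1).
    Note the inverted ratio: a shorter VTL raises the formants,
    so it counts as a positive shift, like a higher GPR."""
    return 12.0 * math.log2(VTL0_CM / vtl_cm)

# The 3-semitone Acoustic Scale row of Table 1:
print(round(gpr_in_semitones(142.0), 1))  # -> 2.9
print(round(vtl_in_semitones(13.1), 1))   # -> 3.0
```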

When both GPR and VTL are changed by the same amount in the same direction, then the scale of the spectral fine structure (GPR) and the scale of the spectral envelope (VTL) move in the same direction. This condition is called Matched. It simulates a consistent change of all the acoustical dimensions of the source. When the GPR and VTL are changed by the same amount, but in opposite directions, then the displacements of the spectral fine structure and of the spectral envelope do not add to form a consistent overall change in acoustic scale. This condition is called Unmatched.
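The geometry of the Matched and Unmatched diagonals can be illustrated with a short Python sketch (our own illustration; the corner values are taken from the 6-semitone row of Table 1):

```python
import math

GPR0_HZ, VTL0_CM = 120.0, 15.54   # original speaker

def gpr_sem(gpr_hz):
    return 12.0 * math.log2(gpr_hz / GPR0_HZ)

def vtl_sem(vtl_cm):
    # Inverted ratio, per Formula 1: a shorter VTL is a positive shift.
    return 12.0 * math.log2(VTL0_CM / vtl_cm)

# Corners of the 6-semitone rectangle in the GPR-VTL plane.
low_gpr_long, low_gpr_short = (120.0, 15.5), (120.0, 11.0)
high_gpr_long, high_gpr_short = (169.0, 15.5), (169.0, 11.0)

# Matched diagonal: GPR and VTL shift in the same direction, giving a
# consistent overall change in acoustic scale ("smaller" speaker).
# Unmatched diagonal: the two shifts go in opposite directions.
for name, (s1, s2) in [("Matched", (low_gpr_long, high_gpr_short)),
                       ("Unmatched", (low_gpr_short, high_gpr_long))]:
    d_gpr = gpr_sem(s2[0]) - gpr_sem(s1[0])
    d_vtl = vtl_sem(s2[1]) - vtl_sem(s1[1])
    print(name, round(d_gpr, 1), round(d_vtl, 1))
# Matched: both shifts ~ +5.9 semitones; Unmatched: +5.9 and -5.9.
```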

Size Judgement condition

Figure 2. Size Judgement condition. Again the blue numbers are the size judgement derived from the data of Smith and Patterson (2005). The solid black line represents the combination of GPR and VTL that yield the same size judgement of 4.80. The purple and orange diagonals represent the Unmatched and Matched conditions respectively. Both axes, GPR and VTL, are represented in semitones re the original speaker (GPR=120 Hz, VTL=15.54 cm).

Smith and Patterson (2005) evaluated the perceived size of vowels for a wide range of GPR-VTL combinations. Subjects were asked to judge the size of the speaker on a seven-step scale ranging from very short to very tall. A two-dimensional polynomial fit was used to derive a size-judgement value for any combination of GPR and VTL in the tested range. The Size Judgement condition is based on iso-size-judgement contours, i.e. pairs of GPR and VTL values that yield the same size judgement, as illustrated in Figure 2. When two voices are chosen on the same iso-contour, they do not differ in perceived size, and the change in VTL is constrained by the change in GPR. In that case, the change in VTL is compensated by a change in GPR in the opposite direction, and this constitutes an Unmatched condition. If the same differences in VTL and GPR are applied in the same direction to produce a Matched condition, then the two speakers differ in size. Two conditions, A and B, have been tested. Condition A takes advantage of the curvature of the 4.80 iso-contour to obtain a larger VTL difference. Condition B is designed to be centred on the average-speaker curve. The two conditions are described in Figure 3.
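The idea of deriving an iso-size-judgement contour from a fitted surface can be sketched as follows. The grid values below are purely illustrative placeholders, not the actual Smith and Patterson (2005) data, and the fit is a generic least-squares sketch, not necessarily the procedure used in the study:

```python
import numpy as np

# Illustrative size judgements on a (GPR, VTL) grid, in semitones re the
# original speaker -- placeholder values, not the actual data.
gpr  = np.array([-6.0, 0.0, 6.0, -6.0, 0.0, 6.0, -6.0, 0.0, 6.0])
vtl  = np.array([-6.0, -6.0, -6.0, 0.0, 0.0, 0.0, 6.0, 6.0, 6.0])
size = np.array([ 5.9, 5.6, 5.2,  5.3, 4.8, 4.3,  4.4, 4.0, 3.6])

# Second-order two-dimensional polynomial, fitted by least squares:
# size ~ a + b*g + c*v + d*g^2 + e*g*v + f*v^2
A = np.column_stack([np.ones_like(gpr), gpr, vtl, gpr**2, gpr*vtl, vtl**2])
coef, *_ = np.linalg.lstsq(A, size, rcond=None)

def predicted_size(g, v):
    return float(np.dot(coef, [1.0, g, v, g * g, g * v, v * v]))

# An iso-size contour: for each GPR, find the VTL that gives the target
# size (coarse scan). Pairs on the contour share the same perceived size.
target = 4.80
vs = np.linspace(-6.0, 6.0, 2401)
def vtl_on_contour(g):
    return float(vs[np.argmin([abs(predicted_size(g, v) - target)
                               for v in vs])])
```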

Table 1 gathers all the VTL and GPR values, perceived size values, and differences in VTL, GPR and perceived size for all the conditions described above.

Table 1. Values of the various parameters of the speakers involved in each condition.
Sz stands for Size-judgement, M for Matched and U for Unmatched.
Condition                   | GPR1   | GPR2   | ΔGPR     | VTL1     | VTL2     | ΔVTL     | Sz1 M | Sz2 M | ΔSz M | Sz1 U | Sz2 U | ΔSz U
Acoustic Scale, 3 semitones | 120 Hz | 142 Hz | 3 sem.   | 15.5 cm  | 13.1 cm  | 3 sem.   | 4.81  | 4.14  | 0.67  | 4.15  | 4.76  | 0.61
Acoustic Scale, 6 semitones | 120 Hz | 169 Hz | 6 sem.   | 15.5 cm  | 11.0 cm  | 6 sem.   | 4.81  | 3.39  | 1.42  | 3.34  | 4.73  | 1.39
Size Judgement, A           | 180 Hz | 320 Hz | 10 sem.  | 20.4 cm  | 16.0 cm  | 4.2 sem. | 5.20  | 4.58  | 0.62  | 4.80  | 4.80  | 0.00
Size Judgement, B           | 67 Hz  | 101 Hz | 7.1 sem. | 17.56 cm | 15.85 cm | 1.9 sem. | 5.62  | 4.94  | 0.68  | 5.25  | 5.25  | 0.00
Figure 3. Size judgement iso-contours, positions of the speakers of the Size Judgement conditions (A and B) and average speaker curve from Peterson and Barney (1952). The green dot represents the original speaker.

Streaming paradigms

Delay detection paradigm

Figure 4. Delay detection paradigm. Upper panel: Integrated percept. Lower panel: Segregated percept. See text for details. Adapted from Roberts et al. (2002).

The delay detection paradigm was introduced by Roberts et al. (2002). The stimulus is a sequence of alternating tones. The beginning of each sequence is isochronous, i.e. the tones emitted by S2 (the upper stream in Figure 4) are centred between the tones emitted by S1. In the present case, the tones are syllables, and each syllable is preceded and followed by 100 ms of silence in the isochronous portion. Then comes a transition portion in which the tones from S2 are progressively delayed, up to a third portion where the S2 tones are constantly delayed by dT. When the percept is integrated, i.e. the tones from S1 and S2 are fused into a single auditory stream, the delay of the S2 tone is relatively easy to detect, since it is judged relative to the previous S1 tone. But when the percept is segregated, detection of the delay becomes harder, because the time reference is the previous S2 tone. Measuring the detection threshold for this delay provides an objective evaluation of the state of segregation: larger thresholds mean stronger streaming.

The 24 syllables of each sequence are randomly chosen from a set of 50 (5 vowels: /a, e, i, o, u/ × 10 consonants: /b, d, f, g, h, k, l, m, n, p/). The stimuli are presented in a 3-down-1-up, 2I2AFC procedure to determine the dT that yields 79%-correct on the psychometric function, defined as the detection threshold. The procedure of Roberts et al. (2002) has been adapted to syllables. Notably, the duration of the items has been substantially increased (180 ms), as has the duration of the silence between items (100 ms). This paradigm has been used for the Acoustic Scale condition only.
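The 3-down-1-up track can be sketched generically in Python (our own illustration: the simulated listener and the step size are arbitrary assumptions, not the parameters of the experiment). Three correct responses make the task harder, one error makes it easier, and the track converges near the 79%-correct point:

```python
import random

def staircase_3down1up(respond, start, step, n_reversals=8):
    """Adaptive 3-down-1-up track; `respond(level)` returns True if the
    (simulated) listener detects the delay at this level."""
    level, run, direction, reversals = start, 0, 0, []
    while len(reversals) < n_reversals:
        if respond(level):
            run += 1
            if run == 3:                 # 3 correct -> decrease the delay
                run = 0
                if direction == +1:      # was going up -> reversal
                    reversals.append(level)
                direction = -1
                level = max(level - step, 0.0)
        else:                            # 1 wrong -> increase the delay
            run = 0
            if direction == -1:          # was going down -> reversal
                reversals.append(level)
            direction = +1
            level = level + step
    return sum(reversals[2:]) / len(reversals[2:])  # mean of late reversals

# Arbitrary simulated listener: detection improves with the delay dT (ms).
random.seed(0)
listener = lambda dt: random.random() < min(0.99, 0.5 + dt / 40.0)
threshold = staircase_3down1up(listener, start=40.0, step=4.0)
```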

A possible issue with the adaptation to speech: the onset of a syllable varies with the consonant and is generally less clearly defined than that of a pure or complex tone. This is why the silence between items was increased. The problem is somewhat attenuated by the fact that the judgement is made over a number of pairs involving different consonants. However, extra care might be taken when building the sequences, considering the importance of the last consonants.

Repeated syllable paradigm

Figure 5. Repeated syllable paradigm. The sequence is composed of random syllables (alternating speakers) with no repetition except once in the Repetition portion. This repetition is perceived only if the percept is integrated.

This paradigm, presented in Figure 5, was inspired by Christophe Micheyl. A sequence of syllables, pronounced alternately by two speakers S1 and S2, contains one, and only one, repetition. The repetition is across the speakers, i.e. the repeated syllable is pronounced by one speaker and, just after, by the other speaker. If the percept is integrated, i.e. there is no streaming, the repetition is obvious. If, on the contrary, the percept is segregated, the repetition is fairly hard, or even impossible, to detect. The task is then simply to report the repeated syllable. The parameters used for this specific experiment are reported in Figure 5.
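The construction of such a sequence can be sketched in Python (our own illustration; only the constraints described above, i.e. alternating speakers, the 50-syllable set, and a single cross-speaker repetition, are taken from the text):

```python
import random

# The 50 syllables: 10 consonants x 5 vowels.
SYLLABLES = [c + v for c in "bdfghklmnp" for v in "aeiou"]

def make_sequence(n_items=24, seed=None):
    """Alternating-speaker sequence with exactly one immediate repetition,
    which necessarily falls across the two speakers."""
    rng = random.Random(seed)
    syls = rng.sample(SYLLABLES, n_items)  # all distinct so far
    pos = rng.randrange(2, n_items - 2)    # keep the repeat off the edges
    syls[pos + 1] = syls[pos]              # the one and only repetition
    # Even/odd positions alternate between S1 and S2, so the repeated
    # syllable is pronounced once by each speaker.
    speakers = ["S1" if i % 2 == 0 else "S2" for i in range(n_items)]
    return list(zip(speakers, syls)), pos

seq, pos = make_sequence(seed=42)
assert seq[pos][1] == seq[pos + 1][1]   # same syllable...
assert seq[pos][0] != seq[pos + 1][0]   # ...different speakers
```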


Delay detection and Acoustic Scale

Figure 6. Results from the Acoustic Scale condition in the Delay Detection paradigm (average for 2 subjects). The orange bars represent the Matched condition, the purple bars the Unmatched condition. The orange and purple stripes mark where the Matched and Unmatched conditions cannot be distinguished. The abscissa displays the difference in GPR and VTL, in semitones. The ordinate displays the delay detection threshold in milliseconds. Larger values indicate stronger streaming.

The results are presented in Figure 6. The paradigm effectively produced a difference between the control condition (no difference in GPR or VTL) and the other conditions (3 and 6 semitones). This difference is due to the fact that segregation occurs in the latter conditions. Roberts et al. (2002) found thresholds about 2.5 times larger than in the control condition when streaming occurred, and we find here thresholds about 2 times larger than in the control condition. We have more build-up time, but the tempo is much slower, and time judgement may be hindered by smooth syllable onsets.

Given the current number of subjects and repetitions, no statistical analysis can be performed. These preliminary results seem to indicate that there is no difference between the Matched and Unmatched conditions. This could be due to some saturation effect, as suggested by the fact that there is roughly no difference between the 3-semitone and 6-semitone conditions. However, the upper limit for the threshold is 100 ms, and the longest threshold measured here is smaller than 40 ms. A saturation effect can therefore very probably be ruled out (except if there is a second-order effect due to the fact that the tempo is changed locally when a delay is introduced, and this local change could affect streaming). This would suggest that the acoustic scale of the whole source is not used in streaming, and that the temporal-profile and spectral-profile contributions to streaming remain independent.

However, it is not clear what the acoustic scale of the source becomes in the Unmatched condition. Also, it is possible that the auditory system is more accustomed to analysing the human voice, and therefore to the particular combinations of GPR and VTL that are plausible for such sources. As displayed in Table 1, the perceived size differences in the Matched (0.67 or 1.42) and Unmatched (0.61 or 1.39) conditions are practically the same. This means that if streaming can be based on a perceived speaker-size difference, the amount of streaming induced here should be the same in the Matched and Unmatched conditions.

Repeated syllable and Size Judgement

As described above, two conditions were used here. The results for the two conditions are plotted in Figure 7. The first bar of the left panel of Figure 7 shows the identification performance for the repeated syllable when the two speakers are both identical to the original speaker: 100%-correct. The results in condition A are much lower (around 25%), showing that it becomes very difficult to perceive the repeated syllable when the speakers are very different. However, these performances are too close to chance level: any result in this region could be due to a flooring effect preventing any difference between the Matched and Unmatched conditions from being observed. Condition B was therefore designed to reduce the effect of streaming and obtain results in the mid range, where sensitivity should be maximal. Moreover, the voices used in condition A were all odd voices, as indicated by their distance from the "normal speaker" curve in Figure 3.

Figure 7. Results from the Size Judgement condition in the Repeated Syllable paradigm. As before, the orange bars are for the Matched condition, which therefore has a perceived-size difference. The purple bars are for the Unmatched condition, where the speakers were picked from an iso-contour of the size-judgement surface. The dashed line in the two panels indicates chance level. Left panel: results in condition A (see Figure 3). The first bar, labelled "Same", is the result when the two speakers are the same and are the original voice. This panel reports the average of 50 presentations in each condition. Right panel: results in condition B. This panel reports the average of 100 presentations in each condition.

In condition B, the results are between 60% and 80%, which might still be a bit too high. However, the results now show a difference between the Matched and Unmatched conditions: there seems to be more streaming in the Unmatched condition than in the Matched condition. This somewhat unexpected result has at least two possible explanations:

  1. Streaming could be based preferentially on "voice weirdness" rather than on size. The two voices with the same perceived size are further from the "normal-speaker" curve than the speakers of the Matched condition.
  2. The identification performances may differ between the two conditions. Smith et al. (2005) measured recognition performance for long (>500 ms) isolated vowels. The interpolated recognition scores for the 4 speakers used in condition B (see Figure 3) are: U1 98.5%, U2 95.5%, M1 97.5%, M2 97.0%. All these results are very close to 100% and, on average, there is not much of a difference between the Matched (97.3%) and Unmatched (97.0%) conditions, so one might not expect any influence on our results. However, this recognition performance was measured for isolated vowels, i.e. without distraction, and with long vowels. The outcome could be different for short syllables in a sequence, and the very slight difference observed here could become significant.

To test the latter hypothesis, the same Repeated Syllable paradigm was used in four conditions. In each condition, the two speakers were the same, so that no streaming was induced by a speaker difference. The 4 speakers described in Figure 3 for condition B were used: U1, U2, M1, M2. The results are displayed in the left panel of Figure 8.

The average score is 100% for only one speaker: M2. The other speakers yield scores around 80%-correct, markedly below the scores found by Smith et al. (2005). The right panel of Figure 8 shows the scores in the streaming test normalised by these identification results. No difference between the Matched and Unmatched conditions remains after normalisation. Identification difficulty thus seems to account for most of the difference shown in Figure 7.

Figure 8. Left panel: Identification score for the 4 speakers U1, U2, M1 and M2. Right panel: normalised streaming score. Orange bars are for the Matched condition, and purple bars are for the Unmatched condition.
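Such a normalisation can be illustrated with a toy computation. The numbers below are illustrative placeholders, not the measured data, and the normalisation rule (dividing the raw detection rate by the probability that both tokens of the repeated pair are identified) is one plausible choice, not necessarily the one applied here:

```python
# Illustrative per-speaker identification scores (placeholder values,
# not the Figure 8 data).
id_score = {"U1": 0.80, "U2": 0.80, "M1": 0.80, "M2": 1.00}

def normalised(raw, spk_a, spk_b):
    """Divide the raw repeated-syllable detection rate by the probability
    that both tokens of the repeated pair were correctly identified."""
    return raw / (id_score[spk_a] * id_score[spk_b])

# With these toy numbers, an apparent Matched/Unmatched difference in the
# raw scores disappears after normalisation:
print(round(normalised(0.60, "U1", "U2"), 2))   # -> 0.94
print(round(normalised(0.75, "M1", "M2"), 2))   # -> 0.94
```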


This section describes the pitfalls of the current pre-experiment and proposals to avoid them.


We want to thank David R. Smith for kindly providing the data presented in Smith and Patterson (2005) and Smith et al. (2005).
