Effects of voicing in the recognition of concurrent syllables

From CNBH Acoustic Scale Wiki


The text and figures that appear on this page were subsequently published in:

Vestergaard, M.D. and Patterson, R.D. (2009). “Effects of voicing in the recognition of concurrent syllables.” J. Acoust. Soc. Am., 126, p.2860-3.

This letter reports a study designed to measure the value of voicing in the recognition of concurrent syllables. Competing syllables for both the target speaker and the distracter were either voiced or whispered in all four vocal combinations. The results show that listeners use voicing whenever it is present either to detect a target syllable or to reject a distracter. When the predictable effects of audibility were taken into account, limited evidence remained for the harmonic cancellation mechanism thought to make rejecting distracter syllables more effective than enhancing target syllables.

PACS numbers: 43.71.Bp, 43.71.An, 43.66.Ba, 43.72.Qr



Voice pitch is a highly salient cue that assists listeners trying to attend selectively to one speaker in an environment with competing voices (e.g., Darwin et al., 2003; Chalikia and Bregman, 1993; Qin and Oxenham, 2005; Assmann and Summerfield, 1990; Assmann and Summerfield, 1994; Culling and Darwin, 1993; Vestergaard et al., 2009). The pitch of a voice is determined by the glottal pulse rate (GPR), which is commonly specified in terms of the fundamental frequency (F0) of the harmonic series associated with the GPR (Fant, 1970). Segregation of competing voices is also facilitated by differences in vocal tract length (VTL) (Darwin et al., 2003; Vestergaard et al., 2009). VTL determines the acoustic scale of the vocal resonances in a speech sound (Patterson et al., 2008), and this, in turn, determines what is referred to as “formant dispersion” (Fitch, 1997; Fitch, 2000). Recently, Vestergaard et al. (2009) have measured the interaction of GPR, VTL and audibility in the recognition of concurrent syllables using a paradigm that allowed them to control temporal glimpsing and the idiosyncrasies of individual speakers. They showed that at 0-dB signal-to-noise ratio (SNR) a two-semitone (ST) difference in GPR produced the same performance advantage as a 20% difference in VTL. This letter reports an extension of Vestergaard et al. (2009) designed to measure the value of voicing in the recognition of concurrent syllables. It has been proposed that the benefit of a pitch difference in the recognition of concurrent speech is that it helps the listener to reject sounds that fit the harmonic structure of the distracting voice (cancellation theory) rather than assisting the listener in selecting sounds that fit the harmonic structure of the target voice (enhancement theory). 
In a series of papers involving double-vowel experiments, de Cheveigné and colleagues developed a harmonic cancellation model tuned to the periodicity of the distracter (de Cheveigné et al., 1997b; de Cheveigné et al., 1997a; de Cheveigné, 1997; de Cheveigné, 1993). They showed that the advantage of a pitch difference depends primarily on the harmonicity of the distracter. In order to evaluate the feasibility of the cancellation theory for connected syllables, we employed the paradigm described above (Vestergaard et al., 2009) for voiced and whispered syllables. In whispered register, the speech signals do not provide any acoustic cues to the harmonicity that characterizes speech in voiced phonation (Abercrombie, 1967). Whispered phonation is produced by allowing turbulent air to flow through partially open glottal folds. This reduces the gradient of airflow velocities and results in a noise-like excitation of the vocal tract resonances (the formants). In naturally produced speech, whispered syllables are elongated (Schwartz, 1967) and air consumption is dramatically increased (Schwartz, 1968a). Compared to voiced speech, whispered speech is typically 15-20 dB softer (Traunmüller and Eriksson, 2000) and it has a spectral tilt of approximately 6 dB/oct. (Schwartz, 1970). Combined, these two features reduce the perceptibility of whispered speech, although it remains relatively robust. Indeed, whispered speech can convey much of the information that voiced speech can convey. Tartter (1991) found that the intelligibility of whispered vowels was 82%, only 10% lower than for voiced vowels. The recognition of whispered consonants was 64%, well above chance for the 18 consonants in their experiment (Tartter, 1989). Listeners were also able to identify speaker sex given isolated productions of whispered vowels (Schwartz and Rine, 1968), and even when presented with isolated voiceless fricatives (Schwartz, 1968b). 
Lass et al. (1976) reported that the recognition of speaker sex dropped from 96% correct for voiced speech to 75% correct for whispered speech. Tartter and Braun (1994) have shown that listeners can accurately distinguish “frowned”, neutral and “happy” speech in both voiced and whispered register. Thus, while the purpose of whispering is to reduce audibility, whispered speech remains highly functional in other respects. When compelled to evaluate the pitch of whispered vowels [sic], listeners matched the frequency of a pure tone with the second formant of the vowels (Thomas, 1969). Thus, in the absence of temporally defined pitch, listeners revert to spectral pitch as described by Schneider et al. (2005). We can therefore investigate the functional role of harmonicity in the segregation of concurrent syllables by removing the temporal regularity of voiced speech samples and applying a spectral lift, thus simulating whispered speech. Audibility was varied by testing at different SNRs, which in turn also enabled us to evaluate the predictable effects of the spectral differences between voiced and whispered speech. We hypothesize that performance in a syllable recognition task will improve whenever the auditory system can make use of voicing, either to detect a target or to reject a competing distracter. Moreover, if the mechanism of cancellation is more effective than the enhancement mechanism, listeners will be more successful at using voicing to reject a distracter than to detect a target.


The participants were required to identify syllables spoken by a target voice in the presence of a distracting voice. Performance was measured for target and distracter voices that were voiced and whispered in all combinations, for a broad range of signal-to-noise ratios (SNR).


Eight listeners (19 – 21 yrs; 3 male) participated in the study. After informed consent was obtained from the participants, an audiogram was recorded at standard audiometric frequencies to ensure that they had normal hearing. The experimental protocol was approved by the Cambridge Psychology Research Ethics Committee (CPREC).


The study consisted of two parts: (1) pre-experimental training, and (2) the main experiment on voiced and whispered concurrent syllables. The procedure was the same in both: the syllables were presented in triplets to promote perception of the stimuli as a phrase of connected speech as previously described by Vestergaard et al. (2009). The listeners responded by clicking on an orthographical representation of their answer from a response matrix on a computer screen. They were seated in front of the response screen in an IAC double-walled, sound-attenuated booth, and the stimuli were presented via AKG K240DF headphones.


The stimuli were taken from the CNBH syllable corpus previously described by Ives et al. (2005). It consists of 180 spoken syllables, divided into consonant-vowel (CV) and vowel-consonant (VC) pairs. There were 18 consonants, 6 in each of 3 categories (plosives, sonorants and fricatives), and each of the consonants was paired with each of 5 vowels, spoken in both CV and VC combinations (18 × 5 × 2 = 180). The syllables were analyzed and re-synthesized with the STRAIGHT vocoder (Kawahara and Irino, 2004) to simulate voices with different combinations of GPR and VTL. To simulate whispered speech, the STRAIGHT spectrograms were excited with broadband noise and high-pass filtered at 6 dB/oct. This procedure removes pitch from the voiced part of the syllables and creates an effective simulation of whispered speech. The GPR and VTL values used for the target and distracter syllables are shown in Table I. The target voice sounds like a tall male speaker, while the distracting voice sounds like a female of normal height. The difference in VTL (18.2/13.9) is several JNDs for the discrimination of resonance scale for syllables (Ives et al., 2005), so even when voicing was removed from both the target and distracter, it was easy to hear the difference between the two (Vestergaard et al., 2005). In order to reduce temporal glimpsing, target and distracter syllables were paired according to their phonetic specification (Vestergaard et al., 2009). Throughout the experiment the target syllables were presented at 60 dB SPL, and the SNR took the values -15, -9, -3, 0, 3, 9 and 15 dB.
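The 6 dB/oct high-pass tilt described above can be illustrated with a first-order difference filter, whose gain roughly doubles per octave at frequencies well below the Nyquist frequency. This is only a minimal sketch. The study used STRAIGHT resynthesis, which is not reproduced here, and the sample rate, the choice of filter and all function names are assumptions made for illustration.

```python
import math
import random

FS = 48000  # sample rate in Hz (assumed; not stated in the text)

def white_noise(n, seed=0):
    """Gaussian white noise standing in for the broadband excitation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def first_difference(x):
    """y[n] = x[n] - x[n-1]: a simple high-pass with a slope near 6 dB/oct."""
    return [x[0]] + [x[i] - x[i - 1] for i in range(1, len(x))]

def gain_db(freq_hz, fs=FS):
    """Magnitude response of the first-difference filter, in dB."""
    return 20.0 * math.log10(2.0 * math.sin(math.pi * freq_hz / fs))

tilted = first_difference(white_noise(FS))  # one second of tilted noise
slope = gain_db(1000.0) - gain_db(500.0)    # gain change over one octave
print(f"{slope:.2f} dB per octave between 500 Hz and 1 kHz")
```

The slope flattens towards the Nyquist frequency, so a filter of this kind only approximates a constant 6 dB/oct lift over the lower part of the spectrum.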

Table I. Vocal characteristics of the competing voices. A GPR value of zero indicates whispered speech.

            Target voice           Distracter voice
            voiced    whispered    voiced    whispered
GPR (Hz)    156.7     0            202.7     0
VTL (cm)    18.2      18.2         13.9      13.9
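To relate the values in Table I to the units used in the introduction (semitones for GPR, percent for VTL), the differences between the two voices can be computed as below. This is a minimal sketch: the helper names are hypothetical, and expressing the VTL difference as the ratio minus one is an assumed convention.

```python
import math

# Values taken from Table I.
TARGET = {"gpr_hz": 156.7, "vtl_cm": 18.2}
DISTRACTER = {"gpr_hz": 202.7, "vtl_cm": 13.9}

def semitones(f_low, f_high):
    """Musical interval between two rates, in semitones (12 per octave)."""
    return 12.0 * math.log2(f_high / f_low)

def percent_difference(small, large):
    """Relative difference of the larger value with respect to the smaller."""
    return 100.0 * (large / small - 1.0)

gpr_st = semitones(TARGET["gpr_hz"], DISTRACTER["gpr_hz"])
vtl_pct = percent_difference(DISTRACTER["vtl_cm"], TARGET["vtl_cm"])
print(f"GPR difference: {gpr_st:.2f} ST, VTL difference: {vtl_pct:.1f}%")
```

Under these conventions the two voices differ by about 4.5 ST in GPR and about 31% in VTL.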

Pre-experimental training

The first training session comprised 15 runs with visual feedback. Each run was limited to a subset of the syllable database in order to gradually introduce the stimuli and their orthography. Then followed 380 trials without distracters in which the target syllable was in either interval 2 or interval 3 (for details on the training regime see Vestergaard et al., 2009). Then, the same procedure was repeated for whispered syllables. In another training session, the distracters were introduced with the SNR starting at 90 dB, and then gradually decreased to -15 dB over trials. In this session, the target syllable was in either interval 2 or 3, and both voiced and whispered targets and distracters were used. There were 8 runs of 40 trials in this session with visual feedback. During training, performance criteria were used to ensure that the listener could perform the task for each subset of syllables before proceeding to the next stage. If a listener did not meet the criterion on a particular run, it was repeated until performance reached criterion. In total, the listeners completed at least 1260 trials before commencing the main experiment, by which time they were very familiar with the procedure and the response interface.

Main experiment

In the main experiment, recognition performance was measured for the target voice in a 2 × 2 × 7 factorial design (i.e., 28 conditions). The target voice was either voiced or whispered, and the distracter was either voiced or whispered. Each pair of target and distracter voices was measured at 7 SNRs [-15, -9, -3, 0, 3, 9, 15 dB]. The trials were blocked in runs of 40 trials within which the voice combination and the SNR were constant. Between runs the condition was randomly chosen from the full set of the 28 conditions. Each condition was repeated three times, so in total the main experiment comprised 3360 trials of target syllables masked by distracter syllables. To increase the sensitivity of the experiment to the variation in voicing, the task was made slightly more difficult by playing the target syllable in either interval 2 or 3. A visual indicator marked the interval to which the listener should respond (for details of the rationale of this paradigm, see Vestergaard et al., 2009).
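The design arithmetic above (2 × 2 × 7 conditions, three repetitions, 40 trials per run) can be checked by enumerating the conditions. The sketch below is illustrative only: shuffling the full list of 84 runs is one assumed way to randomize the order of conditions, not necessarily the procedure used in the study, and the names are hypothetical.

```python
import itertools
import random

VOICING = ("voiced", "whispered")
SNRS_DB = (-15, -9, -3, 0, 3, 9, 15)
REPEATS = 3          # each condition is run three times
TRIALS_PER_RUN = 40  # condition is held constant within a run

def build_run_order(seed=None):
    """Return the 28 conditions and a shuffled list of all runs."""
    # Each condition is (target voicing, distracter voicing, SNR in dB).
    conditions = list(itertools.product(VOICING, VOICING, SNRS_DB))
    runs = conditions * REPEATS
    random.Random(seed).shuffle(runs)  # assumed randomization scheme
    return conditions, runs

conditions, runs = build_run_order(seed=1)
total_trials = len(runs) * TRIALS_PER_RUN
print(len(conditions), len(runs), total_trials)
```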


Five types of scores were considered: syllable recognition (the primary task); consonant recognition; vowel recognition (for these two, only the consonant or the vowel, respectively, has to be correct); consonant-vowel order; and manner (i.e., consonant category). The average syllable, consonant and vowel recognition scores for the four vocal conditions are shown as a function of SNR in the top panels of Figure 1. They show the expected effect of SNR and some notable effects of voicing. To control for the predictable effect of audibility caused by the difference in spectrum between voiced and whispered speech, the following analysis was run: For each trial, an audibility index (the Speech Intelligibility Index (SII), ANSI, 1997) was calculated by deriving the spectrum levels for the target and distracter syllables. The distracter’s spectrum levels were used as the masker when estimating the audibility of the targets. An importance function for English nonsense syllables was used according to the ANSI standard. A transfer function (Sherbecoe and Studebaker, 1990) was then fitted to the data. This analysis allows for a data-driven, audibility-controlled prediction of recognition performance. The results of this transform are shown in the bottom row of Figure 1. They show the effects of voicing once audibility has been taken into account.
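The core of an SII-style audibility index, with the distracter treated as the masker, can be sketched as an importance-weighted sum of band audibilities, where each band contributes in proportion to its speech-to-masker level over a 30-dB range. This is a deliberately simplified illustration, not the full ANSI (1997) procedure. It omits hearing thresholds, spread of masking and the level distortion factor, and the uniform importance weights and band levels used in the example are placeholders rather than the nonsense-syllable importance function of the standard.

```python
def band_audibility(speech_db, masker_db):
    """Audibility of one band: 0 at -15 dB SNR, rising linearly to 1 at +15 dB."""
    return min(1.0, max(0.0, (speech_db - masker_db + 15.0) / 30.0))

def simple_sii(speech_levels, masker_levels, importance):
    """Importance-weighted mean of band audibilities (simplified SII)."""
    if not (len(speech_levels) == len(masker_levels) == len(importance)):
        raise ValueError("all inputs must have one value per band")
    total = float(sum(importance))  # normalize so the weights sum to 1
    return sum(
        (w / total) * band_audibility(s, m)
        for s, m, w in zip(speech_levels, masker_levels, importance)
    )

# Hypothetical spectrum levels (dB) in four bands, uniform importance.
target = [55.0, 50.0, 45.0, 40.0]      # e.g. a target syllable
distracter = [50.0, 50.0, 50.0, 50.0]  # e.g. a distracter syllable
print(round(simple_sii(target, distracter, [1, 1, 1, 1]), 3))
```

Because the index depends on band-by-band levels rather than on overall SNR, two conditions at the same SNR can differ in audibility when the target and distracter spectra differ, as they do for voiced and whispered speech.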

Figure 1. Recognition scores as a function of 1) SNR (top panels) and 2) SII (bottom panels). The left panels (A#) show syllable recognition; the middle panels (B#) show consonant recognition, and the right panels (C#) show vowel recognition. The solid lines show performance for voiced target syllables, and the dashed lines show performance for whispered target syllables. The black lines are for voiced distracters and the dark grey lines are for whispered distracters. In the B panels, the thick light grey curve shows predicted recognition according to the transformation by Sherbecoe and Studebaker (1990). See text for details.

The scores from the experiment and the scores from the prediction described above were converted to rationalized arcsine units (RAU) (Studebaker, 1985; Thornton and Raffin, 1978). The effects of voicing on performance were analyzed by assessing the departures of the observed scores from the prediction. A three-way repeated-measures ANOVA [2 (target) × 2 (distracter) × 7 (SNR)] was run on the prediction mismatch units (observed RAU scores – predicted RAU scores). Greenhouse-Geisser correction was used to compensate for lack of sphericity. For syllable recognition (A panels in Figure 1), there were significant main effects of all three factors: target (F(1,7)=39.9, p=0.001, ηp²=0.84), distracter (F(1,7)=59.6, p<0.001, ηp²=0.89), and SNR (F(6,42)=8.4, p=0.002, ε=0.40, ηp²=0.55). These effects were driven by interactions between target and distracter (F(1,7)=11.8, p=0.011, ηp²=0.63) and between target and SNR (F(6,42)=3.59, p=0.018, ε=0.68, ηp²=0.33). The results of this analysis are shown in Figure 2, which also illustrates the direction of the effects.
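The conversion to rationalized arcsine units and the prediction mismatch described above can be sketched as follows, using the transform given by Studebaker (1985) for X correct responses out of N trials. The function names and the example counts are hypothetical. The predicted score is assumed to be expressed as an expected number of correct responses, which need not be an integer.

```python
import math

def rau(correct, total):
    """Rationalized arcsine transform of a score of `correct` out of `total`."""
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0

def prediction_mismatch(observed_correct, predicted_correct, total):
    """Observed RAU score minus predicted RAU score."""
    return rau(observed_correct, total) - rau(predicted_correct, total)

# Hypothetical run of 40 trials: 26 correct observed, 30.5 predicted.
print(round(prediction_mismatch(26, 30.5, 40), 2))
```

The transform is close to the percentage score in the middle of the range (20 out of 40 maps to 50 RAU) but stretches the scale near floor and ceiling, which makes score differences more comparable across the range of SNRs.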

Figure 2. The prediction mismatch for A) syllable, B) consonant and C) vowel recognition scores in rationalized arcsine units (RAU). The top panels show 1) the interactions of the vocal characteristics with SNR, and the bottom panels show 2) the interaction between target voicing and distracter voicing. The solid lines and hashing are for voiced target syllables, and the dashed lines and hashing are for whispered target syllables. The black lines and hashing are for voiced distracter syllables, and the gray lines and hashing are for whispered distracter syllables.

Recognition of voiced target syllables was above the predicted value and recognition of whispered target syllables was below the predicted value. Prediction accuracy increased with increasing SNR, and this trend was more pronounced for whispered targets and less pronounced for voiced targets. Recognition performance was below the predicted value for whispered distracters, and above the predicted value for voiced distracters, and this effect was entirely driven by the whispered-target condition.


The results of the modeling can be interpreted as follows: The prediction mismatch characterizes the perceptual effects of voicing inasmuch as SII represents the amount of audibility under the given conditions. Overall, voiced syllables were better recognized than whispered syllables, and whispered distracters led to lower recognition performance than voiced distracters. It would appear that the listeners could use voicing to reject a distracter as well as to detect a target. However, the effect for distracters was only present when the target was whispered. Similarly, recognition of voiced syllables was well predicted by the audibility model; for whispered target syllables, recognition was lower than predicted, and more so at low SNRs. The fact that whispered syllables were less intelligible than voiced syllables corroborates previous studies on the perception of whispered speech (Tartter, 1991; Tartter, 1989). The harmonic cancellation model (de Cheveigné et al., 1997b; de Cheveigné et al., 1997a; de Cheveigné, 1997; de Cheveigné, 1993) suggests that the value of voicing should be greater for the distracter than the target. This should lead to an asymmetry in which listeners suffer more from the removal of voicing in the distracter than in the target. The vowel recognition data in Figure 2, panel C2, support this prediction: The difference between the two black bars represents the recognition loss for the removal of voicing in the target, and the difference between the bars with dashed hashing represents the recognition loss for the removal of voicing in the distracter. Since the difference between the black bars is smaller than the difference between the hashed bars, it could be argued that the results are compatible with the cancellation model. However, it is also the case that the large effect of removing voicing in the distracter is only pronounced when the target itself is whispered. 
For voiced target syllables there is hardly any effect of removing voicing in the distracter, possibly because there was already a cue to the target voice provided by its resonance scale. We have previously shown that when there is a sizable difference in VTL between the competing voices, then the benefit of additional cues is diminished (Vestergaard et al., 2009). Overall, while the data do not provide strong evidence for the cancellation hypothesis, they do not rule it out.


The main conclusion of this experiment is that listeners use voicing whenever it is present, either to detect target speech or to reject a distracting signal. Moreover, the study illustrates the importance of controlling for effects of audibility in experiments with voiced and whispered speech. A direct interpretation of the recognition scores shown in the top panels of Figure 1 would have led to an overestimation of the robustness of speech in whispered register. When the predictable effects of audibility are taken into account, the value of voicing is more correctly observed in the bottom panels of Figure 1, which show performance as a function of audibility rather than SNR. To wit, three of the four vocal conditions contained voicing in either target or distracter or both, and they show comparable results when audibility has been taken into account. By contrast, in the condition in which both target and distracter were whispered, performance dropped off progressively, especially below an audibility index (AI) of 0.5. In other words, audibility predicted the identification of the target when one of the concurrent syllables was voiced but led to an overestimation of the recognition of whispered syllables masked by whispered syllables.


The research was supported by the UK Medical Research Council (G0500221, G9900369). We thank James Tanner and Sami Abu-Wardeh for help with collecting the data, and Nick Fyson for assistance in producing the programs that ran the experiments.

