# The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex and age

 The text and figures that appear on this page were subsequently published in: Smith, D.R.R. and Patterson, R.D. (2005). “The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age.” J. Acoust. Soc. Am., 118, p.3177-3186.

Glottal-pulse rate (GPR) and vocal-tract length (VTL) are related to the size, sex and age of the speaker but it is not clear how the two factors combine to influence our perception of speaker size, sex and age. This paper describes experiments designed to measure the effect of the interaction of GPR and VTL upon judgements of speaker size, sex and age. Vowels were scaled to represent people with a wide range of GPRs and VTLs, including many well beyond the normal range of the population, and listeners were asked to judge the size and sex/age of the speaker. The judgements of speaker size show that VTL has a strong influence upon perceived speaker size. The results for the sex and age categorization (man, woman, boy, or girl) show that, for vowels with GPR and VTL values in the normal range, judgements of speaker sex and age are influenced about equally by GPR and VTL. For vowels with abnormal combinations of low GPRs and short VTLs, the VTL information appears to decide the sex/age judgement.

David Smith, Roy Patterson

## Introduction

When the radio or the telephone presents us with a previously unknown speaker, we rapidly develop a distinct impression of whether the speaker is an adult or a child, and if an adult, whether it is a man or a woman. This paper is concerned with the acoustic cues that people use to make these judgements. One highly salient cue is voice pitch; adult men have low pitches, young children have high pitches, and adult women lie in the middle. Pitch is determined by the rate of opening and closing of the vocal folds (glottal-pulse rate, GPR). Another potent cue is vocal-tract length (VTL); large adult men have the longest VTLs, children have the shortest VTLs, and women have intermediate VTLs (Fitch and Giedd, 1999). Differences in VTL lead to shifts in the frequency of the prominent spectral peaks (formants) of speech (Fant, 1970). We have shown that changes in simulated VTL of as little as 7% can be reliably discriminated (Smith, Patterson, Turner, Kawahara and Irino, 2005). It is unclear how the different effects of GPR and VTL are combined to influence the perception of speaker size, sex and age. The purpose of this paper was to measure the interaction of GPR and VTL in judgements of speaker size, and in the categorization of speakers according to sex and age (man, woman, boy or girl). Recently, we have shown that when listeners are given two sequences of four vowels, and the VTL for one sequence is longer than for the other, listeners are capable of discriminating VTL differences of 6-10%, over a wide range of GPR and VTL values (Smith et al., 2005). The experiments used a two-alternative forced-choice (2AFC) discrimination task, which only requires the listener to make a relative size judgement. A second motivation for the present paper was to determine the extent to which listeners can make consistent judgements about speaker size, and consistent judgements about the sex and age of the speaker (man, woman, boy, or girl).

### The interaction of GPR and VTL in judgements of speaker size, sex and age

We wished to determine how GPR and VTL interact in the perception of speaker size. Given the strong correlation of VTL with speaker size, we would expect VTL to have a substantial effect on the perception of speaker size. There is also a correlation between GPR and size, although it is not as strong, and pitch is a highly salient property of a person’s voice. With regard to the perception of speaker sex and age, we wished to determine the combinations of GPR and VTL that are associated with the categories used naturally by people, that is, man, woman, boy and girl. Specifically, we wished to demonstrate that listeners would reliably assign combinations of GPR and VTL found in the normal population to the expected category, and we wished to investigate how they would extend the use of the categories to combinations of GPR and VTL well beyond the range normally encountered. Finally, we wanted to compare the listeners’ speaker-size judgements with their use of the categories man, woman, boy and girl, particularly in the extended region of GPR and VTL values.

## Method

Listeners were presented with isolated vowels scaled over a large range of GPR and VTL values, and were asked to make two judgements about each vowel: the height of the speaker (on a seven-point descriptive rating scale) and the speaker’s natural category (man, woman, boy, or girl).

### Stimuli

Figure 1. The open circles show the GPR and VTL combinations of the stimuli used in the speaker size and sex/age categorization experiments. The circles in the top panel show the “narrower” range of (7x7) sample points (GPRs of 80, 105, 137, 179, 234, 306, and 400 Hz; VTLs of 7.8, 9.3, 11.0, 13.2, 15.7, 18.7, and 22.2 cm). The bottom panel shows the “wider” range (GPRs of 61, 87, 125, 179, 256, 366, and 523 Hz; VTLs of 6.5, 8.2, 10.4, 13.2, 16.7, 21.3, and 26.8 cm). The four ellipses show the normal range of GPR and VTL values in speech for men (M), women (W), boys (B), and girls (G), derived from the data of Peterson and Barney (1952). Each ellipse contains 99% of the individuals from the respective category.

The five English vowels (/a/, /e/, /i/, /o/, /u/) of an adult male (author, RP) were recorded in natural /hVd/ sequences (i.e., haad, hayed, heed, hoed, who’d), using a high-quality microphone (Shure SM58-LCE) and a 44.1-kHz sampling rate. The vowels were sustained (e.g., haaaad) to allow isolation of a stationary vowel component of relatively long duration, which was free of co-articulation with the preceding /h/ and the following /d/. The speaker’s vocal-tract shape determines the vowel type. The speaker’s VTL determines the scale of the resonance, and thus the position of the vowel pattern along the frequency dimension. The scaling of the vowels was performed by STRAIGHT (Kawahara, Masuda-Katsuse and de Cheveigné, 1999; Kawahara and Irino, 2005). This sophisticated speech processing software uses the classical source-filter theory of speech (Dudley, 1939) to segregate GPR information from the spectral-envelope information associated with the shape and length of the vocal tract. Liu and Kewley-Port (2004) have reviewed STRAIGHT and commented favourably on its ability to manipulate formant-related information. STRAIGHT produces a GPR-independent spectral envelope that accurately tracks the motion of the vocal tract throughout the utterance. Once STRAIGHT has segregated a vowel into a GPR contour and a sequence of spectral-envelope frames, the vowel can be resynthesized with the spectral-envelope dimension (frequency) expanded or contracted, and the GPR dimension (time) expanded or contracted. Moreover, the operations are largely independent. Utterances recorded from a man can be transformed to sound like a woman or a child; examples are provided on our web page (see Note 2). The resynthesized utterances are of high quality even when the speech is resynthesized with GPR and VTL values well beyond the normal range of human speech (provided the GPR is not much greater than the frequency of the first formant, cf. Smith et al., 2005). 
STRAIGHT is reviewed in Kawahara and Irino (2005). The scaling of GPR consists of expanding or contracting the time axis of the sequence of glottal events. The scaling of VTL is accomplished by compressing or expanding the spectral envelope of the speech linearly along a linear frequency axis. On a logarithmic frequency axis, the spectral envelope shifts along the axis as a unit. The change in VTL is described by the spectral envelope ratio (SER), that is, the ratio of the unit on the new frequency axis to that of the axis associated with the original recording. Values of SER less than unity indicate lengthening of the vocal tract to simulate larger men, and SERs greater than unity indicate shortening of the vocal tract to simulate smaller men, women and children. The SER values of STRAIGHT can be converted to VTL values by noting that (a) the speaker of our original vowels was of normal height, (b) the VTL of the average-sized adult male is 15.5 cm (cf. Fitch and Giedd, 1999), and (c) formant frequencies are assumed to scale linearly with VTL (Fant, 1970). The data in this study are reported in GPR and VTL units. Following the scaling of GPR and VTL by STRAIGHT, a cosine-squared gating function (10-ms onset, 30-ms offset, 465-ms plateau) was used to select a stationary part of the vowel. The RMS level was set to 0.025 (relative to maximum ±1). The stimuli were played by a 24-bit sound card (Audigy 2, Sound Blaster), through a TDT anti-aliasing filter with a sharp cutoff at 10 kHz and a final attenuator, and presented binaurally to the listener over AKG K240DF headphones. Listeners were seated in a double-walled, IAC, sound-attenuating booth. The sound level of the vowels was 66 dB SPL.
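
The SER-to-VTL conversion described above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions: the original speaker is taken to have the 15.5-cm VTL of an average adult male, and the spectral envelope is taken to scale in inverse proportion to VTL, so that an SER below unity corresponds to a longer tract. The function and constant names are hypothetical and are not part of STRAIGHT.

```python
# Assumed VTL (cm) of the original speaker: the average adult male value
# quoted in the text (cf. Fitch and Giedd, 1999).
REFERENCE_VTL_CM = 15.5


def ser_to_vtl(ser, reference_vtl_cm=REFERENCE_VTL_CM):
    """Convert a spectral envelope ratio (SER) to a simulated VTL in cm.

    Assumes the envelope scales inversely with vocal-tract length, so
    SER < 1 (envelope compressed) means a longer tract.
    """
    if ser <= 0:
        raise ValueError("SER must be positive")
    return reference_vtl_cm / ser


def vtl_to_ser(vtl_cm, reference_vtl_cm=REFERENCE_VTL_CM):
    """Inverse mapping: the SER needed to simulate a given VTL in cm."""
    if vtl_cm <= 0:
        raise ValueError("VTL must be positive")
    return reference_vtl_cm / vtl_cm


if __name__ == "__main__":
    # The two extreme SERs mentioned in Note 5, plus the unscaled vowel.
    for ser in (0.58, 1.0, 2.39):
        print(f"SER {ser:4.2f} -> VTL {ser_to_vtl(ser):5.2f} cm")
```

With the 15.5-cm reference, SERs of 0.58 and 2.39 map to roughly 26.7 cm and 6.5 cm, which is close to the extreme VTLs of the wider stimulus range; the small discrepancy at the long end depends on the exact reference value used.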

### Procedures

The experiments were performed using a single-interval, two-response paradigm. The listener heard a scaled version of one of five stationary English vowels (/a/, /e/, /i/, /o/, /u/), and had to make one judgement about the size of the speaker (very short, short, quite short, average, quite tall, tall, very tall; see Note 3) and a second judgement about the sex/age of the speaker (man, woman, boy, girl). The order in which the judgements were made was left to the listener. Size and sex/age judgements were made by selecting the appropriate button on a response box displayed on a monitor in the booth. The level of the vowel was roved in intensity over a 10 dB range. Since the judgements are subjective there was no feedback. The experiment was performed for two ranges of GPR and VTL values as shown in Fig. 1. The narrower range (Fig. 1a) was chosen to encompass the range of GPR and VTL values encountered in the normal population; GPR varied from 80 to 400 Hz in six logarithmic steps (7 sample points), and VTL ranged from 22.2 cm to 7.8 cm in six logarithmic steps (7 sample points). The four ellipses show estimates of the normal range of GPR and VTL values in speech for men, women, boys and girls, derived from the Peterson and Barney (1952) vowel database. In each case, the ellipse encompasses 99% of the individuals in the Peterson and Barney data for that category of speaker (see Note 4). The wider range (Fig. 1b) was chosen to extend the judgements well beyond the values encountered in everyday speech; GPR varied from 61 to 523 Hz in six logarithmic steps, and VTL ranged from 26.8 cm to 6.5 cm in six logarithmic steps. These VTLs simulate speakers ranging from a small child 0.6-m high (VTL=6.5 cm) to a giant 3.7-m high (VTL=26.8 cm; see Note 5). A run of judgements consisted of one presentation of each GPR-VTL combination for all five vowels, presented in a pseudo-random order (a total of 7 GPRs x 7 VTLs x 5 vowels, or 245 trials). Each run took approximately 30 minutes to complete. 
Each listener contributed a block of five runs to the database for the narrower range of judgements about speaker size and sex/age, and a block of five runs to the database for the wider range of judgements about speaker size and sex/age. The starting range (cf. Fig. 1a or Fig. 1b) was counterbalanced across listeners. The overlap in GPR and VTL values in the two ranges allows an across-condition test of the consistency of size and sex/age judgements. This helps us to see how different ranges of input sounds are mapped onto the available 7-point response scale, and how that mapping is influenced by the frames of reference provided by the two different ranges of GPR and VTL of the vowel sounds. Eight listeners participated in the experiments, three male and five female. They ranged in age from 21 to 39 years. All had normal absolute thresholds at 0.5, 1, 2, 4 and 8 kHz.
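
The stimulus grids and trial counts described above can be checked with a short Python sketch. It assumes that "six logarithmic steps" means seven points with a constant ratio between neighbours; the helper name `log_spaced` and the variable names are hypothetical.

```python
def log_spaced(lo, hi, n_points=7):
    """Return n_points values from lo to hi with a constant ratio between
    neighbours, i.e. equally spaced on a logarithmic axis."""
    ratio = (hi / lo) ** (1.0 / (n_points - 1))
    return [lo * ratio ** k for k in range(n_points)]


VOWELS = ["a", "e", "i", "o", "u"]
N_LISTENERS = 8
RUNS_PER_RANGE = 5

# End points of the two ranges, taken from the text.
narrow_gpr = log_spaced(80, 400)     # Hz
wide_gpr = log_spaced(61, 523)       # Hz
narrow_vtl = log_spaced(7.8, 22.2)   # cm
wide_vtl = log_spaced(6.5, 26.8)     # cm

# One run presents every GPR-VTL combination once for each vowel.
trials_per_run = len(narrow_gpr) * len(narrow_vtl) * len(VOWELS)

# Trials behind each GPR-VTL point once vowels, listeners and runs are pooled.
trials_per_point = len(VOWELS) * N_LISTENERS * RUNS_PER_RANGE

if __name__ == "__main__":
    print("narrower GPRs (Hz):", [round(g) for g in narrow_gpr])
    print("wider GPRs (Hz):   ", [round(g) for g in wide_gpr])
    print("narrower VTLs (cm):", [round(v, 1) for v in narrow_vtl])
    print("wider VTLs (cm):   ", [round(v, 1) for v in wide_vtl])
    print("trials per run:", trials_per_run)
    print("trials per point:", trials_per_point)
```

Under this assumption the rounded GPR series match the values listed in the caption of Fig. 1, and the two ranges share a midpoint near 179 Hz and 13.2 cm. The VTL series agree with the listed values only to within about 0.2 cm, which may simply reflect the fact that the stimuli were specified through SER rather than directly in centimetres.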

## Results

Figure 2. Speaker size judgements collapsed across VTL (upper panel) and GPR (lower panel), separately for the narrower and the wider ranges (cf. Fig. 1). The arrows on the inset show the dimension over which the data were collapsed. The open circles show the data from the narrower range and the solid circles from the wider range. The dotted line is the best fitting line for the wider range; the dashed line is the best fitting line for the narrower range, and the solid thick line is the best fitting line for the combined data. The error bars are ± one standard error of the mean (calculated from the average of the eight listeners, where each listener’s average is based on the seven values per point over which the data were collapsed). Each datum point is based on 1400 trials.

Broadly speaking, the results show that judgements of speaker size and sex/age are affected both by GPR and VTL (Figs 3-4 respectively). Listeners reliably reported that vowels spoken with a very low GPR and a very long VTL came from a very tall person; increasing the GPR or shortening the VTL reliably reduced the reported size of the speaker. The influence of VTL upon these size judgements was very strong, as shown by the marked fall-off in reported speaker size as VTL shortened. Examination of the speaker size judgements over the course of the experiment showed little evidence of learning; listeners can do the task at near asymptotic levels almost straightaway. In the perception of sex and age (man, woman, boy, or girl), GPR and VTL had about the same influence in the narrower range about the normal ellipses, but in the wider range, for the more unusual combinations of GPR and VTL, it is VTL information which appears to decide the sex/age judgement.

### The effect of stimulus range on speaker size judgements

We will begin by comparing the size judgements obtained from the two ranges of GPR and VTL values (cf. Fig. 1a and 1b) because the results show that they are essentially sampling the same size surface, and so the data from the two ranges can be combined for subsequent analyses. Figure 2 shows the column and row averages for both the narrower and the wider ranges; specifically, the upper panel shows the data for the two ranges collapsed across VTL (column averages), and the lower panel shows the data collapsed across GPR (row averages), as indicated by the inset schematic. In both panels, the data from the two ranges are seen to fall along similar lines (dashed and dotted for the narrower and wider ranges, respectively). For the GPR column averages in the upper panel, the slope of the line fitted to the data from the narrower range is slightly shallower than the slope of the line fitted to the data from the wider range. For the VTL row averages in the lower panel, the reverse is true; the slope for the wider range is slightly shallower than that for the narrower range. In both cases, when a single line (solid) was fitted to the combined data from the two ranges, it was found to provide an excellent fit to the full data set. Accordingly, the data from the two ranges were combined for subsequent analyses.

### The interaction of GPR and VTL in judgements of speaker size

Figure 3. Perceived size (in color) as a function of GPR and VTL on logarithmic axes. The size scale, from “very short” to “very tall,” is represented by the spectrum of colors from dark-blue (1) to brown-red (7). The points where speaker size ratings were measured are shown by the open circles; between the data points, the surface was derived by interpolation. The data were averaged across all five vowels and eight listeners, so each point is based on 200 trials. The four ellipses show the range of GPR and VTL in speech for men (M), women (W), boys (B), and girls (G), as derived from the data set of Peterson and Barney (1952).

Figure 5. Perceived speaker size as a function of GPR, for VTLs of 6.5, 13.2, and 26.8 cm. The open and solid circles show data from the narrower and wider stimulus ranges, respectively. The solid lines show the best-fitting regression lines for perceived speaker size rating as a function of the natural logarithm of GPR. The error bars are ± one standard error of the mean (calculated from the average of the eight listeners). Each datum point is based on 200 trials.

Figure 6. Perceived speaker size as a function of VTL, for GPRs of 61, 179, and 523 Hz. The solid lines show the best-fitting regression lines for speaker size rating as a function of the natural logarithm of VTL. For all other details see Fig. 5.

Figure 4. Sex and age categorizations. The data are presented as 2D surface plots with color showing the probability of assigning a given GPR-VTL combination to one of four categories (man, woman, boy, or girl). The points where sex/age judgements were collected are shown by the open circles; between the data points the surface was derived by interpolation. At each GPR-VTL point, the probabilities from the four panels sum to 1 (imagine the four separate 2D maps stacked vertically and aligned over each other). The data are averaged across all five vowels and eight listeners (each sample point probability is based on 200 trials). The dotted black contour line marks the classification threshold, that is, a probability $\geqslant$ 0.50 of consistently choosing one category out of the four available. The region of GPR-VTL values enclosed by this line defines a region categorized as one particular sex or age. The four ellipses show the range of GPR and VTL in speech for men (M), women (W), boys (B), and girls (G), as derived from the data set of Peterson and Barney (1952).

The size judgements for both the wider and narrower ranges are presented in Fig. 3 as a 2D surface plot, averaged over the five vowels and eight listeners. The abscissa is GPR and the ordinate is VTL, both on logarithmic axes; color shows perceived speaker size. The GPR-VTL points where speaker size ratings were measured are shown by the open circles; between the data points, the surface was derived by interpolation (see Note 6). The consistency of the size ratings across the two ranges (cf. Fig. 1) is shown by the similarity of the ratings for adjacent stimuli from the two data sets. The seven categories of the size rating scale, from “very short” to “very tall”, were assigned ordinal values from 1 to 7, and they are represented by the spectrum of colors from dark-blue (1) to brown-red (7). The surface shows, as expected, that the combination of a long vocal tract with a low pitch is consistently heard as a large or very large person, and the combination of a short vocal tract with a high pitch is consistently heard as a small or very small person. The four ellipses show the normal range of GPR and VTL in speech for men, women, boys, and girls (Peterson and Barney, 1952). In each case, the ellipse encompasses 99% of the individuals in the Peterson and Barney data for that category of speaker (man, woman, boy or girl). The figure shows that, although the perception of speaker size is affected both by VTL and GPR, the effect of VTL is stronger than that of GPR, at least in this coordinate system. For instance, for a constant GPR of 61 Hz, as we move vertically from a long VTL of 26.8 cm to a short VTL of 6.5 cm, the size rating goes from 6.2 (“tall”) to 1.7 (“short”). The greatest change in perceived size as a function of change in GPR is for a VTL of 26.8 cm, where the size rating goes from 6.2 (“tall”) at 61 Hz to 4.0 (“average”) at 523 Hz. 
The change in the perception of speaker size as a function of GPR and VTL was quantified in terms of the slopes of lines across the size surface in Fig. 3 parallel to the GPR and VTL axes. Perceived speaker size is shown as a function of GPR for three values of VTL in Fig. 5, namely, the two extreme VTLs (6.5 cm and 26.8 cm) associated with very short and very tall people, and a central value (13.2 cm) associated with an average-sized woman. Regression lines were fitted to the speaker size ratings as a function of the natural logarithm of GPR (solid lines in Fig. 5). They show that changes in GPR have the most effect when VTL is at its longest (26.8 cm; slope of -1.04). As VTL decreases to 13.2 cm, the slope decreases by about 60% (slope of -0.40), and as it decreases further to 6.5 cm, the slope becomes flat (0.01), indicating no change in speaker size whatsoever. The negative correlation between GPR and perceived speaker size is highly significant at the longer VTLs of 13.2 cm and 26.8 cm (p < 0.001 and p << 0.001, respectively, based on a one-tailed Spearman’s rank order correlation test for non-parametric variables); the correlation is obviously not significant when VTL is 6.5 cm. Similarly, perceived speaker size is shown as a function of VTL for three GPR values in Fig. 6, namely 61, 179 and 523 Hz. Again, they are the extreme values from the wider range (61 and 523 Hz) and the central value associated with an average-sized woman (179 Hz). Regression lines were fitted to the speaker size ratings as a function of the natural logarithm of VTL (solid lines in Fig. 6). The slopes of these VTL lines are all steeper than those of the GPR lines in Fig. 5. The slopes of these VTL lines become steeper as GPR decreases; the gradient is 1.50, 3.06 and 3.57 for GPRs of 523, 179 and 61 Hz, respectively. 
The correlation between VTL and perceived speaker size is highly significant for all three lines (p << 0.001, based on a one-tailed Spearman’s rank order correlation test for non-parametric variables). Figures 5 and 6 show an interaction between GPR and VTL in the perception of speaker size, especially at extreme GPR or VTL values. Simulated speakers that would only stand two feet tall, with very short VTLs (Fig. 5, VTL=6.5 cm), are always judged as short regardless of their GPR. Simulated giants of 12 feet (Fig. 5, VTL=26.8 cm) are always heard as above average height, but their estimated height declines as GPR increases. Figure 6 shows that the perception of speaker size is strongly affected by VTL, but that the effect weakens as GPR increases (cf. the decrease in slope for GPRs of 61, 179 and 523 Hz).
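
To make the slope units concrete, the following Python sketch fits a least-squares line to size ratings against the natural logarithm of the stimulus variable. The function name is hypothetical, and the only inputs used are the corner ratings quoted in the text (6.2, 4.0 and 1.7), so each "fit" here reduces to a two-point estimate rather than the regression over all sample points reported in Figs 5 and 6.

```python
import math


def slope_vs_log(xs, ratings):
    """Least-squares slope of rating against ln(x).

    The result is the change in size rating per unit change in the
    natural logarithm of the stimulus variable (GPR in Hz or VTL in cm).
    """
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(ratings) / n
    sxx = sum((lx - mean_x) ** 2 for lx in logs)
    sxy = sum((lx - mean_x) * (y - mean_y) for lx, y in zip(logs, ratings))
    return sxy / sxx


# Corner ratings quoted in the text for the wider range.
# Along GPR at VTL = 26.8 cm: 6.2 at 61 Hz, 4.0 at 523 Hz.
gpr_slope = slope_vs_log([61, 523], [6.2, 4.0])
# Along VTL at GPR = 61 Hz: 1.7 at 6.5 cm, 6.2 at 26.8 cm.
vtl_slope = slope_vs_log([6.5, 26.8], [1.7, 6.2])

if __name__ == "__main__":
    print(f"end-point GPR slope at VTL 26.8 cm: {gpr_slope:.2f}")
    print(f"end-point VTL slope at GPR 61 Hz:   {vtl_slope:.2f}")
```

The end-point estimates (about -1.0 and about 3.2 rating units per natural-log unit) are of the same sign and order as the fitted slopes of -1.04 and 3.57, and they differ because the reported regressions use every measured point along each line, not just the two ends.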

## Discussion

The size rating experiments show that listeners make consistent judgements about speaker size given a sequence of vowel sounds (Fig. 3). Both GPR and VTL affect judgements of speaker size (Figs 5-6), and the effect of VTL is strong enough to change speaker size estimates from tall to short. The sex and age judgements are also affected both by the GPR and the VTL of the vowels (Fig. 4). The data show that sex and age are not dictated solely by GPR or VTL; rather, there is an interaction between these variables that means that specific combinations of GPR and VTL act as robust indicators of sex and age.

### Speaker sex and age: the interaction of GPR and VTL

Previous research attempting to identify those acoustic properties of male and female voices responsible for our perception of sex type has used either statistical clustering methods (e.g. Childers and Wu, 1991; Wu and Childers, 1991; Bachorowski and Owren, 1999) or perceptual categorization experiments (e.g. Schwartz, 1968; Schwartz and Rine, 1968; Ingemann, 1968; Lass et al., 1976). The statistical clustering studies have consistently highlighted GPR and vocal tract related variables as explaining most of the variance between the speech sounds of adult males and females (Childers and Wu, 1991; Bachorowski and Owren, 1999). Some studies have shown that vocal tract information alone can be used to identify speaker sex (Schwartz, 1968; Ingemann, 1968; Schwartz and Rine, 1968). Other studies have reported that GPR is a much stronger cue to speaker sex than VTL (Lass et al., 1976). Statistical clustering studies suggest that GPR and VTL are highly correlated (Childers and Wu, 1991; Wu and Childers, 1991). Other studies suggest that formant information can be important in discriminating speaker sex (Coleman, 1976; Whiteside, 1998) but generally pitch is dominant (Whiteside, 1998). Recently, Bachorowski and Owren (1999) have shown that sex classification accuracy is excellent using only GPR or only VTL, but best using both. Our reasons for wishing to measure the interaction of GPR and VTL in sex/age judgements were based on two main factors. First, we believe that the auditory system employs a scale invariant neural transform to normalize natural sounds for size prior to more central processes like speaker identification (e.g. Irino and Patterson, 2002; Turner, Al-Hames, Smith, Kawahara, Irino and Patterson, 2005). 
We have recently reported evidence that human listeners are able to discriminate and use size information in speech sounds (vowels), suggesting that size information is actively used in auditory perception (Smith, Patterson and Jefferis, 2003; Smith and Patterson, 2004a; Smith et al., 2005). We were thus interested in how speaker size information, as mediated by VTL and GPR cues, influenced decisions in natural sex/age categorization (man, woman, boy, or girl). Second, both statistical and perceptual classification studies are limited to databases of sounds that are from normal groups, i.e. recorded from largely homogeneous (usually adult) males and females. Thus the range over which the independent variables could be manipulated was necessarily limited. The vocoder STRAIGHT (Kawahara et al., 1999; Kawahara and Irino, 2005) enabled us to manipulate the GPR and VTL of vowels independently of each other over a huge range. These speech sounds are of high quality even when pushed well beyond the normal range of speech. This allows unprecedented control over our main experimental variables, across a much wider range of GPRs and VTLs than has been used previously. We found that both GPR and VTL contribute to listeners’ perception of the sex and age of a speaker (Fig. 4). If GPR was the sole perceptual determinant of the sex and age of the speaker (man, woman, boy or girl), then listeners would only be able to reliably classify most men (GPR < ~155 Hz) and the higher-pitched girls (GPR > ~330 Hz). If VTL was the only perceptual marker to sex and age then listeners would only be able to reliably classify taller men (with VTL > ~16 cm) and shorter girls (with VTL < ~10 cm). The sex classification performance of our listeners is much better than this.
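
The single-cue argument above can be made concrete with a small Python sketch. The thresholds are the approximate values quoted in the text (about 155 and 330 Hz for GPR, about 16 and 10 cm for VTL); the function names are hypothetical, and the rules are a caricature of a listener who uses only one cue, not a model of the listeners in the experiment.

```python
def classify_by_gpr_only(gpr_hz):
    """Category that GPR alone could support reliably, else None."""
    if gpr_hz < 155:      # approximate upper bound for most men
        return "man"
    if gpr_hz > 330:      # approximate lower bound for higher-pitched girls
        return "girl"
    return None           # women, boys and many girls overlap here


def classify_by_vtl_only(vtl_cm):
    """Category that VTL alone could support reliably, else None."""
    if vtl_cm > 16:       # approximate lower bound for taller men
        return "man"
    if vtl_cm < 10:       # approximate upper bound for shorter girls
        return "girl"
    return None           # the remaining speakers overlap here


if __name__ == "__main__":
    # The central stimulus shared by both ranges (179 Hz, 13.2 cm).
    print(classify_by_gpr_only(179), classify_by_vtl_only(13.2))
```

For the central stimulus shared by both ranges (179 Hz, 13.2 cm), both single-cue rules return `None`, which illustrates why a listener relying on only one cue could not classify reliably across the middle of the GPR-VTL plane.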

## Summary and Conclusions

Listeners were presented with vowels in a single-interval, two-response paradigm. The listener heard a vowel scaled in GPR and VTL, and had to make one judgement about the size of the speaker (on a 7-point ordinal scale ranging from “very short” to “very tall”) and a second judgement about the sex/age of the speaker (man, woman, boy, or girl). The results from the speaker size judgement experiment show that VTL has a strong influence upon perceived speaker size (Figs 3, 5-6). The strength of this effect presumably reflects the high correlation of VTL with speaker size. The results of the sex/age categorization experiments show that judgements of speaker sex/age are influenced by the interaction of GPR and VTL (Fig. 4). In the normal range of GPR and VTL values, judgements of sex/age are consistent with listeners combining both GPR and VTL information about equally to give a robust indicator of sex and age. When listeners are presented with unusual GPR and VTL combinations, where low GPRs are combined with short VTLs, the VTL information appears to decide the sex/age judgement.

## Acknowledgements

This research was supported by the UK MRC (G9901257; G9900369) and the German Volkswagen Foundation (VWF 1/79 783). Some of the data were reported in abstract form (Smith and Patterson, 2004b; Smith and Patterson, 2005). We thank Richard Turner for providing the ellipses showing the GPR-VTL values for men, women, boys and girls as derived from the data of Peterson and Barney (1952).

## Notes

1. The shape of the vocal tract is largely determined by the placement of the tongue within the oral cavity. The shape affects the positioning of the formants relative to each other – different vowels having different vector angles in a multi-dimensional vowel space. For the purposes of our argument, we assume the same fixed vocal tract shape across all speakers, i.e. the speakers are uttering the same vowel.

2. http://www.mrc-cbu.cam.ac.uk/cnbh/web2002/framesets/Soundsframeset.htm. Click on “Scaled vowels”.

3. Using the British English meaning of ‘quite’ as meaning ‘to some extent’.

4. The GPR and F1-3 formant values of 76 men, women, boys and girls speaking ten vowels were extracted from the Peterson and Barney (1952) vowel data set. Estimates of the inferred VTLs were calibrated against measurements of VTLs taken from magnetic resonance images (Fitch and Giedd, 1999) (Richard Turner, personal communication). Each ellipse represents the mean ± three standard deviations for each category of speaker.

5. An estimate of the size of speaker for a given SER was derived by extrapolating from the VTL versus height data in Fitch and Giedd (1999; cf. Fig. 2a). In Fitch and Giedd, the average VTL for 7 men aged 19 to 25 was 15.54 cm. An SER of 0.58 means that the spectrum envelope of the initial input vowel has been compressed by a factor of 1.72 (=1/0.58), while an SER of 2.39 means that the spectrum envelope has been dilated by 0.42 (=1/2.39). Assuming linear scaling between VTL and formant frequency, these SER values are equivalent to the VTLs possessed by giants (VTL=26.8 cm) and tiny children (VTL=6.5 cm).

6. The two 7 x 7 ranges (cf. Fig. 1) were merged to form one 13 x 13 matrix (the middle row and column of both ranges are the same). Any empty cell in the matrix was filled by the average of all adjoining cells where a speaker size rating had been collected. The data surface was derived by interpolation between the sample points and their averaged neighbors.
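
The merging and filling steps described in Note 6 might look roughly like the Python sketch below. It rests on two assumptions that the note does not spell out: "adjoining" is taken to mean the eight surrounding cells, and empty cells are represented by `None`. The function names are hypothetical, and the 3 x 3 ratings in the demonstration are made-up placeholders, not data from the study.

```python
def merge_axes(narrow, wide):
    """Sorted union of two axis value lists; a shared value is kept once."""
    return sorted(set(narrow) | set(wide))


def fill_empty_cells(grid):
    """Return a copy of grid in which each empty cell (None) is replaced by
    the mean of the measured cells among its eight neighbours.

    Only originally measured cells contribute to the averages, and cells
    with no measured neighbour are left as None.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] is not None:
                continue
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(n_rows, r + 2))
                for cc in range(max(0, c - 1), min(n_cols, c + 2))
                if (rr, cc) != (r, c) and grid[rr][cc] is not None
            ]
            if neighbours:
                filled[r][c] = sum(neighbours) / len(neighbours)
    return filled


# GPR sample points of the two ranges (Hz), as listed in Fig. 1.
gpr_axis = merge_axes([80, 105, 137, 179, 234, 306, 400],
                      [61, 87, 125, 179, 256, 366, 523])

# Toy 3 x 3 example with placeholder ratings (NOT data from the study).
toy = [[1.0, None, 3.0],
       [None, None, None],
       [5.0, None, 7.0]]
toy_filled = fill_empty_cells(toy)

if __name__ == "__main__":
    print(len(gpr_axis), "GPR values on the merged axis")
    for row in toy_filled:
        print(row)
```

Merging the two 7-point GPR axes gives 13 distinct values because the 179-Hz midpoint is shared, which is where the 13 x 13 matrix comes from.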

## References

Bachorowski, J., and Owren, M. J. (1999). “Acoustic correlates of talker sex and individual talker sex identity are present in a short vowel segment produced in running speech,” J. Acoust. Soc. Am. 106, 1054-1063.

Beckford, N. S., Rood, S. R., and Schaid, D. (1985). “Androgen stimulation and laryngeal development,” Ann. Otol. Rhinol. Laryngol. 94, 634-640.

Childers, D. G., and Wu, K. (1991). “Gender recognition from speech. Part II: Fine analysis,” J. Acoust. Soc. Am. 90, 1841-1856.

Coleman, R. O. (1976). “A comparison of the contributions of two voice quality characteristics to the perception of maleness and femaleness in the voice,” J. Speech Hear. Res. 19, 168-180.

Collins, S. A. (2000). “Men’s voices and women’s choices,” Animal Beh. 60, 773-780.

Darwin, C. (1871). The descent of man and selection in relation to sex (Murray, London).

Dudley, H. (1939). “Remaking speech,” J. Acoust. Soc. Am. 11, 169-177.

Fant, G. (1970). Acoustic Theory of Speech Production 2nd ed. (Mouton, Paris).

Fairchild, L. (1981). “Mate selection and behavioural thermoregulation in Fowler’s toads,” Science 212, 950-951.

Fitch, W. T. (1994). “Vocal tract length perception and the evolution of language,” Ph.D. dissertation, Brown University.

Fitch, W. T. (1997). “Vocal tract length and formant frequency dispersion correlate with body size in rhesus monkeys,” J. Acoust. Soc. Am. 102, 1213-1222.

Fitch, W. T. (1999). “Acoustic exaggeration of size in birds by tracheal elongation: Comparative and theoretical analyses,” J. Zool. 248, 31-49.

Fitch, W. T., and Giedd, J. (1999). “Morphology and development of the human vocal tract: A study using magnetic resonance imaging,” J. Acoust. Soc. Am. 106, 1511-1522.

Fitch, W. T. (2000). “The evolution of speech: a comparative review,” Trends Cog. Sci. 4, 258-267.

González, J. (2004). “Formant frequencies and body size of speaker: a weak relationship in adult humans,” J. Phonetics 32, 277-287.

Hast, M. (1989). “The larynx of roaring and non-roaring cats,” J. Anat. 163, 117-121.

Hollien, H., Green, R., and Massey, K. (1994). “Longitudinal research on adolescent voice change in males,” J. Acoust. Soc. Am. 96, 3099-3111.

Huber, J. E., Stathopoulos, E. T., Curione, G. M., Ash, T., and Johnson, K. (1999). “Formants of children, women and men: The effects of vocal intensity variation,” J. Acoust. Soc. Am. 106, 1532-1542.

Ingemann, F. (1968). “Identification of the speaker’s sex from voiceless fricatives,” J. Acoust. Soc. Am. 44, 1142-1144.

Irino, T., and Patterson, R. D. (2002). “Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform,” Speech Communication 36, 181-203.

Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A. (1999). “Restructuring speech representations using pitch-adaptive time-frequency smoothing and instantaneous-frequency-based F0 extraction: Possible role of repetitive structure in sounds,” Speech Communication 27(3-4), 187-207.

Kawahara, H., and Irino, T. (2005). “Underlying principles of a high-quality speech manipulation system STRAIGHT and its application to speech segregation,” in Speech separation by humans and machines, P. Divenyi (Ed.), Kluwer Academic, Massachusetts, 167-180.

Künzel, H. J. (1989). “How well does average fundamental frequency correlate with speaker height and weight?” Phonetica 46, 117-125.

Lass, N. J., and Davis, M. (1976). “An investigation of speaker height and weight identification,” J. Acoust. Soc. Am. 60, 700-703.

Lass, N. J., and Brown, W. S. (1978). “Correlational study of speakers’ heights, weights, body surface areas and speaking fundamental frequencies,” J. Acoust. Soc. Am. 63, 1218-1220.

Lass, N. J., Hughes, K. R., Bowyer, M. D., Waters, L. T., and Bourne, V. T. (1976). “Speaker sex identification from voiced, whispered, and filtered isolated vowels,” J. Acoust. Soc. Am. 59, 675-678.

Liu, C., and Kewley-Port, D. (2004). “STRAIGHT: a new speech synthesizer for vowel formant discrimination,” Acoustics Research Letters Online 5, 31-36.

Morton, E. S. (1977). “On the occurrence and significance of motivation-structural rules in some bird and mammal sounds,” American Naturalist 111, 855-869.

Narins, P. M., and Smith, S. L. (1986). “Clinal variation in anuran advertisement calls—basis for acoustic isolation,” Behav. Ecol. Sociobiol. 19, 135-141.

Negus, V. E. (1949). The Comparative Anatomy and Physiology of the Larynx (Hafner, New York).

Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175-184.

Rendall, D., Owren, M. J., Weerts, E., and Hienz, R. D. (2004). “Sex differences in the acoustic structure of vowel-like grunt vocalizations in baboons and their perceptual discrimination by baboon listeners,” J. Acoust. Soc. Am. 115, 411-421.

Riede, T., and Fitch, W. T. (1999). “Vocal tract length and acoustics of vocalization in the domestic dog Canis familiaris,” J. Exp. Biol. 202, 2859-2867.

Sachs, J., Lieberman, P., and Erickson, D. (1973). “Anatomical and cultural determinants of male and female speech,” in Language Attitudes: Current Trends and Prospects, R. W. Shuy and R. W. Fasold (Eds.), Georgetown University Press, Washington, D.C.

Schwartz, M. F. (1968). “Identification of speaker sex from isolated, voiceless fricatives,” J. Acoust. Soc. Am. 43, 1178-1179.

Schwartz, M. F., and Rine, H. E. (1968). “Identification of speaker sex from isolated, whispered vowels,” J. Acoust. Soc. Am. 44, 1736-1737.

Smith, D. R. R., Patterson, R. D., and Jefferis, J. (2003). “The perception of scale in vowel sounds,” British Society of Audiology, Nottingham P35.

Smith, D. R. R., and Patterson, R. D. (2004a). “The existence region of scaled vowels in pitch-VTL space,” 18th Int. Congress on Acoustics, Kyoto, Japan, vol. I, 453-456.

Smith, D. R. R., and Patterson, R. D. (2004b). “The perception of sex and size in vowel sounds,” British Society of Audiology, UCL London P49.

Smith, D. R. R., and Patterson, R. D. (2005). “Perception of speaker size and sex of vowel sounds,” J. Acoust. Soc. Am. 117, 2374.

Smith, D. R. R., Patterson, R. D., Turner, R., Kawahara, H., and Irino, T. (2005). “The processing and perception of size information in speech sounds,” J. Acoust. Soc. Am. 117, 305-318.

Titze, I. R. (1989). “Physiologic and acoustic differences between male and female voices,” J. Acoust. Soc. Am. 85, 1699-1707.

Turner, R. E., and Patterson, R. D. (2003). “An analysis of the size information in classical formant data: Peterson and Barney (1952) revisited,” J. Acoust. Soc. Jpn. 33, 585-589.

Turner, R. E., Al-Hames, M. A., Smith, D. R. R., Kawahara, H., Irino, T., and Patterson, R. D. (2005). “Vowel normalisation: Time-domain processing of the internal dynamics of speech,” in Dynamics of Speech Production and Perception, edited by P. Divenyi (IOS Press) (in press).

van Dommelen, W. A., and Moxness, B. H. (1995). “Acoustic parameters in speaker height and weight identification: sex-specific behaviour,” Language and Speech 38, 267-287.

Whiteside, S. P. (1998). “Identification of a speaker’s sex from synthesized vowels,” Percept. Mot. Skills 86, 595-600.

Wu, K., and Childers, D. G. (1991). “Gender recognition from speech. Part I: Coarse analysis,” J. Acoust. Soc. Am. 90, 1828-1840.