Category:Application of HSR to ASR
From CNBH Acoustic Scale Wiki
Perceptual experiments with communication sounds show what everyone intuitively knows: auditory perception is singularly robust to changes in both the resonance rate and the pulse rate of a communication sound (Smith et al., 2005; Smith and Patterson, 2005; Ives et al., 2005; van Dinther and Patterson, 2006; Smith et al., 2007). The experiments show that we have no difficulty understanding that a child and an adult have spoken the same speech sounds (syllables or words), despite substantial differences in the pulse rate and resonance rate of the waves carrying the message. We also know which speaker has the higher pitch and which speaker is bigger (i.e., which speaker has the longer vocal tract). Perceptual experiments have been performed with vowels (Smith et al., 2005), syllables (Ives et al., 2005), musical notes (van Dinther and Patterson, 2006) and animal calls; they all lead to the conclusion that auditory perception is singularly robust to the scale variability in communication sounds. The robustness of human perception also extends to speech sounds and musical sounds scaled well beyond the range of normal experience (Smith et al., 2005; van Dinther and Patterson, 2006), which suggests that the robustness is based on automatic adaptation or normalization mechanisms rather than learning. A description of how the auditory system might perform the necessary normalization is presented in The robustness of bio-acoustic communication and the role of normalization.
The robustness of auditory perception stands in contrast to the lack of robustness in automatic speech recognition systems: a speech recognizer trained on the speech of a man is typically not able to recognize the speech of a woman, let alone that of a child. Thus, the robustness that we take for granted and think of as trivial poses a very difficult problem if the recognition system that follows the pre-processor is left to learn about pulse-rate and resonance-rate variability from a time-frequency representation like the spectrogram.
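One way to see why resonance-rate variability is awkward for a recognizer working from a linear-frequency spectrogram is the following minimal sketch. It assumes that a change in vocal tract length scales every resonance frequency by the same factor; the formant values and the factor of 1.3 are purely illustrative and are not taken from the experiments cited above. Under that assumption, the change appears as a different displacement in Hz for each resonance, but as a single uniform shift on a logarithmic frequency axis.

```python
import math

def scale_formants(formants_hz, factor):
    """Scale every resonance frequency by one common factor
    (a hypothetical stand-in for a change in vocal tract length)."""
    return [f * factor for f in formants_hz]

def linear_shifts(formants_a, formants_b):
    """Per-formant displacement in Hz on a linear frequency axis."""
    return [b - a for a, b in zip(formants_a, formants_b)]

def log_shifts(formants_a, formants_b):
    """Per-formant displacement in octaves on a log2 frequency axis."""
    return [math.log2(b / a) for a, b in zip(formants_a, formants_b)]

# Illustrative values only (assumptions, not measured data).
adult = [730.0, 1090.0, 2440.0]
child = scale_formants(adult, 1.3)

hz_shifts = linear_shifts(adult, child)      # differs for every formant
octave_shifts = log_shifts(adult, child)     # the same for every formant

print("Shift in Hz:     ", [round(s, 1) for s in hz_shifts])
print("Shift in octaves:", [round(s, 4) for s in octave_shifts])
```

On the linear axis the recognizer would have to learn a separate displacement for each resonance and each speaker size, whereas on the logarithmic axis the same change reduces to one translation. This is only a toy illustration of scale variability, not a description of the normalization mechanism discussed on this wiki.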
The category Application of HSR to ASR focuses on applying knowledge about Human Speech Recognition (HSR) to Automatic Speech Recognition (ASR).
Project Reports
Establishing Norms for the Robustness of Automatic Speech Recognition
Establishing Norms for the Robustness of Human Speech Recognition
Estimating Vocal Tract Length from a Stream of Vowel Sounds
Excerpts from published papers
Low-Dimensional, Auditory Feature Vectors that Improve VTL Normalization in Automatic Speech Recognition
Research projects
Scale-Covariant Features for Automatic Speech Recognition
Published papers for the Category: Application of HSR to ASR
Comparing the Robustness of HSR and ASR: Monaghan et al. (2008)
References
- Ives, D.T., Smith, D.R.R. and Patterson, R.D. (2005). “Discrimination of speaker size from syllable phrases.” J. Acoust. Soc. Am., 118, p.3816-3822.
- Monaghan, J.J., Feldbauer, C., Walters, T.C. and Patterson, R.D. (2008). “Low-dimensional, auditory feature vectors that improve vocal-tract-length normalization in automatic speech recognition.” J. Acoust. Soc. Am., 123, p.3066.
- Smith, D.R.R., Patterson, R.D., Turner, R.E., Kawahara, H. and Irino, T. (2005). “The processing and perception of size information in speech sounds.” J. Acoust. Soc. Am., 117, p.305-318.
- Smith, D.R.R., Walters, T.C. and Patterson, R.D. (2007). “Discrimination of speaker sex and size when glottal-pulse rate and vocal-tract length are controlled.” J. Acoust. Soc. Am., 122, p.3628-3639.
- Smith, D.R.R. and Patterson, R.D. (2005). “The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age.” J. Acoust. Soc. Am., 118, p.3177-3186.
- van Dinther, R. and Patterson, R.D. (2006). “Perception of acoustic scale and size in musical instrument sounds.” J. Acoust. Soc. Am., 120, p.2158-2176.