Musical Tones Discovered in Speech
4 Jun, 2007 12:25 pm
A striking feature of music is the use of 12 specific tone intervals in musical composition and performance across many human cultures. This phenomenon has never been explained. A paper published in the 5 June 2007 issue of the Proceedings of the National Academy of Sciences provides evidence that the tones used in music are based on the nature of speech stimuli that all humans are routinely exposed to.
The evidence for these conclusions was generated by recording native English and Mandarin speakers uttering vowel sounds in both single words and a series of short monologues; the researchers then compared the vocal tract frequency ratios in speech sounds to the numerical ratios that define the notes in music. Human vocalization begins with the vocal cords in the larynx, which create a series of power peaks in the air stream coming from the lungs (the peaks are called harmonics, and are characteristic of vibrating objects). The changing shape of the vocal tract above the larynx systematically modifies these power peaks as we speak to produce specific resonances that create different speech sounds. The configuration of the vocal tract is determined by the shape of the throat cavity, the position of the soft palate and the position of the tongue and lips, all of which are neurally controlled by the language centers in the brain. These dynamic changes produce the different vowel sounds we use to communicate. The vocal anatomy is thus rather like an organ pipe that can be pinched, stretched and widened on the fly, changing the resonance of the pipe in a flexible yet precisely controlled manner.
Despite the wide variation in individual human anatomy, the speech sounds of different speakers and languages yield the same variety of vocal tract resonance ratios. The lowest two of these vocal tract resonances, called formants, account for our ability to understand the vowel sounds in speech. If one takes away the first two formants, a listener can't understand what the speaker is saying (this experimental modification is easy to implement with digital recordings). The frequency of the first formant is between 200 and 1,000 cycles per second and that of the second formant between 800 and 3,000 cycles per second. Air moving through a 17 centimeter tube closed at one end (which is approximately the default configuration of the average human vocal tract) produces about the same resonances, and is often used to demonstrate the physical basis of the observed resonant ranges of the vocal tract.
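The tube comparison can be checked with the standard quarter-wavelength formula for a pipe closed at one end, f_n = (2n - 1) c / (4L). The sketch below is only an illustration: the function name is made up for this example, and the speed of sound (343 meters per second, the value for dry air at about 20 °C) is an assumption rather than a figure from the paper.

```python
def closed_tube_resonances(length_m, n_modes=3, speed_of_sound=343.0):
    """Resonant frequencies (Hz) of an idealized tube closed at one end.

    Uses the quarter-wavelength formula f_n = (2n - 1) * c / (4 * L).
    speed_of_sound is an assumed value (dry air, about 20 degrees C).
    """
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_modes + 1)
    ]


if __name__ == "__main__":
    # 17 cm tube, as in the text
    for n, freq in enumerate(closed_tube_resonances(0.17), start=1):
        print(f"resonance {n}: {freq:.0f} Hz")
```

For a 17 centimeter tube this gives resonances near 500, 1,500 and 2,500 cycles per second, the first two of which fall inside the formant ranges quoted above.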
When the researchers looked at the ratios of the first two formants in speech spectra, they found that the ratios formed musical relationships. For example, the relationship of the first two formants in the English vowel /a/, as in "bod," might correspond to the musical interval between C and A on a piano keyboard. About 70% of the formant ratios in vowel sounds represented musical intervals. This predominance of musical intervals hidden in speech suggests that the chromatic scale notes in music are specifically appealing because they match the formant ratios we are exposed to all the time in speech, even though listeners are quite unaware of this exposure.
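To make the idea of a formant ratio "representing" a musical interval concrete, the sketch below folds a ratio of two frequencies into a single octave and reports the closest interval from one common set of just-intonation ratios. This is not the authors' analysis procedure: the interval table, the function names and the sample formant values (700 and 1,170 cycles per second) are all assumptions chosen for illustration.

```python
import math
from fractions import Fraction

# One common set of just-intonation ratios (an assumption; variants exist,
# especially for the tritone and the minor seventh).
JUST_INTERVALS = {
    "unison": Fraction(1, 1),
    "minor second": Fraction(16, 15),
    "major second": Fraction(9, 8),
    "minor third": Fraction(6, 5),
    "major third": Fraction(5, 4),
    "perfect fourth": Fraction(4, 3),
    "tritone": Fraction(45, 32),
    "perfect fifth": Fraction(3, 2),
    "minor sixth": Fraction(8, 5),
    "major sixth": Fraction(5, 3),
    "minor seventh": Fraction(9, 5),
    "major seventh": Fraction(15, 8),
    "octave": Fraction(2, 1),
}


def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(float(ratio))


def nearest_interval(f1_hz, f2_hz):
    """Return (interval name, deviation in cents) for two frequencies.

    The ratio is folded into one octave before comparison.
    """
    ratio = max(f1_hz, f2_hz) / min(f1_hz, f2_hz)
    folded = cents(ratio) % 1200.0
    name = min(
        JUST_INTERVALS,
        key=lambda n: abs(cents(JUST_INTERVALS[n]) - folded),
    )
    return name, folded - cents(JUST_INTERVALS[name])


if __name__ == "__main__":
    # Hypothetical first and second formant values, for illustration only
    name, deviation = nearest_interval(700.0, 1170.0)
    print(f"{name} ({deviation:+.1f} cents)")
```

With these hypothetical values the closest interval is a major sixth, the interval between C and A mentioned above.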
No music, except modern experimental pieces, actually uses all 12 tones of the chromatic scale. Most classical music uses the 7-tone diatonic scale to divide octaves, and much of folk music uses just five tones. These preferences correspond to the most prevalent formant ratios in speech.
The investigators also surmise that these findings could resolve a centuries-old debate in music over the most appropriate tuning scheme for instruments. Ten of the 12 harmonic intervals identified in English and Mandarin speech occur in "just intonation" tuning (which to most trained musicians sounds better than other tuning systems). They found fewer correspondences in other tuning systems, including the equal temperament tuning commonly used today. Equal temperament tuning, in which each of the 12 semitone steps in the chromatic scale is made exactly the same size (a frequency ratio of the twelfth root of 2), is a scheme that allows an ensemble such as an orchestra to play together in different keys and across many octaves. Although equal temperament tuning sounds quite good, it is nonetheless a compromise, the vocally derived just intonation tuning system being more “natural” (simply meaning the one that we hear all the time in speech).
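The size of the compromise can be put in numbers by comparing each equal-tempered interval with a just-intonation counterpart, measured in cents (hundredths of an equal-tempered semitone). The sketch below assumes one common set of just ratios; the table and the function name are illustrative and do not come from the paper.

```python
import math

# (interval name, semitone steps, just ratio numerator, denominator)
# The just ratios are one common choice and are an assumption here.
JUST_TABLE = [
    ("minor second", 1, 16, 15),
    ("major second", 2, 9, 8),
    ("minor third", 3, 6, 5),
    ("major third", 4, 5, 4),
    ("perfect fourth", 5, 4, 3),
    ("tritone", 6, 45, 32),
    ("perfect fifth", 7, 3, 2),
    ("minor sixth", 8, 8, 5),
    ("major sixth", 9, 5, 3),
    ("minor seventh", 10, 9, 5),
    ("major seventh", 11, 15, 8),
    ("octave", 12, 2, 1),
]


def equal_minus_just_cents():
    """Map each interval name to (equal temperament - just) in cents."""
    deviations = {}
    for name, steps, num, den in JUST_TABLE:
        equal_cents = 100.0 * steps  # each semitone is 100 cents
        just_cents = 1200.0 * math.log2(num / den)
        deviations[name] = equal_cents - just_cents
    return deviations


if __name__ == "__main__":
    for name, diff in equal_minus_just_cents().items():
        print(f"{name:15s} {diff:+6.2f} cents")
```

Under these assumptions the equal-tempered fifth is only about 2 cents narrower than the just fifth, while the equal-tempered major third is nearly 14 cents wider than the just major third.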
Reference:
Ross D., et al., “Musical Intervals in Speech”, PNAS, 5 June 2007, vol. 104: 9852-9857