Here, I focus on how we can analyse, visualise and synthesise sound, or, more specifically, the timbre of instruments.
Pitch and Timbre Perception
Our perception of music is based on the grouping of frequencies in time and space. That is why a set of frequencies can be heard as a specific tone with an associated pitch, loudness and timbre. Such grouping is done by relating frequencies that originate from nearby spatial locations, have similar onset times, and move in the same direction. The problem, however, is that there are no computational tools that can do this in the immediate and straightforward way the human brain does.
There is not even an easy way for computers to find the perceived pitch from a set of frequencies. Usually the pitch is associated with the lowest frequency, but this is not always the case. Sometimes we can hear a pitch that is not physically present in the sound. Such cases of a “missing fundamental” may be caused by filtering, for example if the sound has travelled through a wall or through a device that does not carry the lowest frequencies, such as a telephone or radio. In these cases the fundamental is usually reconstructed by our hearing from the spectrum, i.e. we tend to “find” the pitch from the harmonics. This way we can still enjoy music and perceive the correct pitch.
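To make the missing-fundamental effect concrete, here is a minimal Python sketch (my own illustration, not part of the original examples) that synthesizes only harmonics 2–6 of a 220 Hz tone; most listeners will still hear a pitch at 220 Hz even though no energy is present at that frequency.

```python
# Sketch: a "missing fundamental" demo. The 220 Hz fundamental is omitted;
# only harmonics 2-6 are synthesized, yet the perceived pitch stays at 220 Hz.
import numpy as np
from scipy.io import wavfile

sr = 44100                            # sample rate in Hz
t = np.arange(int(sr * 2.0)) / sr     # 2 seconds of sample times
f0 = 220.0                            # the (absent) fundamental

tone = np.zeros_like(t)
for k in range(2, 7):                 # harmonics 2..6 only, no energy at f0
    tone += np.sin(2 * np.pi * k * f0 * t) / k

tone /= np.max(np.abs(tone))          # normalize to avoid clipping
wavfile.write("missing_fundamental.wav", sr, (tone * 32767).astype(np.int16))
```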
Similarly, timbre is a difficult concept to handle with computational tools. Despite our excellent ability to discern and recognize timbres, there is still no easy way of describing them, either in everyday language or in physical terms. Usually we refer to the sound source when describing a timbre, e.g. “piano-sound” because it was created by a piano, but there is no simple way of detecting “piano-sound” in a sound file.
One reason why it is difficult to analyse timbre in physical and mathematical terms is probably that it depends on so many parameters. In a classical experiment, Grey (1977) showed how 16 different instrument timbres can be categorized in a simplified, three-dimensional timbre space. The three dimensions of this space were:
Axis I: Spectral energy distribution. This gives sounds ranging from dull to sharp. For example, the French horn is an instrument with a dull sound, while the oboe is much sharper.
Axis II: Synchronicity in harmonic transients, and decay of upper harmonics. This is related to the spectral fluctuation through time. Woodwinds have upper harmonics that enter, reach maxima, and exit in close alignment. The strings are at the other extreme, with harmonics that lack such synchronicity.
Axis III: Amount of high-frequency inharmonic noise in the attack. Strings, flutes and clarinets have high-frequency, low-amplitude, and inharmonic energy in the attack, while brass and bassoons have low-frequency inharmonicity and no high-frequency energy in the attack.
Grey’s results were taken up by Wessel (1979), who showed that the timbre space could be used as a control structure. Wessel operated with two axes, one with the spectral energy distribution controlling the brightness of the tone, and another with the frequency transients in the attacks controlling the “bite” of the tone. The problem with both these approaches, however, is that they are based on psychological experiments with a limited number of instruments. As such, they do not represent methods that can easily be used for classifying timbre directly from an acoustical signal (Cosi, De Poli, and Lauzzana 1994). To exemplify some of the problems related to timbre analysis and recognition, I will look at various ways of analysing a saxophone tone.
Analysis of a Saxophone Tone
Example 15a is a 2.7-second saxophone tone played by Sonny Rollins. A subjective description might be that it is a single-pitched tone with slight changes in timbre and loudness towards the end. How can this be related to the physical signal?
The figure below shows a time-domain plot of the sound, and we can see that there is a clear decrease in amplitude. In this case the amplitude curve fits well with our perception of the tone, including the gradual decrease down to a sudden “fall-off” at around 1.7 seconds. Even though there is no one-to-one correspondence between physical amplitude and perceived loudness, the plot matches our perception quite well.
Amplitude-domain representation of the 3-second Sonny Rollins saxophone tone.
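For readers who want to reproduce such an amplitude curve, a windowed RMS envelope is a simple stand-in; the sketch below assumes a mono file named saxtone.wav, a hypothetical placeholder for Example 15a.

```python
# Sketch: a windowed RMS envelope as a rough stand-in for the amplitude curve.
import numpy as np
from scipy.io import wavfile

sr, x = wavfile.read("saxtone.wav")        # hypothetical file name, mono assumed
x = x.astype(float) / np.max(np.abs(x))    # normalize

win = int(0.02 * sr)                       # 20 ms analysis windows
n_frames = len(x) // win
rms = np.array([np.sqrt(np.mean(x[i * win:(i + 1) * win] ** 2))
                for i in range(n_frames)])
rms_db = 20 * np.log10(rms + 1e-9)         # dB scale matches loudness a bit better
times = (np.arange(n_frames) + 0.5) * win / sr
```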
An automatic pitch extraction from this sound, using Addan 3.0, is shown in the figure below. It is interesting to notice, when comparing our sensation of pitch with these measurements, that the software has found a pitch one octave below the perceived pitch of Ab4 (approx. 420 Hz). This is a quite common “mistake” made by pitch trackers, since most algorithms simply look for the lowest frequency in the sound.
Estimate of the fundamental frequency of the Rollins tone. The y-axis is cropped for the sake of clarity. Notice how much the frequency changes.
The second interesting thing to notice is how much this estimated fundamental frequency (F0) changes throughout the plot. The difference between the lowest and highest value is almost 12 Hz, which at this pitch corresponds to roughly a semitone. There seems to be a local peak at around 1.7 seconds and a global peak at around 2.2 seconds. When listening to the tone, I do hear the pitch change slightly, but not as much as the figure above suggests. The reason for this might be that the pitch extraction and/or our perception is “imprecise”.
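As an illustration of why pitch trackers make such octave “mistakes”, here is a naive autocorrelation-based F0 estimator; this is only a generic sketch, not the algorithm used by Addan. The autocorrelation of a harmonic tone has strong peaks at multiples of the true period, so the search range and peak picking largely decide which octave wins.

```python
# Sketch: naive autocorrelation F0 estimation for one frame of audio.
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=1000.0):
    """Return a rough F0 estimate (Hz) for one frame; frame should span
    at least a few periods of the lowest frequency of interest."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(sr / fmax), int(sr / fmin)    # lag range to search
    lag = lo + np.argmax(ac[lo:hi])            # strongest periodicity in range
    return sr / lag
```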
A reason for the “problems” with such pitch extraction might be found by investigating the rest of the frequencies we hear. Let us first look at a plot of the amplitudes of each of the harmonic frequencies (the figure below). As can be seen, the harmonic close to 800 Hz is very prominent. There are also quite high values for the frequencies in the ear’s most sensitive region, so they will probably be perceived as louder than the harmonics around 200, 400 and 600 Hz. For this tone, then, the perceptually loudest harmonics lie much higher than the fundamental.
Plot of the spectrum of the Rollins tone. The graph shows the average amplitude for each frequency in the spectrum. Notice how each of the harmonics is clearly visible, and that the harmonic close to 800 Hz has the highest amplitude.
The same thing can be seen from the spectrogram (the figure below), where there seems to be a lot of spectral energy in the higher harmonics (the darker regions in the plot). Notice also that there is more energy in the higher frequencies in the beginning of the sound. Then, at around 1.7 seconds, there is a sudden decrease of energy across the whole spectrum. This corresponds well to how we perceive the tone to become “duller” at this point.
Spectrogram (frequency vs. time) of the saxophone sound.
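The average spectrum and the spectrogram discussed above can be approximated with standard tools; the sketch below again assumes the hypothetical file saxtone.wav standing in for Example 15a.

```python
# Sketch: spectrogram plus time-averaged spectrum of the tone.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, x = wavfile.read("saxtone.wav")              # hypothetical file name
f, t, S = spectrogram(x.astype(float), fs=sr, nperseg=2048, noverlap=1024)

avg_spectrum = S.mean(axis=1)                    # average energy per frequency bin
strongest_harmonic = f[np.argmax(avg_spectrum)]  # e.g. close to 800 Hz for this tone
```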
As suggested by Wessel (1979), the perceived brightness of a tone is related to the spectral energy distribution. Beauchamp (1982) further suggested that the relationship between intensity and the spectral centroid may be an important perceptual correlate of timbre: a tone that sounds “brighter” has a higher spectral centroid. The spectral centroid can be found from the physical signal as the mean of the spectral energy distribution, or the “balance point” of the spectrum. It is computed by summing over pairs of amplitude and frequency for a given time window, where $a_i$ is the amplitude and $f_i$ the frequency of spectral component $i$:
$$\text{spectral centroid} = \frac{\sum_{i=1}^{N} a_i f_i}{\sum_{i=1}^{N} a_i}$$
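As a concrete illustration, the formula can be evaluated frame by frame from a spectrogram; the following sketch again assumes the hypothetical file saxtone.wav stands in for Example 15a.

```python
# Sketch: spectral centroid per analysis frame, following the formula above.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, x = wavfile.read("saxtone.wav")              # hypothetical file name
f, t, S = spectrogram(x.astype(float), fs=sr, nperseg=2048, noverlap=1024)

a = np.sqrt(S)                                   # magnitude spectrum from the PSD
# centroid = sum(a_i * f_i) / sum(a_i), evaluated for every time frame
centroid = (f[:, None] * a).sum(axis=0) / (a.sum(axis=0) + 1e-12)
```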
The figure below is a plot of the spectral centroid of the Rollins tone, where more energy in the higher frequencies indicates a brighter tone. From this graph we can conclude that the brightness is at a minimum at about 1.7 seconds, which fits well with what we hear.
Spectral centroid of the Rollins tone.
Relevance of the Harmonics
Looking at all these graphs and plots, a relevant question is whether we actually perceive all the information that is displayed. Do we really need all the highest frequencies? The figure below shows a spectrum plot of the Rollins tone (Example 15a), and from this plot it seems that only the first 15 harmonics have relatively large values. One reason for this is that the values are plotted on a linear scale; a logarithmic scale would have shown a less drastic curve. Still, we may ask whether the first 15 harmonics would be sufficient to adequately describe the tone.
Linear spectrum plot of the Rollins tone, showing the relative amplitude over frequency.
The best way of testing this is to synthesize the tone and listen to how it is perceived. I therefore ran a frequency analysis of the tone in Addan, extracting all the harmonic frequencies with their corresponding amplitudes. To check the precision of the analysis, the tone was synthesized directly from the analysis values; if the analysis is good, the resynthesized tone should deviate little from the original. Indeed, the synthesized tone (Example 15b) sounds very much like the original (Example 15a). This can be checked further by subtracting the synthesized tone from the original, leaving only noise and analysis error. Example 15c shows that this residual is almost inaudible, which further confirms that the analysis is good.
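A rough sketch of this analysis-resynthesis-residual idea is given below. It assumes the partial tracks have already been exported as plain arrays (freqs and amps, one row per partial, one column per analysis frame); it does not read Addan or SDIF files. Limiting n_partials also gives a simple equivalent of the partial-count test described next.

```python
# Sketch: additive resynthesis from partial tracks, plus the residual by subtraction.
import numpy as np

def resynthesize(freqs, amps, frame_rate, sr, n_partials=None):
    """Sum-of-sinusoids resynthesis from linearly interpolated partial tracks.
    freqs, amps: arrays of shape (n_partials, n_frames)."""
    if n_partials is not None:
        freqs, amps = freqs[:n_partials], amps[:n_partials]
    n_frames = freqs.shape[1]
    frame_t = np.arange(n_frames) / frame_rate
    t = np.arange(int(n_frames * sr / frame_rate)) / sr
    out = np.zeros_like(t)
    for fr, am in zip(freqs, amps):
        f_i = np.interp(t, frame_t, fr)            # per-sample frequency
        a_i = np.interp(t, frame_t, am)            # per-sample amplitude
        phase = 2 * np.pi * np.cumsum(f_i) / sr    # integrate frequency to phase
        out += a_i * np.sin(phase)
    return out

# residual = original[:len(synth)] - synth   # what remains is noise and analysis error
```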
To test how many harmonics are necessary to adequately reproduce the sound, I made the patch Play-Add-Files (the figure below). This program plays back analysis files saved in the SDIF file format. In the patch the user can adjust the number of partials to be played back, and the corresponding original sound file is also loaded so that the result can be compared with it.
Screenshot from the patch Play-Add-Files that plays SDIF-files using additive synthesis.
The patch uses the CNMAT SDIF-menu to load analysis files into the SDIF-buffer. The SDIF-tuples object has a nice feature for retrieving a certain number of rows or columns of data from an SDIF file, so the number of harmonics to be played back can easily be constrained. The additive synthesis is done with the sinusoids~ object, as presented in previous chapters. Unfortunately, it was not possible to make a standalone application of this patch because of the many specialized external objects, but the patch may be inspected and tested by opening it in Max/MSP 4.
Selections of five different saxophone phrases by Sonny Rollins are included in the patch. Some of these phrases are more difficult than others, in the sense that they contain large interval changes and noise that cause glitches and problems in the analysis. This is audible in some cases, but it does not really affect the overall point I am trying to make, namely how many harmonics are necessary to adequately synthesize the sounds. Testing with different settings, it seems quite clear that 15 harmonics are not sufficient to give a good approximation of the timbre of the instrument. On the other hand, I would argue that choosing more than 60 partials does not add much extra information. It is also interesting to play back only one or two partials and hear how the analysis has sometimes left out the correct fundamental. This can be heard in Example 16, where nine versions of the Rollins tone are presented, each with a different number of harmonics. A spectrogram of these tones is displayed in the figure below, where the reduced number of partials is easy to see.
Spectrogram of the saxophone tone played with fewer and fewer partials. Notice how visible the reduction of harmonics in the spectrum is.
The spectrogram also explains the leap in perceived loudness between the versions with 5 and 15 partials: partials 11-15 lie in the ear's most sensitive region.
Perceptual Models
Up to this point I have presented various ways of visualizing music directly from the physical signal. However, it is also possible to incorporate perceptual models before displaying the signals. The IPEM-toolbox for Matlab is a collection of tools for music analysis based on a perceptual framework (Leman, Lesaffre, and Tanghe 2001a). The toolbox uses an Auditory Peripheral Module adapted from the Van Immerseel and Martens model, involving several stages of filtering similar to how our ear works:
Simulation of filtering in the outer and middle ear.
Simulation of the filtering in the inner ear, using an array of band-pass filters.
Simulation of a hair cell model where the band-pass filtered signals are converted to neural rate-code patterns.
The output of the Auditory Peripheral Module is an auditory nerve image of a sound, or “a kind of physiological justified representation of the auditory information stream along the VIIIth cranial nerve” (Leman, Lesaffre, and Tanghe 2001b: 18). Such a primary image represents the excitation in the various channels of the auditory system, and an example of how this looks for the Rollins tone (Example 15a) is shown in the figure below. Notice how the energy levels decline through time, similar to the decrease seen in the spectrogram shown earlier.
Auditory Nerve Image (ANI) of the Rollins tone.
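For readers without Matlab, the general idea of such an auditory front end can be sketched as a band-pass filterbank followed by half-wave rectification and smoothing. The code below is only a schematic stand-in under those assumptions, not the IPEM Auditory Peripheral Module or the Van Immerseel and Martens model, and saxtone.wav is again a hypothetical file name.

```python
# Sketch: a much-simplified auditory front end producing a crude "nerve image".
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter

sr, x = wavfile.read("saxtone.wav")            # hypothetical file name, mono assumed
x = x.astype(float) / np.max(np.abs(x))

centers = np.geomspace(100, 5000, 20)          # 20 channels, log-spaced center frequencies
channels = []
for fc in centers:
    low, high = fc / 1.3, fc * 1.3             # crude constant-Q band edges
    b, a = butter(2, [low / (sr / 2), high / (sr / 2)], btype="band")
    band = lfilter(b, a, x)                    # inner-ear band-pass filtering
    band = np.maximum(band, 0.0)               # half-wave rectification ("hair cell")
    b_lp, a_lp = butter(2, 200 / (sr / 2))     # smooth to a rate-like envelope
    channels.append(lfilter(b_lp, a_lp, band))

ani = np.array(channels)                       # channels x samples excitation pattern
```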
The IPEM-toolbox contains a number of modules built on top of the Auditory Peripheral Module, but I will only mention the Roughness Module here. Roughness, or sensory dissonance, was introduced by Helmholtz as a description of the texture of a sound with impure or unpleasant qualities, and it can be defined as the energy of the relevant beating frequencies in the auditory channels (Leman 2000). As such, roughness is considered to be closely related to micro-level texture perception. The upper section of the figure below shows the energy distributed over the auditory channels for the Rollins tone (Example 15a), and the middle section shows how the energy of the beating frequencies contributes to the roughness curve shown at the bottom. In this example, though, there is not much change in pitch, so the roughness varies very little.
Output of the Roughness Module of the IPEM-toolbox for the Rollins tone. The upper section shows the energy distributed over the auditory channels, the middle section shows the contribution of the beating frequencies to the roughness curve at the bottom.
The point of this example is that the model actually manages to recognize a “constant” pitch with a slightly changing texture. It suggests that perceptual models are a promising way of extracting perceptually relevant information from sound, and such approaches will hopefully lead to major advances in the study of music in the coming years.
Conclusions
This chapter has presented various ways of analysing and displaying musical sound, exemplified with a seemingly simple saxophone tone lasting less than 3 seconds. As it turned out, this short example was in no way simple and well defined. The pitch tracker landed an octave below the perceived pitch, and the spectrogram revealed quite large shifts in spectral energy throughout the sound. I also showed that about 60 harmonic frequencies are necessary to adequately synthesize the tone.
So the main conclusion from this discussion is that what might initially be thought of as a single tone with a slightly changing timbre is in no way easy to define and describe in physical terms. Furthermore, since Krumhansl and Iverson (1992) found that pitch and timbre actually interact for isolated tones, I agree with Houtsma (1997) in suggesting that the concepts of pitch and timbre should never be presented as independent variables. Developing perceptual models that take the multi-dimensional characteristics of sound into account is therefore important, and the IPEM-toolbox is a good example of such an implementation.
This text is part of my master’s thesis.