MP3: Friend of the Youth or Enemy of the Sound? A discussion of different sound formats and problems with sound compression

May 30, 2000

Term paper in “Musikkvitenskap mellomfag” spring 2000

Abstract

The paper starts with presenting some of the concepts behind digital audio compression, before describing some of the most popular sound formats available today: the different standards in the MPEG-family, RealAudio, ATRAC, MS Audio, SACD and DVD Audio. The author argues that there are lots of positive aspects of sound compression, but perhaps this overwhelming popularity will limit the development of new and better standards, like Super Audio CD or DVD Audio.

Introduction

The last years have shown a growing amount of various multimedia standards and applications, like MP3, MPEG, MD, DVD, DAB, AC-3 and RealAudio. Similar for all of them is the dependency on sound compression during digital transfer, and they have all been applied to a wide range of applications (Brandenburg and MPEG-2 FAQ):

Broadcasting: Digital Audio Broadcasting (DAB, ADR, Worldspace Radio, Cable Radio, Internet Radio), cable and satellite TV (DVB, USSB, DirecTV, EchoStar)
Storage: Digital Video (DVB, video CD, DVD), Digital Compact Cassette (DCC), Solid State Storage Audio, Portable music devices (MP3-players)
Multimedia: Computer based Multimedia (e.g. Java, Flash, games, consumer programs), multimedia on the Internet
Telecommunication: ISDN transmission, contribution links, distribution links

All the big companies behind the different standards claim that their product provides the best HI-FI quality at the lowest bit rate. But how do these standards actually work and which one is better for what use?

I will start off by briefly describing some concepts of digital audio compression, and how insight into psychoacoustics can help produce transparent sound compression. I assume the reader to have basic knowledge of digital signal processing, and will therefore not define standard concepts. Then I present some of the most popular sound formats, both those intended mainly for Internet usage and those giving high quality sound. Finally, I will discuss how the enormous popularity of standards using sound compression might result in unconsciousness about sound quality, and how this can limit the development of better standards. It is then interesting to pose the question: is MP3 the friend of the youth or the enemy of the sound quality?

Principles of Digital Audio Compression

With analog systems the different possibilities of audio quality was basically limited to choosing between stereo or mono, and the quality of the tape. Unlike the virtually “infinite” quality of analog systems, digital signals are dependent on the conflicting interests of high sampling rates versus small storing space. When Sony/Philips introduced the CD in 1980, they settled at a standard of 44,1 kHz and 16 bit. This confirms with the concept “Nyman frequency” telling us that the sampling frequency has to be minimum twice the highest frequency in the signal to avoid distortion (Jensenius, 1999). Since the human ear is capable of hearing sounds up to 20 kHz, the CD-medium should be able to present all frequencies audible to the human ear.

The audio on a CD is stored in a format called Pulse Code Modulation (PCM), where each sample is represented as an independent code (Pan, 1993). This requires a huge amount of samples to reproduce a good signal. We can easily calculate the amount of storage space necessary to save one minute of CD-quality sound, when we know that there are 44 100 samples every second, and that there are eight bits per byte:

44 100 samples/s * 2 channels * 2 bytes/sample * 60 s/min = 10 MB/min

If we were to have such audio files on the Internet, it would take up to an hour just to download one minute of high quality music, using a conventional modem. Clearly, it was necessary to develop systems to compress the sound while keeping up a high sound quality.

2.1 Lossless Coding

An ideal coding scheme allows for reconstruction of the original signal. One method of perceiving this is by dividing the signal up into 4 categories: irrelevant, redundant, relevant, and not redundant. The scheme will then remove either the amount of irrelevant or redundant information or both. This type of encoder can give a compression ratio of 1:2 up to 1:3,5, dependent on the signal, and still be able to fully reconstruct the original sound (Erne, 1998: 152). Different encoders use both linear prediction and a transformation with entropy encoding (for example Huffmann). The linear predictor minimises the variance of the difference in signals between samples. Then the entropy coder allocates codewords to the different samples (ib.), so that they can be reproduced in the correct order.

2.2 Psychoacoustics

During the years scientists have discovered a range of disabilities in the human ear. These prove extremely useful when compressing sound, as the whole idea of psychoacoustic models is to determine what parts of a sound are acoustically irrelevant.

An interesting result is that the sensitivity of the ear varies with frequency. The ear is most sensitive to frequencies in the neighbourhood of 4 kHz. Thus some sound pressure levels that can be detected at 4 kHz will not be heard at other frequencies. This also means that two tones of equal power but different frequency will probably not sound equally loud. Equi-loudness curves showing this effect is graphed in Figure 1a. The dashed curve indicates the minimum level at which the ear can detect a tone at a given frequency (Tsutsui, 1992). Filters based on this concept are used in most coding algorithms.

Another important concept is that of auditory noise masking. A perceptual weakness of the ear occurs whenever the presence of a strong audio signal makes a spectral neighbourhood of weaker audio signals imperceptible (Pan, 1993: 6). For a certain period of time only the strongest tonal signal may necessarily be presented, because the weaker signals will not be audible anyway. Look at the examples of simultaneous masking and temporal masking in Figure 1b and 1c. From these we can conclude that simultaneous masking is more effective when the frequency of the masked signal is equal to or higher than that of the masker. As well, forward masking can be effective for a longer time after the masker has stopped than the backwards masking. Both these concepts greatly help to compress the sound signal.

Figure 1: a) Equi-loudness curves b) Simultaneous masking curve c) Example of Temporal mask-ing (Tsutsui 1992)

The concept of dividing the spectrum into critical bands, is explained by the ear’s tendency to analyse the audible frequency range using a set of subbands. These subbands can be thought of as the frequency scale used by the ear. The frequencies within a critical band are similar in terms of the ear’s perception, and will therefore be processed separately from sound in the other critical bands. As we see from Table 1, the critical bands are much wider for higher frequencies than for lower. This means that the ear receives more information from the low frequencies than from the higher (Tsutsui, 1992), and this should be thought of when deciding what parts to compress the most in a signal.

Figure 2: Model of MPEG-1 audio encoding (MPEG Audio FAQ). — *Table 1: Critical bands (Tsutsui, 1992)*

MP3: Friend of the Youth or Enemy of the Sound? A discussion of different sound formats and problems with sound compression

Abstract

2.1 Lossless Coding

2.2 Psychoacoustics

3.1 MPEG-1

3.2 MPEG-2

3.3 RealAudio G2

3.4 Microsoft Audio v4.0

3.5 Minidisc/ATRAC

3.6 MPEG-4

3.7 MPEG-7

3.8 DVD Audio

3.9 SACD

Bibliography