Here I discuss neural networks as one specific type of computer model that learns to process information in a way that may be similar to human perception. Theoretical background for neural networks and an example of training feedforward networks with timbre will be presented.
Connectionist vs. Symbolic Models
Previous chapters have shown the multi-dimensionality and complexity of auditory signals, and some of the difficulties that arise when analysing and representing music from such a sub-symbolic input. But since music perception seems to be quite an “easy” task for humans, we should try to make computer models work in a way similar to the human brain. I therefore chose to look at artificial neural networks, and how they can be used to simulate neural activity. Before going into more detail about neural networks, it is worth mentioning that such networks are but one of many different models of “intelligent” computational systems. Models in the world of “artificial intelligence” seem to be divided into two major directions: symbolic and rule-based models on one side, and connectionist models on the other.
Symbolic models are based on sets of rules for describing structures and processes. Take for example language, where grammar is the set of rules that governs how sentences should be generated and interpreted. To be able to speak or understand a language, it is necessary to know the basics of syntax and semantics. Knowledge of the rules is thus essential for understanding or creating meaningful sentences. A computer model that knows the rules can therefore behave intelligently, able to both interpret and produce valid sequences within the boundaries of the system. Examples of rule-based systems in music composition are counterpoint, Bach chorale harmonization and dodecaphony. There are also many examples of rule-based systems for music analysis, for instance (Lerdahl and Jackendoff 1983). The problem with most such systems, though, is that they will always be incomplete in the sense that they often assume additional knowledge which is not formalized.
Connectionism, on the other hand, is a direction in cognitive science that aims to explain human intellectual abilities using artificial neural networks (often just called neural networks). It is a relatively new direction, if one counts the work of (Rumelhart, McClelland, and others 1988) as the point when this research program came to the front stage (de Sousa 1995). Neural networks are simplified models of the brain, composed of a large number of units connected together by weights. These weights measure the strength of the connections between units in the network (Garson 2002). Connectionist models are therefore often referred to as parallel distributed processing, because the information contained in the system is not localized in single units in memory, but rather “stored” as activation throughout the whole network (Rumelhart, McClelland, and others 1988). This is similar to how the human brain is believed to function, and such modelling is thus also sometimes called “neuromimetic”.
Spangler (1999) argues that, when dealing with music, rule-based algorithms have several advantages. First of all, he claims that almost all music is rhythmic and tonal, and can therefore be measured in terms of quantized pitch and duration. Since rule-based systems are inherently discrete they will be able to account for this. He further argues that in a rule-based system it is straightforward to find the “reasoning” behind an output of the system, and therefore it is easier to determine where mistakes are made and what changes should be made to the algorithm. This is as opposed to a connectionist system where it might be very difficult to determine why a multi-layer network, based on dynamical principles, makes a given decision. He therefore concludes that rule-based systems are better when it comes to music analysis and creation.
A weakness of a rule-based system, however, is the fact that it is limited by its rules, so the system will never be able to go beyond its boundaries. As such, it might work well for making strict counterpoint or analysing Bach chorales, but it will have serious problems when presented with elements that have not been, and/or cannot be, formalized in the system.
Another problem with rule-based systems is that they work serially. Just like a serial computer, such a system can only do one operation at a time (Edelman 1992). This means that even though computers are getting faster and more powerful, the system will be slowed down because it always has to compare a target with each item in “memory” before finding a solution. This might secure accuracy within the system, but it will always be bound by its inability to “reason” and learn from experience.
These weaknesses support the claim that such serial processes are fundamentally different from the processes of the human brain, since a rule-based model is serial and symbolic, while the brain is believed to be a parallel and distributed system (Smolensky, Mozer, and Rumelhart 1996). Even though the brain cannot perform calculations as fast as computers, it still has the ability to reason across all boundaries and quickly recognize and understand complex structures. And if there are “holes” in our memory, for example due to lack of information, we are still able to reason and understand structures.
In recent years, attempts to combine the central principles of computation in connectionist networks with those of symbolic computation have resulted in Optimality Theory (Prince and Smolensky 1993). It will be interesting to see this theory applied to musical material.
The Self-Organizing Map
Before going into more detail about feedforward neural networks, I will just briefly mention one of the more popular connectionist models in recent years, the Self-Organizing Map (SOM). The SOM was intended as an effective software tool for conversion of nonlinear and high-dimensional data into simple geometric relationships on a low-dimensional display (Kohonen 2001: 106). This is interesting since it allows us to “see” for example an eight-dimensional structure when it is represented in a two-dimensional map.
The SOM is a method of unsupervised learning, meaning that the model is given a set of inputs and has to organize the content based on similarity within the data sets. Its great strength is the ability to learn structures in highly scattered and nonlinear material, and to organize such large and complex sets into maps where items with similar features are plotted close to each other. In cases where the data cannot easily be described in terms of mathematical functions, the SOM may still relatively easily “see” the structures. A SOM is thus a simple abstraction of complex data, and has proven valuable in a number of complicated tasks. The figure below shows an example of a SOM where countries are mapped according to living conditions based on a 39-dimensional data set (with information such as state of health, nutrition, educational services, etc.) from the World Bank statistics of 1992.
The popularity of the SOM algorithm has increased considerably in recent years, and today it seems to be used in a wide variety of disciplines. Related to the field of this project is the use of SOMs for categorizing timbre in (Cosi, De Poli, and Lauzzana 1994) and (Feiten and Günzel 1994), in the development of analytical tools for speech processing (De Poli and Prandoni 1997), and in various types of music analysis, for example (Leman 1995). For this project, however, I decided to test out feedforward neural networks.

Example of a SOM where countries are organized according to a 39-dimensional data set indicating living conditions. Reproduced from (Kaski 1997).
Feedforward Neural Networks
Another popular connectionist model is the feedforward neural network (the figure below). As opposed to SOMs, feedforward networks are based on supervised learning. That is, the network is trained with sets of both input and output data, so it learns specific outputs for given sets of input values. The following presentation is based on (Wasserman 1989).

A fully connected multi-layer feedforward neural network consisting of an input, hidden and output layer.
A fully connected feedforward network is shown in the figure below. Such a network is based on layers of neurons connected with weights. The neuron can be thought of as a very simple “computer”, since it has the ability to receive, process and transmit signals to other neurons. The weights between neurons govern how the network will process information, and a network is learning by changing these weights. In a feedforward network, all connections between neurons are going in the same direction, and the network therefore “feeds” its information forward.
The input to the neuron comes from the neurons preceding it in the net. The neuron sums its inputs and automatically produces an output when that sum reaches a certain level. When the neuron outputs, or “fires”, it influences the neurons it is connected to further on in the chain. A sketch of a simple artificial neuron is shown in the figure below.
Artificial neuron with activation function (Wasserman 1989).
The inputs $x_1, x_2, \dots, x_n$ applied to the neuron come from other neurons preceding it in the network. Each input is multiplied by a corresponding weight $w_1, w_2, \dots, w_n$, as an analogy to the strength of that link. In the neuron, the weighted inputs are summed to find the activation level of the neuron. In compact vector form this is shown in the figure as $\mathrm{NET} = \mathbf{X}\mathbf{W}$ (here, scalars (quantities that have magnitude but not direction) are shown in normal type; vectors (quantities that have both magnitude and direction) are shown in bold). The function governing the output of the neuron is represented by $F(t)$. This can be either a linear function or a nonlinear threshold function, where $t$ is some constant threshold value. Exactly what type of function can be used depends on the paradigm and the algorithm being used (Kartalopoulos 1996). For now, I will just mention that two of the most popular threshold functions are the sigmoid and the hard limiter (the figure below). A sigmoid function outputs values between $0$ and $1$, and is popular because it is monotonic and has a simple derivative.

Two different types of threshold functions, the sigmoid and hard limiter.
A hard limiter, on the other hand, is discontinuous at the threshold, jumping directly between its upper and lower bounds. It outputs either $0$ or $1$, for example such that
$$ \mathrm{OUT} = \begin{cases} 1 & \text{if } \mathrm{NET} > t \\ 0 & \text{otherwise} \end{cases} $$
A neuron based on such a hard limiter function is called a perceptron. The perceptron is the basis for the backpropagation algorithm that will be discussed in more detail in the next section.
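As a small illustration of the neuron just described, the following sketch (the function name and NumPy usage are my own, not taken from any of the cited software) computes NET as the weighted sum of the inputs and applies either the sigmoid or the hard limiter as threshold function:

```python
import numpy as np

def neuron(x, w, t=0.0, f="sigmoid"):
    """A single artificial neuron: NET = X.W, OUT = F(NET)."""
    net = np.dot(x, w)                        # weighted sum of the inputs
    if f == "sigmoid":
        return 1.0 / (1.0 + np.exp(-net))     # smooth output between 0 and 1
    return 1.0 if net > t else 0.0            # hard limiter (perceptron): 0 or 1
```

For example, `neuron([1.0, 1.0], [0.5, 0.5], f="hard")` fires because NET = 1 exceeds the threshold, while the sigmoid variant returns a graded value between 0 and 1.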
Most of the training algorithms in use today have evolved from the ideas of Hebb (Wasserman 1989). In his model from 1949, Hebb suggested that the synaptic strength, or the weight between two neurons, would be increased if both the source and destination neuron were activated. The idea behind this is simply that paths of activated neurons that occur often will also tend to happen often in the future, the classical idea of learning through experience. This type of learning is often called Hebbian learning (Kartalopoulos 1996).
An example from human perception of music may help to clarify this concept. Cadences have been mentioned earlier in this thesis as an example of how expectation arises if we hear a IIm7-V7 progression. But since we sometimes also encounter the IIm7-V7-VIm progression, this will also be “assigned” a relatively higher probability than other progressions. This way we are trained in recognizing patterns and their possible outcomes. So a IIm7-V7 will generate high expectations for either a I or a VIm chord. In a network such outcomes are governed by the strength of the weights between neurons, and the network will automatically “load” the sequence that is most likely to match the input.
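The Hebbian idea can be written as a one-line weight update, $\Delta w_i = \eta \, x_i \, y$: a weight grows only when its source activation $x_i$ and destination activation $y$ are both non-zero. A minimal sketch (the function name is mine):

```python
def hebbian_update(w, x, y, eta=0.1):
    """Hebb's rule: strengthen each weight in proportion to the joint
    activation of the source neuron (x) and the destination neuron (y)."""
    return [wi + eta * xi * y for wi, xi in zip(w, x)]
```

Repeatedly co-activated paths accumulate weight, so they become ever more likely to “fire” together in the future, which is the classical idea of learning through experience.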
A learning algorithm for a perceptron works like this:
1. Apply an input pattern and calculate the output $Y$.
2. Evaluation:
   a. If the output is correct, go to step 1;
   b. If the output is incorrect and is zero, add each input to its corresponding weight;
   c. If the output is incorrect and is one, subtract each input from its corresponding weight.
3. Go to step 1.
For continuous inputs and outputs, this method is generalized to what is called the Delta Rule. From step 2 of the perceptron learning algorithm, the difference between the target output $T$ and the actual output $A$ may be represented as
$$ \delta = T - A \quad $$
Notice how step 2.a corresponds to $\delta = 0$, 2.b corresponds to $\delta > 0$, and 2.c corresponds to $\delta < 0$. For all these cases, the algorithm is satisfied if $\delta$ is multiplied by the value of each input $x_i$ to the perceptron, and this product is added to the corresponding weight. Introducing the coefficient $\eta$ as a learning rate to control the average size of weight changes we get
$$ \Delta_i = \eta \, \delta \, x_i $$
$$ w_i(n+1) = w_i(n) + \Delta_i \quad $$
where
$\Delta_i$ = the correction associated with the $i$-th input $x_i$
$w_i(n+1)$ = the value of weight $i$ after adjustment
$w_i(n)$ = the value of weight $i$ before adjustment
This rule works for both continuous and binary inputs and outputs. A problem, however, is that there is no way to know in advance the number of training cycles required, except that it is a finite number.
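The perceptron learning algorithm with the Delta Rule can be sketched as follows. This is a toy illustration with made-up names, treating the threshold $t$ as a trainable bias rather than a fixed constant:

```python
def train_delta_rule(patterns, n_inputs, eta=0.2, cycles=100):
    """Perceptron training with the Delta Rule: delta = T - A,
    w_i(n+1) = w_i(n) + eta * delta * x_i.
    `patterns` is a list of (inputs, target) pairs with binary targets."""
    w = [0.0] * n_inputs
    t = 0.0                                  # threshold, trained like an extra weight
    for _ in range(cycles):
        for x, target in patterns:
            net = sum(wi * xi for wi, xi in zip(w, x)) - t
            out = 1.0 if net > 0 else 0.0    # hard-limiter output
            delta = target - out             # delta = T - A
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            t -= eta * delta                 # threshold moves opposite to the weights
    return w, t
```

Trained on the patterns of a logical OR, for instance, the weights converge after a few cycles, and the perceptron then classifies all four input patterns correctly.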
Backpropagation
One of the most popular training processes for feedforward neural networks is the backpropagation algorithm. This algorithm was presented in 1986 by Rumelhart, Hinton and Williams, only to discover that it had been anticipated several times before, and as early as 1974 by Werbos (Wasserman 1989). This algorithm works with multi-layer networks such as shown in the figure below, where there is one input, one hidden and one output layer of connected neurons. The neuron used in the backpropagation algorithm is shown in the figure below, and is a slightly modified version of the neuron shown in the previous section.

Sketch of an artificial neuron with activation function. The inputs are multiplied with the corresponding weights, summed and sent out as NET and through the threshold function $F$.
This neuron produces both NET and OUT signals, where NET is simply a sum of all the input values multiplied with the corresponding weight, such that
$$ \mathrm{NET} = x_1 w_1 + x_2 w_2 + \dots + x_n w_n = \sum_{i=1}^n x_i w_i $$
The backpropagation algorithm requires the threshold function to be continuous and differentiable, so instead of the hard limiter, the neuron uses a sigmoid function, outputting a value between $0$ and $1$, where
$$ \mathrm{OUT} = F(\mathrm{NET}) $$
An overview of the training process for the backpropagation network is as follows:
1. Select the next training pair and apply the input vector to the network input.
2. Calculate the output of the network.
3. Calculate the error between the network output and the desired output.
4. Adjust the weights of the network to minimize the error.
5. Repeat steps 1-4 for each vector in the training set. Stop when the sum error is low enough.
The error found in step 3 is the basis for tracing how the total error of the network changes over time. This is important for deciding when to stop the training process.
The weights of the output layer are adjusted using an equation quite similar to the delta rule presented earlier, so
$$ \Delta w_{pq,k} = \eta \, \delta_{q,k} \, \mathrm{OUT}_{p,j} $$
and
$$ w_{pq,k}(n+1) = w_{pq,k}(n) + \Delta w_{pq,k} \quad $$
The hidden layers have no target vector, but backpropagation solves this by propagating the output error back through the network layer by layer, adjusting each layer on its way. For a hidden layer, $\delta$ cannot be found directly from a target, and must instead be calculated from the deltas of the following layer by
$$ \delta_{p,j} = \mathrm{OUT}_{p,j}\left(1 - \mathrm{OUT}_{p,j}\right)\sum_q \delta_{q,k} \, w_{pq,k} $$
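The update equations above can be turned into a small training loop. The following NumPy sketch is my own illustration (it is not the SNNS implementation used later in this chapter, and all names are mine); it trains one hidden layer with per-pattern weight updates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, n_hidden=8, eta=0.2, cycles=3000, seed=0):
    """Backpropagation for a 2-layer network, following the equations above:
    output delta = OUT(1 - OUT)(T - OUT); hidden delta propagated back."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden))    # input -> hidden weights
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, T.shape[1]))    # hidden -> output weights
    for _ in range(cycles):
        for x, t in zip(X, T):                   # weights updated per training vector
            h = sigmoid(x @ W1)                  # hidden-layer OUT
            o = sigmoid(h @ W2)                  # output-layer OUT
            d_out = o * (1 - o) * (t - o)        # delta for the output layer
            d_hid = h * (1 - h) * (W2 @ d_out)   # error propagated back to hidden layer
            W2 += eta * np.outer(h, d_out)       # delta_w = eta * delta * OUT
            W1 += eta * np.outer(x, d_hid)
    return W1, W2

def forward(W1, W2, x):
    """Run a trained network on one input vector."""
    return sigmoid(sigmoid(x @ W1) @ W2)
```

Since this bare sketch has no bias neurons, a constant input of 1 can be appended to each input vector to play that role when trying it out.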
When the network is training, the general error function will (usually) decrease as the weights are adjusted. Eventually the network will come to a point where the error function does not get smaller, and that is called a minimum. There might, however, be several local minima, but only one global minimum. Such an error graph as a function of the weights is shown in the figure below.
Even though there are two cases that result in a low error in the figure below, there is only one global minimum. The problem is that when the learning algorithm reaches a local minimum (point A), it will not be able to reach the global minimum (point B). Recall from the previously presented equations that the weights are adjusted slightly in a direction that further decreases the error. If a local minimum is reached, the error function would have to increase somewhat for the network to get to the global minimum. Since the algorithm cannot increase the error, we will either have to be satisfied with the result or start a new training session. The reason that a new session might overcome the local minimum is that starting with “new” randomized weights for the whole network gives a different starting point in the error landscape. This might make it easier to “jump” over the local minimum and get to the global minimum (Dolson 1991).

Error as function of weights. There might be several local minima, but only one global minimum. An analogy to the training progression can be that of a ball rolling down a slope. Ideally it should roll all the way down to the global minimum, but it might also get stuck in the local minimum, and the training will have to start over.
Even though the backpropagation algorithm has been one of the most popular neural network algorithms, it has also been criticized for its nonbiological behaviour. Kartalopoulos (1996) argues that biological neurons do not seem to work backwards to adjust their synaptic weights. As such, the algorithm cannot be seen as a learning process simulating biological behaviour, but rather as a method to design a network with learning. It has also been argued that the algorithm involves a lot of calculations and trains slowly for large networks, since the calculation time is proportional to the size of the network. Training is much faster when the weights are updated after each training vector rather than after the whole training set (Tørresen 1997), but still the algorithm is not very well suited for real-time calculations. For my purpose, however, it has worked fine.
Simulating Timbre Recognition in a Neural Network
As a test, I decided to see if neural networks can be “trained” with timbre. If so, such a trained network could then be used for either analysis or synthesis of instruments. The reasoning behind this is that humans have no problem recognizing or identifying timbre, and since neural networks are meant to resemble the human brain, they should also be able to do this. This idea was first suggested by (Dolson 1991), and later experiments by Wessel, Drame and Wright (1998) showed that this is indeed possible.
But what does it actually mean to “learn” the timbre of an instrument? What values should be used to train with? How can the training values be related to our perception? Since the interest is on investigating human perception, we should start by remembering that we perceive a tone with an associated pitch, loudness and timbre. So a neural network should therefore learn to recognize these three components. The first problem, however, is to figure out how these three elements can be described in terms of the physical signal.
In the following I will refer to the Rollins tone (Example 15a). Let us start by investigating the output of the pitch tracker in Addan. This is a file containing the values for F0 over time and looks like this:
0.020000 202.519958
0.030000 204.513733
0.040000 205.732803
0.050000 206.172501
0.060000 206.209244
0.070000 206.219498
0.080000 206.302322
0.090000 206.415588
0.100000 206.585770
... ...
Here the left column shows the start of each 10-millisecond time window (in seconds), while the right column shows the fundamental frequency (in Hz). This file thus helps solve the first problem, namely that of finding the pitch of the signal.
Next, a file with the spectral analysis of the sound shows information about the partials for each time window:
108 0.020000
1 206.744980 0.0133795282 2.397758
2 408.126007 0.0262175743 1.841558
3 609.051758 0.0387362503 1.888313
4 803.625549 0.1311578155 -2.563423
5 1010.885498 0.0567401312 2.021351
6 1215.241211 0.0564972423 0.464872
7 1428.672852 0.0234822202 0.049138
8 1623.785889 0.0651926547 -2.553016
9 1824.361450 0.0508930534 2.791992
10 2022.632690 0.0816930383 1.236977
... ... ... ...
This file is organized with a “header” before each new list of partials. One such header is shown above, and consists of a line with the numbers 108 and 0.020000. The first number (108) is the total number of partial frequencies found, and the second number (0.02) is the start of the time window in seconds. This indicates that there will be 108 lines containing information about the partials following this header. Each line of information about the partials contains four columns: partial number, frequency (Hz), loudness (relative amplitude), and phase (radians).
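Given the format just described, reading these files back is straightforward. The following parser is a hypothetical illustration written for this text (the function name is mine, not part of Addan): it reads each header, then collects the stated number of partial lines for that time window:

```python
def parse_partials(lines):
    """Parse the Addan spectral-analysis format described above:
    a header line '<n_partials> <window_start>' followed by n_partials
    lines of 'number frequency amplitude phase'. Returns a list of
    (window_start, partials) tuples, one per time window."""
    windows = []
    it = iter(lines)
    for header in it:
        fields = header.split()
        if len(fields) != 2:
            continue                      # skip blank or malformed lines
        n, t0 = int(fields[0]), float(fields[1])
        partials = []
        for _ in range(n):
            num, freq, amp, phase = next(it).split()
            partials.append((int(num), float(freq), float(amp), float(phase)))
        windows.append((t0, partials))
    return windows
```

Each returned tuple then holds everything known about one analysis window: its start time and the number, frequency, amplitude and phase of every detected partial.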
Based on this information we need to extract the features that correspond to what is perceived as the loudness and timbre of the tone. Since the amplitude of each partial frequency is given in this file, it should be possible to estimate the perceived overall loudness of the tone from these values. When it comes to timbre, it can be adequately described with reference to roughly the first 60 partials and their amplitudes (as shown in Section 5.3). To simplify things, we can assume that the partial frequencies are harmonic, so that they can be calculated by multiplying the value of F0 with the corresponding harmonic number. Since F0 is already in the training set, it is then sufficient to include the amplitudes of each harmonic to represent the timbre of the tone.
To summarize what the training data looks like, these are the following relationships between perceived attributes and physical signal:
- Pitch ⟷ Fundamental frequency (F0)
- Loudness ⟷ Sum of partial amplitudes
- Timbre ⟷ Set of partial amplitudes
Since feedforward networks are based on supervised learning, they are trained by applying both “question” and “answer” in the training process. As such, they learn to associate relationships between different sets of values. In this simulation I wanted the network to learn relationships between F0 and the sum of partial amplitudes on one side, and the set of partial amplitudes on the other. This means that a trained network would output the amplitude values of 60 harmonics when controlled by F0 and overall loudness.
There are some drawbacks with this approach. First, the assumption that all partials are harmonics will necessarily remove information about any inharmonic movement and minor transient changes. However, to reduce the training data to a feasible size, this seems to be the best option. Second, the networks will only be trained with “stationary” spectra, so there will be no information connecting consecutive time windows in the trained network. This means that the trained network will only be able to reproduce sound successfully if it is controlled with values of F0 and loudness similar to what it has been trained with. However, since my interest in this simulation was only to see whether networks can actually learn complex data structures such as timbre, this was not really a problem for this project.
Training the Neural Network
All the tools used in this simulation are shown in the figure below. In the following I will briefly go through the various parts.

Work chain from original sound in to network-synthesized sound out.
First of all, it was necessary to decide on the sound input. To simplify things I decided to use saxophone sounds only, taken from The Solo Album by Sonny Rollins. This CD features nearly an hour of solo tenor saxophone, and is therefore a good source for finding examples from the entire range of the instrument. The recording quality is also quite good, with little noise and external sounds. From this CD five segments were selected (Examples 17a-e) of varying length and complexity, covering a large dynamic range and register of the instrument.
Second, pitch tracking and spectral analysis were done in Addan and output as text files. The content of these files was shown in the previous section, and formed the basis for extracting material to be used for the training.
When it comes to the training of the networks, I never intended to implement the backpropagation algorithm myself, and therefore relied on finding available software. The Neural Network Toolbox in Matlab could have been used, but I finally selected the Stuttgart Neural Network Simulator (SNNS). With the newly released Java implementation JavaNNS, this runs smoothly on both Windows and Linux computers, and it offers far more options than necessary. However, the most important factor for choosing JavaNNS was a small Windows program called SNNS2C, which was distributed with the software. This program takes a trained network file and converts it to a C-function, which can be compiled to a MAX/MSP object. This way it would be possible to write a MAX/MSP patch controlling the trained neural network.
Since JavaNNS was chosen for the training, the next problem was to merge and format the analysis data from Addan into the Pattern-files required by JavaNNS. This turned out to be the most time-consuming part of the whole project, since I ended up learning Perl-scripting from scratch. In retrospect, this could probably have been solved more easily in for example Matlab or even MAX/MSP, but I ended up spending weeks of debugging Perl-code before finally getting the little program Addan2Pat to work correctly. The source code for this program can be found in Appendix 3, and in the following its operations will be briefly explained.
Addan2Pat takes two files as input, one containing the values of F0 over time and another with the results from the spectral analysis (see Section 6.5), and outputs a file in standard SNNS pattern-format:
SNNS pattern definition file V3.2
generated at Fri Apr 22 13:25:23 2002
No. of patterns : 279
No. of input units : 2
No. of output units : 60
# Input pattern 1:
0.853751817703236 0.9146766973
# Output pattern 1:
0.0133795282 0.0262175743 0.0387362503 0.1311578155 0.0567401312
0.0564972423 0.0234822202 0.0651926547 0.0508930534 0.0816930383
0.0626092851 0.0177947525 0.0163455717 0.0173242018 0.0184137784 0
0.0192239378 0.024825735 0 0.0105436686 0 0 0.0119776605 0.017275244
0.0108059598 0.016561918 0.0206235815 0.0120163495 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This pattern-file contains a header with information about when the file was created, the number of patterns (training sets), and the number of input and output values that the network will be asked to adjust its weights according to. The two values shown under “Input pattern 1” are F0 and overall loudness, while the values shown under “Output pattern 1” are the 60 harmonic amplitudes. These input and output pattern values make up one training set.
Notice that all the values in the pattern-file are in the range between 0 and 1. This is due to the dynamics of the network: training is faster when all the values are in a similar range. The values were therefore normalized and linearly scaled to these limits in Addan2Pat.
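The linear scaling to the range [0, 1] can be sketched like this. This is only a minimal illustration of the normalization step (the function name is mine, not taken from the Addan2Pat source):

```python
def normalize(values, lo=None, hi=None):
    """Linearly scale `values` into the range [0, 1].
    lo/hi default to the min/max of the data, so the extremes map to 0 and 1."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1.0        # avoid division by zero for constant data
    return [(v - lo) / span for v in values]
```

Passing explicit `lo`/`hi` bounds (for example the playable F0 range of the instrument) instead of the per-file min/max would keep the scaling consistent across training files, which is one way the rough normalization mentioned later could be improved.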
The pattern-files created by Addan2Pat were opened in JavaNNS and training was done with the learning rate coefficient $\eta$ set to $0.2$ (see Equation 3). The weights of the network were randomized prior to training, and the patterns of training data were shuffled for each training cycle. Due to the quite large training sets, I had to run up to 3000 cycles before the network was well trained. The figure below shows a screenshot from JavaNNS with the fully connected network in the background, the error graph and the control panel.

Interface of JavaNNS. The 2-60-60 network is shown in the background. A plot of the decreasing error function is shown in the bottom left corner, and a control panel for the training is shown in the bottom right corner.
The trained network was saved to a text file containing information about connections of the network and the values of the weights between neurons. This text file was converted to a C-function with the little program SNNS2C, and this function was used together with a “C-wrapper” (The C-wrapper was written by Matthew Wright at CNMAT, and is available in Appendix 4.) when compiling a MAX/MSP object of the trained network. Compilation was done with the commercially available Metrowerks CodeWarrior 7.0 under Mac OS 9.2. So finally, after all these various stages, the compiled object could be used for timbre synthesis.
The Trained Neural Network Object
The compiled object of the trained network can be used as any other MAX/MSP object. As shown in the figure below, the object takes two inputs (F0 and overall loudness) and outputs a list of 60 values (amplitudes for each harmonic). It is important to remember that even though the network has been trained to learn relationships between F0/Loudness and sets of amplitudes, the actual output of the network will be based on the overall activation of the network. This means that applying a certain F0/Loudness to the network will not necessarily result in a set of amplitudes that is an exact match with the training data, since the weights have been adjusted to give the best overall performance. On the other hand, this flexibility of the network makes it able to generalize beyond the data sets it has been trained with.

The trained network object takes two inputs (pitch, loudness) and outputs 60 values as a list (amplitudes for each harmonic).
To control the object I made the patch NN-Control. The “interior” of the user interface is shown in the figure below, and shows how a multislider object with 60 sliders is connected to the network object. The harmonic frequencies are found by simply multiplying the value of F0 with the corresponding harmonic number, as described for some of the previous patches (see Section 2.3). The list of harmonic frequencies is merged with the list of harmonic amplitudes coming from the network, and sent to the sinusoids~ object for the additive synthesis. The input values to the network can be controlled by changing the F0 and Loudness sliders, or by using the two-dimensional “control space” allowing control of pitch (horizontal) and loudness (vertical) with a mouse or graphical tablet.
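The same additive synthesis can also be sketched offline. The following NumPy illustration (the names are mine; the actual patch uses the sinusoids~ object in MAX/MSP) sums one sinusoid per harmonic, with frequency $n \cdot F0$ and the amplitude produced by the network for harmonic $n$:

```python
import numpy as np

def synthesize(f0, amps, dur=0.5, sr=44100):
    """Additive synthesis from network output: harmonic n has
    frequency n * f0 and amplitude amps[n - 1]."""
    t = np.arange(int(dur * sr)) / sr             # sample times in seconds
    out = np.zeros_like(t)
    for n, a in enumerate(amps, start=1):
        out += a * np.sin(2 * np.pi * n * f0 * t) # one sinusoid per harmonic
    return out
```

Feeding it an F0 and a 60-element amplitude list, such as the network outputs, yields half a second of the corresponding “stationary” spectrum.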
I recommend that the reader try playing with the object. Notice that even though the network only produces “stationary” spectra, there is certainly some saxophone quality in the output sound.

Control patch for the trained neural network object. The network is connected to a multislider showing the amplitudes of each harmonic.
The fact that the network actually managed to train, and that the resultant sound has some saxophone-like quality, can be taken as an indication of a successful simulation. It can thus be concluded that the network did indeed manage to “learn” timbre. However, there are also a number of elements that can be improved. First, the simple addition of harmonic amplitudes to find the overall loudness does not take into account the non-linearity of the auditory system; implementing an auditory model would be much better. Second, the normalization and scaling of the training values were rather rough, and I suspect this to be a reason for the problems with too high values for the upper partials of the output sound.
A more fundamental weakness of this model is that it does not take into account the development of partials (harmonic and inharmonic) through time. This might be improved in the future by training a second network with some time-varying parameters. Yet another improvement would be to train networks with timbre from various instruments, and also to associate instrument names with these different training sets. This could again be used to compile reversed network objects, where timbre can be input and the network can relate this to the correct instrument name. These and many other improvements will have to wait for future projects. As for now, I am quite satisfied to have gotten this far.
Conclusions
This chapter started with a discussion of the differences between rule-based and connectionist models. A problem with rule-based systems is that they work serially and will therefore be bound by always doing processing sequentially. They will also always be limited to the rules they are set up with, and can never learn to generalize or come up with solutions outside of their “domain”. Connectionist models or artificial neural networks, on the other hand, are parallel and distributed systems working by activation of neurons. They can therefore process different types of information in parallel throughout the network, and can learn from experience by adjusting the weights between the neurons.
My interest was, besides learning the basic theories of neural networks, to see if it is possible to train a neural network with timbre. I decided to train networks to learn relationships between fundamental frequency (F0) and overall loudness as input values, and sets of harmonic amplitudes as output values. The focus of the simulation was more on understanding the concepts behind neural networks and getting everything to work, than the actual results. Thus the expectations were moderate and the fact that I actually managed to get every chain in the process to work, was in itself satisfactory. This involved doing spectral analysis in Addan, writing the Perl-script Addan2pat, training networks in JavaNNS, converting the trained networks to C-functions with SNNS2C, compiling a MAX/MSP object with a “C-wrapper”, and finally making a patch running the object in MAX/MSP. This setup seems to work well and might be suitable for future experiments.
The fact that the networks managed to train well and that the output sound might be characterized as “saxophone-like” seems like a good indication that the simulation was successful. The sound output of the network object is nevertheless not very pleasing, because the networks are only trained to learn “stationary” spectra. An improvement of the model would thus be to add some way of controlling instrument-specific envelopes. More training data, better normalization, and the implementation of an auditory model would also be good improvements for future simulations.
Despite the seemingly “uselessness” of the trained object as it is now, the results might be taken as an indication that neural networks can be used for understanding more of how the human brain is recognizing timbre. With the improvements mentioned above, I believe such a model can be used both for recognition of different instruments directly from the sound source and also as a control structure for instrument synthesis.
This text is part of my master’s thesis.