Introduction to Machine Learning and Reservoir Computing #

Reservoir computing is a computational framework derived from recurrent neural network models. It can be used for machine learning, a field that has seen a tremendous rise over the past decade, mainly due to increased computational resources and the utilisation of Graphics Processing Units (GPU) (OpenAI, 2018, https://openai.com/blog/ai-and-compute/). This chapter provides a high-level overview of machine learning techniques, focusing on (physical) reservoir computing. We start from a very descriptive introduction to machine learning and gradually add details until we have covered all parts necessary to understand the remainder of this dissertation.

Machine Learning #

Machine learning techniques aim to learn a suitable model from the data (within a specific set of constraints), thus skipping the need to model the underlying processes explicitly. They are flexible algorithms whose parameters are tuned using training data but should generalise well to unseen data and conditions. Essentially, machine learning algorithms operate as black boxes that process inputs and provide outputs. The inner workings of the model are – in general – not related to the underlying processes. The model cannot be interpreted by looking at the model parameters. This is in stark contrast to white box models, where parameters of the model correspond to interpretable concepts and underlying processes.

Take, for instance, the Gompertz model for plant growth (Tjørve & Tjørve, 2017):

W(t) = A \exp\left[-\exp\left(-k_G(t-T_i)\right)\right]\text{.}

W(t) is the expected value (mass or length) as a function of time (for example, days since germination or growing degree days), A represents the upper asymptote (saturation value), k_G is a growth-rate coefficient (which affects the slope), and T_i represents the time at inflexion, i.e. the time at which growth is maximal.
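To make the roles of the parameters concrete, the following minimal Python sketch evaluates the Gompertz curve over a growing season. The parameter values are illustrative assumptions, not fitted values:

```python
import numpy as np

def gompertz(t, A, k_G, T_i):
    """Gompertz growth curve: expected mass or length at time t."""
    return A * np.exp(-np.exp(-k_G * (t - T_i)))

# Illustrative, assumed parameter values:
# saturation at 100 g, growth-rate coefficient 0.1/day, inflexion at day 40.
t = np.linspace(0, 100, 101)            # days since germination
w = gompertz(t, A=100.0, k_G=0.1, T_i=40.0)

print(w[0], w[40], w[-1])               # slow start, fastest growth around T_i, saturation near A
```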

Figure 2.1: Gompertz growth model for plants.

The Gompertz equation originates from the study of human mortality (Gompertz, 1825). Later, it was identified from experimental data that it is also applicable to other growth processes, including plant growth (Tjørve & Tjørve, 2017). Each of the parameters in the model originates from observations and is readily interpretable. However, formulating a model like eq. 2.1 is not always obvious, and in many situations, it is impossible to find such a general model.

Consider, for example, the following problem: we want to predict the dry matter yield of ryegrass at the end of the growing season based on the current growing state of a field. This is highly relevant information for breeders to speed up the selection process of new cultivars. Illustrative data is depicted in figure 2.2 for several cultivars. Intuitively, we can see that it is impossible to find an exact solution to this problem. There are many influential factors, including the weather, genetic factors, different management practices and soil heterogeneity. So finding the underlying model is infeasible. However, based on previous data, a machine learning system can learn part of the underlying factors contributing to the yield and how these factors contribute to the result. As such, it can provide an estimate of the yield.

Figure 2.2: Dry matter yield data from a ryegrass trial at ILVO.

The aforementioned problem is an example of supervised learning. In supervised learning, the algorithm is presented with a series of input-output pairs, and the goal is for the learned model to approximate the output from the input data. Moreover, it should do so such that previously unseen inputs also generate the correct outputs. However, not all problems have well-defined inputs and outputs. When this is the case, the term unsupervised learning is used because there is no direct way to evaluate the performance of the model. Here, the machine learning algorithm should try to find patterns in the data. Another class of machine learning problems is reinforcement learning. In this last case, the algorithm interacts with a dynamic environment to achieve a certain goal. For instance, we can train a computer to play a game like pong, a simple computer version of ping-pong. The agent can take three actions: stay in place, move up or move down. A single move up or down is not rewarded, but bouncing the ball back to the other side is. A series of actions is required to achieve the goal. Moreover, there is no single predefined set of actions that is best. Another famous example is AlphaGo Zero (Silver et al., 2017). The rules of the game of Go are straightforward, yet the number of possible actions is enormous. Moreover, it is challenging to define intermediate steps to help the agent learn the game. Nonetheless, researchers at Google DeepMind managed to create a system that can beat the best human players (Silver et al., 2017; Borowiec, 2016).

This introduction focuses on supervised learning since reservoir computing is traditionally applied in a supervised context. Supervised learning is typically divided roughly into regression and classification problems. In regression problems, the goal is to predict a certain quantity (a real number) from the input data; the ryegrass yield prediction above is one example. Other examples include stock price prediction, plant biomass estimation from image data, object counting in images and weather prediction. Yet, not all problems fall within this category. There are also classification problems, where the goal is to categorise the data into a certain set of classes. For instance, the automated subject detection in Google Photos, handwritten character recognition on envelopes, natural language problems, and product recommendation systems (Netflix, groceries) are all examples of classification problems. Reservoir computing can be used for both problem classes, but in this dissertation, the focus will be restricted further to regression problems only.

Now we shift our focus from the goal and output to the machine learning model itself. In the next sections, we discuss several linear and nonlinear models that are used later in this dissertation. A first set of models are the linear models, discussed in Linear Regression Models. These are the simplest models. Their behaviour is well understood, and optimisation is straightforward. Nonlinear models are more powerful, at the cost of increased complexity and more difficult optimisation.

Linear Regression Models #

One of the simplest types of models is the linear model. Linear models have the advantage that their behaviour is usually well understood and that they are easy to optimise and use. Consequently, they are very popular in nearly all fields of research, from micro-climate research (Wild et al., 2019) to sociology (Schwemmer & Wieczorek, 2020).

To illustrate the power and limitations of linear models, we consider the following problem: we want to predict the growth curve of a ryegrass variety. Such curves are useful to later construct a simulated version in the computer. As a first attempt, we fit a polynomial to a series of observations. These observations x_i are the inputs of the model, sometimes also labelled input features. The data are depicted in figure 2.3. Before we can fit a model to the data, we first have to decide on the order of the polynomial that we want to use. One can choose a linear, second or fifth-order polynomial, for example. Depending on the order, the model behaves very differently, as illustrated in figure 2.3. The order of the model is a so-called hyperparameter.

Hyperparameters tune the model’s abilities and come in various forms. They can determine the model size (like here), learning ability or learning speed, for instance. They share the common attribute that they remain constant after initialisation. However, the choice for these parameters is very important. We will go into more detail on this in Generalisation, Bias and Variance.

Suppose that we want to fit a second-order polynomial to this data; then we have three unknown model parameters: w_0, w_1 and w_2. The model output \hat{y}(t) has the following form:

\hat{y}(t) = w_0 + w_1 x(t) + w_2 x^2(t)\text{.}
Figure 2.3: Linear polynomial regression models fit to dataset with unknown underlying model.

Usually, one wants to optimise the coefficients to minimise the mean squared error (MSE) between the predicted output \hat{y}_i and observed output y_i for all samples (M in total):

\mathcal{L} = \dfrac{1}{M} \sum_{i=0}^{M-1} \left(y_i - \hat{y}_i\right)^2\text{.}

Finding the optimal values for all w_i ’s is not trivial from this formulation. To obtain a more workable form, we have to rewrite this as a matrix multiplication:

\begin{aligned} \text{w} & = {\begin{bmatrix}w_0 & w_1 & w_2 \end{bmatrix}}^{\intercal} \\ \text{x}_i & = {\begin{bmatrix}1 & x_i & x_i^2\end{bmatrix}}^\intercal \\ \hat{y}_i & = {\text{x}^{\intercal}_i} \text{w}\text{.} \end{aligned}

If we concatenate all samples into a single matrix, we obtain:

\hat{\text{y}} = \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \vdots \\ \hat{y}_{M-1}\end{bmatrix} = \begin{bmatrix} {\text{x}^{\intercal}_0}\text{w} \\ {\text{x}^{\intercal}_1}\text{w} \\ \vdots \\ {\text{x}^{\intercal}_{M-1}}\text{w}\end{bmatrix} = \text{X}\text{w} \text{.}

Now, we can rewrite the loss \mathcal{L} to:

\begin{aligned} \mathcal{L} & = \dfrac{1}{M} \lVert \text{y} - \hat{\text{y}} \rVert_2^2 \\ & = \dfrac{1}{M} \lVert \text{y} - \text{X}\text{w} \rVert_2^2 \\ & = \dfrac{1}{M} {\left(\text{y} - \text{X}\text{w}\right)}^{\intercal} \left(\text{y} - \text{X}\text{w}\right)\text{.} \end{aligned}

We want to minimise the error with respect to \text{w} , so we have to differentiate the above equation with respect to \text{w} :

\begin{aligned} \nabla_{\text{w}}\mathcal{L} & = -\dfrac{2}{M}{\text{X}^{\intercal}}{\left(\text{y}-\text{X}\text{w}\right)}\text{.} \end{aligned}

When we assume that \text{X} has full rank (and thus, {\text{X}^{\intercal}}\text{X} is positive definite), the above expression for \nabla_{\text{w}}\mathcal{L} has a unique root, and we can solve for \text{w} by setting the gradient to zero:

\begin{aligned} \text{0} & = -\dfrac{2}{M}{\text{X}^{\intercal}}\left(\text{y}-\text{X}\text{w}\right) \\ \Leftrightarrow \text{0} & = {\text{X}^{\intercal}}\text{y} - {\text{X}^{\intercal}}\text{X}\text{w} \\ \Rightarrow \text{w} & = {\left({\text{X}}^{\intercal}\text{X}\right)}^{-1}{\text{X}}^{\intercal}\text{y}\text{.} \end{aligned}
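A minimal NumPy sketch of this closed-form solution on assumed synthetic data (in practice one would prefer a linear solver, as below, or np.linalg.lstsq over an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic observations for illustration.
x = np.linspace(0, 10, 25)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)

# Design matrix X for a second-order polynomial: columns 1, x, x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: w = (X^T X)^{-1} X^T y, solved without forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w
mse = np.mean((y - y_hat) ** 2)
print("weights:", w, "MSE:", mse)
```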

Now, we have the optimal values for \text{w} with respect to \mathcal{L}. However, the choice for a second-order polynomial was a bit arbitrary. We could have chosen a first or fifth-order as well and also have obtained unique solutions for \text{w}. As one can see in Figure 2.3, the more complex the chosen model, the better the model fits our data. However, this does not imply that the obtained model is a good fit for the problem at hand. We need to modify our approach to make the final model behave well on unseen data. More on this in Generalisation, Bias and Variance. First, we continue our explorations of linear models and their properties.

So far, we have used a second-order polynomial, but this is not the only form of linear model (Bishop, 2006). Generally, a linear model takes the following form:

\hat{y}(t) = w_0 + \sum_{i=1}^{N} w_i \phi_i(\text{x}(t))\text{.}

Again, we observe the bias term w_0, and one major addition: \phi(\cdot). This is a fixed nonlinear transformation. In the previous example, these were \phi_1(x) = x and \phi_2(x) = x^2. We also observe that the input is generalised to \text{x}(t) to enable the use of multiple input variables or features.
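The same machinery applies to any fixed set of basis functions; only the construction of the design matrix changes, after which the normal equations from above can be reused. A minimal sketch with an assumed set of Gaussian basis functions:

```python
import numpy as np

def design_matrix(x, centres, width=1.0):
    """Columns: a bias term plus one Gaussian bump phi_i(x) per centre."""
    phis = [np.exp(-((x - c) ** 2) / (2 * width**2)) for c in centres]
    return np.column_stack([np.ones_like(x)] + phis)

x = np.linspace(0, 10, 25)
Phi = design_matrix(x, centres=[2.0, 5.0, 8.0])   # assumed, illustrative centres
print(Phi.shape)                                  # (25, 4): bias + three basis functions
```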

Generalisation, Bias and Variance #

While the model obtained in Linear Regression Models is optimised, it is generally not usable on new data. The number of samples relative to the number of parameters is quite low. The model thus has the ability to fit the data almost exactly due to its many degrees of freedom. Suppose we perform a new set of measurements (depicted in Figure 2.4) and measure the model performance in terms of MSE. In this case, the higher-order polynomials indeed appear a lot less attractive. The results are summarised in Table 2.1.

Figure 2.4: Models from figure 2.3 compared with new sample data (dataset 2).
Table 2.1: Polynomial model comparison between training data (dataset 1) and validation data (dataset 2), containing unseen data that was not used for model optimisation. Lower values indicate better performance for the MSE error metric.
| MSE       | 1st order | 2nd order | 5th order |
|-----------|-----------|-----------|-----------|
| dataset 1 | 34.57     | 11.66     | 4.35      |
| dataset 2 | 37.34     | 18.33     | 4.88      |

This poor performance of the fifth-order model is due to overfitting: the model is overly optimised to the data used during optimisation. These data are often labelled training data because they are used to tune the weights of the model. As a result, these data cannot be used to assess the model's performance on unseen data. Of course, the model behaves well on them; we tuned it to do so. It is also clear that the fifth-order model fits the training data better than the first-order model because of its larger number of weights. This enables the model to capture more of the variation present in the data. This variation stems both from the underlying processes that generated the data and from noise in the data. Generally, these underlying processes are unknown, and there is usually no way to distinguish between the two, but obviously, the model should not be optimised to the noise present in the data.

One common technique to improve model performance is regularisation. Regularisation adds the model's weight values (except for w_0) to the error function, scaled by a parameter \lambda:

\mathcal{L} = \dfrac{1}{M} \sum_{i=0}^{M-1}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{i=1}^{N-1}w_i^2\text{.}
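Minimising this regularised loss again yields a closed-form solution. A minimal sketch, assuming the polynomial design matrix from before and leaving the bias weight w_0 unpenalised, as in the equation above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimiser of L = (1/M)||y - Xw||^2 + lam * sum_{i>=1} w_i^2."""
    M, N = X.shape
    penalty = np.eye(N)
    penalty[0, 0] = 0.0                      # do not penalise the bias weight w_0
    # Setting the gradient to zero gives (X^T X + lam*M*I') w = X^T y.
    return np.linalg.solve(X.T @ X + lam * M * penalty, X.T @ y)

# Assumed example: a fifth-order polynomial design matrix on toy data.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 25)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([x**p for p in range(6)])    # columns x^0 .. x^5

w = ridge_fit(X, y, lam=0.1)
print(w)
```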

The choice of \lambda determines the range of the weight coefficients in \text{w}. Effectively, we have now introduced a new parameter in our model. Since \lambda tunes the model's ability to learn, it is another example of a hyperparameter, similar to the order of the employed polynomial. A hyperparameter is a powerful tool, but it also introduces an additional training step: determining the optimal value of all hyperparameters. More on this later in this section.

Yet, why does this help in improving model performance? This stems from the observation that coefficients tend to take on high values when the noise in the data is also modelled. In our example from Figure 2.3, the mean absolute coefficient value is 0.45, 0.12 and 1.81 for the first, second and fifth-order model respectively. The better performing models have smaller absolute weight values.

But how should \lambda be determined? This is usually done using a separate dataset, the validation data. First, one has to determine an appropriate interval for \lambda; since this is not known beforehand, a logarithmic sweep is often performed, because the order of magnitude of \lambda matters more than its exact value. Second, the model is optimised on the training data for each of these candidate values. So for each new set of hyperparameters, a new set of weight coefficients is determined, since this optimisation step is performed anew. Third, the validation data is used to assess the model's performance for this specific choice of \lambda. Based on the optimal value of the performance metric, we select the final value of \lambda. The model that was trained on the training data and performs best on the validation data is thus selected.
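A minimal sketch of this selection procedure on assumed toy data, repeating the ridge helper from the previous sketch so the example is self-contained:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; the bias column (first column) is not penalised."""
    M, N = X.shape
    penalty = np.eye(N)
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * M * penalty, X.T @ y)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Assumed illustrative data: fifth-order polynomial design matrix, split into train/validation.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([x**p for p in range(6)])
idx = rng.permutation(x.size)
train, val = idx[:28], idx[28:]

# Logarithmic sweep: the order of magnitude of lambda is what matters.
lambdas = 10.0 ** np.arange(-5, 6)
results = []
for lam in lambdas:
    w = ridge_fit(X[train], y[train], lam)            # weights retrained for every candidate
    results.append((mse(y[val], X[val] @ w), lam, w))  # evaluated on held-out data

best_mse, best_lam, best_w = min(results, key=lambda r: r[0])
print("selected lambda:", best_lam, "validation MSE:", best_mse)
```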

When applying these steps to our toy problem, we obtain Figure 2.5. The model performance is very different on the training and validation data. Generally speaking, lowering the value of \lambda improves performance on the training data (Figure 2.5a). This corresponds to reducing the penalty on large weights, thus removing restrictions on the model weights. However, the same observation does not hold for the validation dataset (Figure 2.5b). The MSE has a clear minimum for all models, with decreased performance for very large and very small values of \lambda, especially for the fifth-order model. For small values of \lambda, the model captures part of the noise characteristics, as is evident from the very large MSE values. Increasing \lambda decreases the error until a minimum is reached. Higher \lambda values restrict the weight coefficients too much to learn a useful model. As a result, the MSE increases again. Figure 2.5c visualises the magnitude of the weight coefficients for variable \lambda. We can clearly observe the decay of the weight coefficients for large values of \lambda.

Note also that \mathcal{L} and MSE are two different optimisation targets. \mathcal{L} is the loss function: a model-dependent function that is used to optimise the model coefficients, of which eq. 2.16 is an example. However, it is not used to evaluate the model performance on unseen data. For instance, in the case of the loss function in eq. 2.16, the model weights also contribute to the loss. This is undesirable for comparisons since it adds a bias term that is model-dependent instead of data-dependent. As a result, a different performance metric is used, such as the MSE. This metric is used to determine the hyperparameters based on the validation data and to compare the performance of different models on the test data.

Figure 2.5: Effect of hyperparameter λ on model performance and coefficient distribution. Polynomial model of first, second and fifth-order’s performance for MSE loss function for the training (a) and validation datasets (b) for variable regularisation parameter λ. (c) depicts the evolution of the coefficient distribution for different λ.

Naively, one could use the minimum value obtained using the above procedure as the final model performance. However, since we used the validation data to finalise the hyperparameter choice of the model, this is not an unbiased estimate. We need a third dataset, often called the test data that is not used for training or validation, i.e. that has not been used to optimise the model in any way. The performance on this dataset is the final performance of the model. Ideally, this should be very close to the validation performance. If this is not the case, then the model is still overfitting on the data, and additional measures are needed to improve performance.

The choice of regularisation depends on the task the model has to perform, and the resulting model can have different properties. The regularisation used in eq. 2.16 is called Tikhonov regularisation (Tikhonov, 1963), and the resulting regression model (i.e. the combination of a linear model with Tikhonov regularisation) is often called ridge regression. A more general form is:

\mathcal{L} = \dfrac{1}{M} \sum_{i=0}^{M-1}{\left(y_i - \hat{y_i}\right)}^2 + \lambda\sum_{i=1}^{N-1}{\lvert w_i\rvert}^q\text{.}

Figure 2.6 visually depicts three values of q and their effect on the structure of the solution \text{w}^* in two dimensions. To understand Figure 2.6, we can interpret the above equation as eq. 2.3, subject to the following constraint:

\sum_{i=1}^{N-1}{\lvert w_i\rvert}^q \le \gamma\text{.}

Simply put, we split eq. 2.17 into two parts that have to be met simultaneously to minimise \mathcal{L}. The first part of the sum is the same as eq. 2.3, which defines circular contours in the case of a two-dimensional weight vector. The second part can be rewritten as eq. 2.18, which defines a centred shape as illustrated in Figure 2.6. Because both conditions have to be met, the intersection point defines the optimal value for \text{w}: \text{w}^*.

Figure 2.6: Contour plots of the weight vector w in blue, subject to different regularisation constraints (q = 0.5, 1, 2). The optimal weight vector w^* is depicted by a dot.

For q\le1, the solution will generally be sparse since the derivative is not defined at points on the axes (i.e. the sharp tips). Consequently, the chance that the optimum lies on one of the axes is much higher. Sparse solutions can be interesting when there are many inputs (and thus also weights), and one wants to find the most important ones. The least absolute shrinkage and selection operator (LASSO) (q=1) is a popular regression model that produces a sparse solution (Tibshirani, 1996).
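As an illustration of such sparsity, a minimal sketch using scikit-learn's Lasso on assumed toy data with ten candidate features, of which only two actually matter (here, alpha plays the role of \lambda and the features are standardised so the penalty acts evenly on all coefficients):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Assumed toy data: ten input features, only the first two influence the target.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(X_std, y)

print(np.round(model.coef_, 2))   # most coefficients end up exactly zero
```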

Figure 2.7 visualises the optimised models corresponding to the minimal MSE in Figure 2.5. All three models are now a lot closer to each other. Performance on the three datasets is summarised in Table 2.2, where the test data is new data that was not used for training or validation. Generally, all three datasets are obtained by splitting the initial data into three subsets. Sometimes more complex schemes are used, such as K-fold cross-validation, but these are beyond the scope of this introduction.

Figure 2.7: Optimised polynomial regression models of first, second and fifth-order subject to ridge regularisation.

Splitting the data cannot be done arbitrarily. Suppose we want to classify leaf images as diseased or healthy. To that end, we have a dataset of 200 images: 100 disease infected leaf images and 100 healthy leaf images. If we were to use a train dataset of 75 healthy and 5 infected images, a validation dataset of 20 healthy images and no infected images, and the remaining images for testing, then the model would perform poorly. Clearly, this is not a good data split. There are no infected images in the validation dataset and many more in the test data than in the training data. As a result, the model is likely to perform poorly on infected leaf images due to the lack of examples during training and a bias towards healthy leaves in the validation data. Thus, when splitting data into multiple subsets, it is important to make sure they are independent and identically distributed. This means that there should be no overlap or correlation between the different subsets, and all subsets should present the data in a similar way (i.e. have the same percentage of healthy vs. infected leaves in this case).
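A minimal sketch of such a stratified three-way split using scikit-learn (assuming scikit-learn is available; the feature vectors here are random stand-ins for the leaf images, and the stratify argument keeps the healthy/infected ratio identical in every subset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the leaf dataset: 100 healthy (0) and 100 infected (1) samples.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 16))
y = np.array([0] * 100 + [1] * 100)

# First split off the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, np.bincount(labels))   # each subset keeps the 50/50 class balance
```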

In reality, it is often even more complicated, as it is more likely that there will be more images of the healthy subclass than of the diseased subclass. In that case, additional measures are needed that are beyond the scope of this introduction.

Table 2.2: Performance comparison (MSE) of the optimised polynomial regression models from Figure 2.7 on three datasets: the training, validation and test data.

| MSE             | 1st order | 2nd order | 5th order |
|-----------------|-----------|-----------|-----------|
| train data      | 42.003    | 14.071    | 10.944    |
| validation data | 47.498    | 17.922    | 15.168    |
| test data       | 43.032    | 10.322    | 62.95     |

Let us now revisit our model choice. In Linear Regression Models, we assumed a second-order model. However, this choice was quite arbitrary; nothing prevents us from taking a fifth-order model. As can be seen in Figure 2.3, this model fits the sample data much better than the second-order model. However, when the noise in our data is slightly different, the model behaves very differently in some locations, as observed in Figure 2.8, while the first and second-order models remain much more alike.

Figure 2.8: Optimised polynomial regression models of first, second and fifth-order subject to ridge regularisation, trained with a different dataset (but the same underlying world and noise model).

Based on the results in Figure 2.8, one can conclude that a simpler model is the best in this case. Deciding which model is best is, however, not always trivial. This illustration is an example of the bias-variance dilemma.

Suggested works with more in-depth and complete overviews of linear and nonlinear models are Pattern Recognition and Machine Learning (Bishop, 2006) and The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie et al., 2009).

Multi-layer Perceptron Models #

Linear models like those in Linear Regression Models and Generalisation, Bias and Variance are powerful for a wide range of problems, but it is difficult to optimise such a model when there is no direct relationship between the input and output. For instance, when we want to classify images, a linear classifier will typically fail; likewise, when we want to predict short-term electricity demand or the 3D structure of proteins from an amino acid sequence, there is no simple map from the input to the output. To solve such problems, researchers often use artificial neural networks (Jumper et al., 2021; Wen et al., 2020; Mohanty et al., 2016). These are (very loosely) biologically-inspired systems.

The human brain is composed of up to 86 billion neurons, and each neuron has approximately 7000 synaptic interconnections (Herculano-Houzel, 2009). Neurons process spikes: they receive spikes from other neurons and sensory organs and, depending on the frequency and intensity of these spikes, can generate spikes themselves. One can view the brain as a very large set of very simple processing units that, when combined, can solve very complex tasks. All these processing units operate in parallel.

Although we are not interested in the exact functioning of the human brain, we can use it as inspiration to design systems that solve tasks our brain solves easily. Such networks are called artificial neural networks (ANN) since they are man-made (Rosenblatt, 1961). A simple example of a feed-forward artificial neural network is depicted in Figure 2.9 as a graph. Artificial neurons or perceptrons are displayed as nodes, while the directional arcs (arrows) represent the interconnectivity from one perceptron to the next. This network has a layered structure. The main idea behind this is that each layer processes the input at a higher level of abstraction until we reach the final output layer and the desired output. Additionally, if there were only a single layer, the ANN would reduce to a linear model.

Figure 2.9: Simple feed-forward ANN with three fully interconnected hidden layers.

To make this more concrete, consider the following example: handwritten digit recognition. Each layer is designed to integrate an increasing amount of information. The first layer can detect strokes, the next layer combines these strokes into corners or circles, and the final layers combine all this information to classify the image as a certain digit.

Not only the interconnectivity is important; equally important is how a neuron processes its inputs. The output of neuron i, y_i, is composed of a weighted sum of the inputs that is nonlinearly transformed using an activation function \psi:

y_{i} = \psi{\left(\sum_{j=1}^{C} x_{j} w_{j}\right)}\text{.}

This is visualised in Figure 2.10.
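In code, a single perceptron is just a weighted sum followed by an activation function; a minimal sketch with assumed example values and a tanh activation:

```python
import numpy as np

def perceptron(x, w, psi=np.tanh):
    """Output of one artificial neuron: a nonlinearity applied to a weighted sum."""
    return psi(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs from the previous layer (assumed values)
w = np.array([0.1, 0.4, -0.2])   # one weight per input (assumed values)
print(perceptron(x, w))
```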

Figure 2.10: A single perceptron.

Figure 2.11 depicts historically popular activation functions such as the hyperbolic tangent (\tanh) and the logistic function, as well as the more modern rectified linear unit (ReLU). ReLU-based ANN (and their many variants) are generally easier to compute and train than their historical counterparts because the ReLU avoids vanishing gradients.

Figure 2.11: ANN activation functions.
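For reference, the three activation functions from Figure 2.11 written out in code (a minimal sketch; numerically more robust formulations of the logistic function exist):

```python
import numpy as np

def tanh(a):
    return np.tanh(a)

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

a = np.linspace(-3, 3, 7)
print(tanh(a), logistic(a), relu(a), sep="\n")
```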

Feed-forward neural networks are the simplest kind: the outputs of a layer form the inputs of the next layer until the final layer is reached. A system with one hidden layer is already a general approximator, but modern systems have many more layers since this tends to be more efficient for training. Ideally, we want a system with as few perceptrons as possible since training time increases dramatically when the number of interconnections increases. In practice, this translates to the use of many layers and more specialised architectures where each perceptron is only connected to a part of the perceptrons in the next layer, leading to deep learning (LeCun et al., 2015). Models with over a billion parameters have been developed and trained (Brown et al., 2020).

The previous algorithm from Linear Regression Models does not scale to very large datasets and models. Matrix inversion is a very complex and time-consuming operation, so one has to resort to alternative training methods.

Finding a global optimum is no longer possible due to the lack of a closed-form solution to the optimisation problem. Thus, instead of trying to find a global optimum, we have to resort to finding local optima. Local optima can be obtained by evaluating the model with slightly different parameters in the neighbourhood of its current state. When the model is not yet locally optimal, there will be a set of parameters, close to the current parameters, where the model has better performance. We thus only have to evaluate the model in the neighbourhood of its current coefficients. This approach is called gradient descent since we follow the gradient towards regions with lower loss.
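A minimal sketch of gradient descent on the MSE loss of a simple linear model, using the gradient derived in Linear Regression Models (the toy data, learning rate and number of steps are assumed, illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([np.ones_like(x), x])      # bias + linear term

w = np.zeros(2)                                # arbitrary initial guess
lr = 0.01                                      # learning rate (step size)
M = x.size

for step in range(2000):
    grad = -(2.0 / M) * X.T @ (y - X @ w)      # gradient of the MSE loss
    w -= lr * grad                             # move against the gradient

print(w)   # approaches the closed-form least-squares solution
```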

Figure 2.12: Gradient descent optimisation technique illustrated on the loss-curve from the polynomial regression of the fifth-order model in Generalisation, Bias and Variance.

This technique can also be applied to linear problems. To illustrate the principle, we consider the cost curve from Figure 2.5b for the fifth-order model, shown again in Figure 2.12. Suppose that our initial guess for \lambda is 1.0 (the blue square). Evaluating \lambda in that region will cause the algorithm to decrease the value of \lambda, indicated by the direction of the arrow, since the error decreases in this direction. It will reach a local optimum at the red dot. This example highlights the power of this approach: we need fewer evaluations to find an optimum. It also highlights its weakness: only evaluating the local space around the current point does not, in general, yield the global optimum. The initialisation point is thus very important. Note that while we can apply gradient descent to hyperparameters, this is unconventional; the example here serves solely to illustrate the working principle of gradient descent. Remember from before that the order of magnitude of \lambda matters more than its exact value. In reality, one would, for instance, perform a logarithmic sweep of \lambda at 10^{-5}, 10^{-4}, …, 10^5.

While feed-forward ANN are powerful, they are generally not capable of processing time-series data. Time-series data generally require the system to maintain information about previous inputs. For example, an English word is composed of different sounds. The short sound samples have no real meaning on their own, yet combined, they do. Furthermore, longer-range dependencies also exist: words such as it and there can refer to information passed earlier in a conversation. One way to capture such dependencies is to use a window of inputs. Instead of presenting a single input sample to the model, multiple past samples are presented as well. The window size is a hyperparameter that has to be optimised. However, this approach also has its limitations. For instance, relations that are far apart in time can fall outside the window, and the model can then not perform the correct inference.
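A minimal sketch of such a windowed input construction for a one-dimensional time series (the window size of three is an arbitrary, assumed choice):

```python
import numpy as np

def windowed(u, window):
    """Stack the current sample and the (window - 1) previous ones into one input row."""
    rows = [u[i - window + 1:i + 1] for i in range(window - 1, len(u))]
    return np.array(rows)

u = np.arange(10, dtype=float)     # toy time series
X = windowed(u, window=3)
print(X[:3])
# [[0. 1. 2.]
#  [1. 2. 3.]
#  [2. 3. 4.]]
```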

As a result, researchers have added feedback connections to the network. This way, the network can retain information about its previous state. These feedback connections have far-reaching consequences for the behaviour of the network. A recurrent neural network (RNN) is no longer a mathematical function, as is the case for a feed-forward ANN, but a dynamical system. Due to the existence of feedback, the network can self-sustain temporal activation dynamics without active input to the network. Additionally, the network's internal state becomes a function of previous inputs, resulting in dynamic memory. Consequently, training the network becomes a lot more complex because the cost landscape can change drastically under very small perturbations of the model parameters (bifurcations) (Doya, 1992). Gradient-descent-based algorithms are thus not well suited for RNN. At the turn of the millennium, a lack of alternative training methods led researchers to explore radically different approaches, giving rise to reservoir computing (Hochreiter et al., 2001).

However, two decades later, the field of RNN has progressed significantly, and the historical context that gave rise to reservoir computing is no longer valid today. Long short-term memory (LSTM) cells, first introduced by Hochreiter & Schmidhuber (1997), are currently used in state-of-the-art RNN architectures (Yu et al., 2019). These results have moved beyond what is possible with reservoir computing, and LSTM is now the de facto standard for RNN architectures. LSTM-based RNN have been used to tackle problems in reinforcement learning (Towards AI, 2019), speech recognition (Graves et al., 2013) and sign language translation (Camgoz et al., 2018). Sherstinsky (2020) gives an introduction to LSTM fundamentals.

Reservoir Computing #

Reservoir computing was introduced as a computational framework that uses an RNN (also called a reservoir) whereby the input and hidden layers of the network are randomly initialised and left as is. Only the output mapping is trained, using simple linear regression, often ridge regression. This idea was independently introduced by Jaeger (2001), Maass et al. (2002) and Steil (2004), and it radically simplifies the training of RNN. While very similar ideas were already introduced in the nineties, they failed to attract significant interest from a wide scientific audience (Schomaker, 1992).

Jaeger (2001) approached the issue of systematically training RNN with a general architecture. He proposed to train only the output layer, leaving the randomly initialised connectivity matrix as is. This effectively reduces the whole training procedure to a one-shot linear regression. The network must satisfy the echo state property (also called fading memory), meaning that the current state depends mostly on recent inputs, with decreasing importance as the time difference becomes larger. As a result, inputs in the distant past do not affect the current state of the network. A network that satisfies this property is called an echo state network (ESN).

Maass et al. (2002) were instead investigating a biologically plausible implementation of the brain using spiking neurons. Their liquid state machine (LSM) has a similar structure to an ESN but works in both continuous and discrete time on spike trains (vs. discrete-time signals only for an ESN). Maass et al. (2002) identified two necessary properties that the network should meet in order to function as a universal approximator: the separation and the approximation property. The (point-wise) separation property states that different input sequences generate different internal trajectories, while the approximation property requires that the readout function can approximate any function on a closed and bounded domain with arbitrary precision. An illustration of a spike train and a spiking neuron is depicted in Figure 2.13.

Figure 2.13: Illustration of two spike trains and a spiking neuron. The input spike train is processed by the neuron, yielding the output. A network of such neurons is called an LSM.

The third and final approach, by Steil (2004), is a learning rule for RNN: backpropagation-decorrelation (BPDC). While ESN and LSM use one-shot learning, BPDC is an online learning method. Similar to the other cases, only the output is trained.

While all three methods have a different background, they share the same conceptualisation: leave the reservoir as is and train only the output mapping. This is illustrated in Figure 2.14. One or more inputs are mapped into the reservoir, whose state is used to perform one or more output tasks. Only the red output interconnections (arcs) are trained. The output can optionally be fed back into the reservoir, but this case is not considered here since it is not relevant for the work presented in this dissertation. Because of these large similarities, researchers coined the term reservoir computing as a collective noun for these three methods (Verstraeten et al., 2007; Schrauwen et al., 2007; Lukoševičius & Jaeger, 2009; Lukoševičius et al., 2012).

Figure 2.14: General architecture of an RNN in reservoir computing. Only the weights in red are trained. All other weights remain fixed.

Even earlier publications with similar concepts can be found in the literature, among others sequential associative memory models (https://doi.org/10.13140/2.1.4429.3924), neural oscillator-network models (Schomaker, 1992), context reverberation networks (Kirby, 1991; Kirby & Day, 1990), cortico-striatal models for context-dependent sequence learning (Dominey et al., 1995), and biological neural network models for temporal pattern discrimination (Buonomano & Merzenich, 1995; see also the review by Tanaka et al., 2019).

How can something that is randomly initialised result in well-performing networks? The answer comes from statistics, in the form of kernel support vector machines (SVM). A kernel SVM uses a nonlinear function to transform the input data into a new space, typically a high-dimensional space (feature space) that has attractive properties for the problem under investigation. Linear methods are then applied inside this feature space to solve the problem at hand. Performing this transformation can be very computationally intensive. However, using the kernel trick, SVM avoid explicit computations inside the feature space. The kernel computes a dot product in the feature space directly on the untransformed data. Many different kernels exist, each specifically designed for a certain data type and problem. The only requirement is that the kernel has to be positive definite (Mercer's condition). This is very useful since we are often not interested in the actual transformed data, only in the dot product.
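To make the kernel trick concrete, the following minimal sketch verifies, for the degree-2 polynomial kernel (a standard textbook example chosen here for illustration), that evaluating the kernel on the raw inputs equals an explicit dot product in the corresponding feature space:

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x . z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

def feature_map(x):
    """Explicit feature space of the degree-2 polynomial kernel for 2D inputs."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

print(poly_kernel(x, z))                       # computed on the raw inputs
print(np.dot(feature_map(x), feature_map(z)))  # identical, but via the feature space
```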

Reservoirs operate similarly. They also map the input into a new high-dimensional space where extraction of the target data is straightforward using a linear mapping. The difference lies in the fact that the feature space is computed explicitly instead of via the kernel trick. Additionally, RNN also incorporate time information, something that is not possible with classic SVM (Hermans & Schrauwen, 2012).

Not only are reservoirs easily trained, but they are also well suited for parallel computations. When training a standard RNN using backpropagation, the system is first trained to solve a single task and has to subsequently learn a second task. This sequential training limits generalisation since the system might unlearn the first task partially. Reservoir computing does not suffer from this problem since the reservoir is left as is. One just has to create an additional mapping from the reservoir towards the output, as in Figure 2.14. Adding or removing an output task does not alter the performance of the others.

While all three approaches mentioned previously can be used for general computation, ESN are the easiest to reason about because of their discrete-time nature and continuous variables. As a result, we focus mostly on networks of the ESN type, since they also resemble hardware-based implementations most closely.

A mathematical description of an input-driven ESN is:

\begin{aligned} \text{x}(n+1) & = (1-\alpha)\text{x}(n) + \alpha \text{f}{\left(\text{W}^{\mathrm{in}}\text{u}(n) + \text{W}\text{x}(n)\right)} \\ \text{y}(n) & = \text{W}^{\mathrm{out}} \text{x}(n)\text{.} \end{aligned}

We assume the input and output targets to be appropriately centred (typically zero-centred), such that bias vectors can be neglected. \text{W}^{\mathrm{in}}, \text{W} and \text{W}^{\mathrm{out}} are the weight matrices of the system. Only \text{W}^{\mathrm{out}} is trained, for example with one of the techniques discussed in Linear Regression Models. While the other two matrices are left unaltered, that does not mean that they are fully random. Important optimisation parameters (hyperparameters) are the spectral radius and sparsity of \text{W}, the input scaling of \text{W}^{\mathrm{in}} and the leak (decay) rate \alpha. Often the hyperbolic tangent (\tanh) is selected as nonlinearity and applied element-wise. For a more practical and complete introduction to training ESN, see Lukoševičius (2012).
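The following minimal sketch implements these update equations and trains the readout \text{W}^{\mathrm{out}} with ridge regression on a toy memory task. All sizes, scalings and the task itself are illustrative assumptions, not the settings used elsewhere in this dissertation; a practical ESN requires careful tuning of the hyperparameters listed above:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- reservoir setup (hyperparameters are illustrative) -------------------
N = 100                                        # reservoir size
alpha = 0.3                                    # leak rate
u = rng.uniform(-0.5, 0.5, size=2000)          # toy input signal
target = np.roll(u, 5)                         # toy task: recall the input 5 steps ago

W_in = rng.uniform(-0.5, 0.5, size=N)          # input weights (single input)
W = rng.normal(size=(N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))      # rescale to spectral radius 0.9

# --- run the reservoir: x(n+1) = (1-a) x(n) + a tanh(W_in u(n) + W x(n)) ---
X = np.zeros((len(u), N))
x = np.zeros(N)
for n in range(len(u) - 1):
    x = (1 - alpha) * x + alpha * np.tanh(W_in * u[n] + W @ x)
    X[n + 1] = x

# --- train the linear readout W_out with ridge regression -----------------
washout = 100                                  # discard the initial transient
A, y = X[washout:], target[washout:]
lam = 1e-6
W_out = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

y_hat = A @ W_out
print("train MSE:", np.mean((y - y_hat) ** 2))
```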

The reservoirs discussed so far have always been input-driven: an input signal is fed into the reservoir. Yet, this need not always be the case. For instance, in robotics, central pattern generators can be used for locomotion (see https://doi.org/10.1007/978-3-319-97628-0_9). They are essential for creating rhythmic signals, which can, in turn, be used to actuate the motors of quadruped robots (Marder & Bucher, 2001; Ijspeert, 2008). Additionally, the core of a reservoir was so far assumed to be an ESN, but this is not a requirement. Appeltant et al. (2011) showed that a delay line with a single nonlinear node has similar properties to an RNN.

Physical Reservoir Computing #

Reservoir computing inherently relies on the properties of the underlying dynamical system. Reservoirs can be implemented using a perceptron network with feedback, but this need not be the case: any dynamical system that satisfies the properties detailed above can be used as a reservoir. Reservoirs are left as is, and only the readout part is tuned. Consequently, a wide range of physical dynamical systems can also be used for computation. When using a physical medium, one is performing physical reservoir computing (PRC).

Using the PRC framework, it is possible to use a physical dynamical system for computation (Nakajima, 2020). Thus, instead of performing control tasks in software, they are now performed in hardware, using either material or structural properties of the system. Morphological computation is a commonly used alternative name for PRC in compliant robotics, yet the two are not the same (see the next section).

However, while any such system can be used, successful computation inherently relies on repeatability: similar inputs should lead to similar outputs. Indeed, if this were not the case, the system would behave in unexpected ways for the same input, leading to chaotic behaviour. Simply put, the time window of previous inputs that affect the reservoir dynamics should be limited. This corresponds to the echo state property of ESN. The conceptualisation of reservoir computing using ESN closely matches the hardware-based descriptions we are interested in, so we will use the conventions and definitions provided by Jaeger (2001; Jaeger & Haas, 2004).

In their review, Tanaka et al. (2019) identify four key characteristics that a physical reservoir should have: (i) high dimensionality, (ii) nonlinearity, (iii) fading memory and (iv) the separation property. While these are useful for the selection of reservoirs, they offer no means of designing better reservoirs that achieve high computational performance. Moreover, researchers still rely on task-specific benchmarks to compare how much computational power can be attained by individual PRC systems (Dambre et al., 2012; Nakajima et al., 2015). Examples of such tasks include the nonlinear auto-regressive moving average (NARMA) task (Atiya & Parlos, 2000) and the Santa Fe tasks (Weigend & Gershenfeld, 1993). This highlights that there is still plenty to be discovered in the field of PRC. A unifying theoretical framework would help to identify objective performance characteristics.

Nonetheless, the inability to indicate sub-optimality of a reservoir has not prevented a wide range of physical media from being exploited for computation. Examples can be found in integrated analogue circuits (Schürmann et al., 2005), memristive devices (Du et al., 2017), integrated photonic devices (Vandoorne et al., 2014), opto-electronics (Paquot et al., 2012), a water bucket (Fernando & Sojakka, 2003), a soft silicone arm (Nakajima et al., 2015), compliant robots (https://doi.org/10.1109/ICAR.2005.1507417; https://doi.org/10.1109/ICRA.2013.6630754), tensegrity robots (Caluwaerts et al., 2013; Caluwaerts et al., 2014; Degrave et al., 2013) and living organisms (https://doi.org/10.1016/j.jneumeth.2007.04.006; Jones et al., 2007).

Recently, a new book on reservoir computing and PRC was published by Nakajima & Fischer (2021): Reservoir Computing: Theory, Physical Implementations, and Applications (Springer). While knowledge of this book is not necessary to understand the work presented here, it offers additional insights, theory and context that are out of scope here.

Morphological Computation #

Morphological computation is the potential of any physical body to take part in computation and problem solving (Hauser et al., 2011). Instead of working around the dynamics of a physical body in control problems, these dynamics should be embraced and leveraged to simplify control. The idea of morphological computation also aligns well with the idea of embodiment: the controller (or brain) and the body cannot be studied separately since both are needed to interact with the physical world (Pfeifer & Gómez, 2009). Clearly, this conceptualisation is closely related to PRC.

Plants are prime examples of organisms that perform morphological computation. For instance, they grow to discover new food sources and to alleviate stress. Other examples in living organisms are widespread, including slime mould (https://doi.org/10.1080/19420889.2015.1059007) and animals (Müller & Hoffmann, 2017).

While morphological computation is an interesting conceptualisation, it does not provide a general framework for computation, though field-specific formalisations exist (e.g., Hauser et al., 2011). However, we can rely on the closely related framework of PRC for a theoretical foundation.

Summary #

In this chapter, we introduced essential concepts from machine learning that are necessary to understand reservoir computing and PRC. While reservoir computing is inspired by the brain and often uses an RNN as a reservoir, the main learning rule is based on linear regression. Elementary concepts such as regularisation, data splitting and hyperparameters were introduced with clear examples. Based on this foundation, one should be able to comprehend the analyses presented in later chapters.

In later chapters, we will build on these foundations to explore PRC with plants. To conclude, we want to stress that while reservoir computing is a general machine learning technique that can be applied to time-series data, PRC is not. PRC is about using a physical medium for computation. Consequently, PRC represents a shift in paradigm: where computation can occur and which types of media can be used for computation.