In tasks such as time series modelling and system control engineering, the primary interest is not in representing the observed data per se, but in how, and how rapidly, a system responds to certain events; that is, in the behaviour of the derivatives of the functions describing such a system.

Kernel machines have become a popular tool for modelling measured data. Having emerged from the combination of several disciplines, they share two ingredients: the kernel trick [1] and the principle of parsimony [2, 3]. The latter is realised by obtaining a sparse model that uses only a small subset of the data to represent a function. However, as yet no special attention has been paid to the smoothness of that function3, i.e., to it having continuous derivatives up to some order. In the case of regression, we want the resulting function to be smooth, and in the case of a classification problem, the resulting decision boundary should be smooth.

The remainder of this paper is organised as follows. In section 2, we introduce derivative kernels and kernel roughness penalties. We note that roughness penalties, applied through penalised regularisation, do not lead to sparseness. In section 3, we introduce a novel Bayesian prior, its related model, and update equations. Section 4 shows experimental results, and section 5 concludes the paper.

In supervised learning, we consider a data set containing input-output pairs $\{(x_i, t_i)\}_{i=1}^{N}$, with the inputs $x_i$ typically multidimensional vectors in $\mathbb{R}^d$, and the outputs $t_i$ representing either classes in the case of classification, or scalars in $\mathbb{R}$ in the case of regression. Wahba [5] shows that we can represent our data using a linear model of the form

$$f(x) = \sum_{i=1}^{N} w_i\, k(x, x_i) + b \qquad (1)$$

with bias $b$, parameters $w_i$, and a kernel function $k(\cdot, \cdot)$ which, under certain conditions, defines an inner product in feature space, $k(x, x') = \langle \phi(x), \phi(x') \rangle$.

In the field of functional data analysis, smoothness is favoured explicitly by applying a roughness penalty [6], a penalty on the degree of curvature of one or more derivatives of $f$. Fortunately, as $f$ in (1) is linear in its parameters, we can get to its derivatives by establishing the derivatives of the kernel functions: $f^{(n)}(x) = \sum_{i=1}^{N} w_i\, k^{(n)}(x, x_i)$, where $k^{(n)}$ denotes the $n$-th order derivative of $k$ with respect to its first argument.

For the $n$-th power of a vector $x \in \mathbb{R}^d$, we define $x^n = \|x\|^n$ for $n$ even, and $x^n = \|x\|^{n-1} x$ for $n$ odd. Using this definition, the dimension of the $n$-th order derivative of a kernel is either $1$ for $n$ even, or $d$ for $n$ odd, where $d$ is the dimension of the underlying vectors $x$ and $y$. Because a kernel is a mapping $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, derivative kernels exist for $n$ even or for $d = 1$ (or both). We discuss derivatives of the commonly used Gaussian kernel and derivatives of the polynomial kernel.

Derivatives of the Gaussian kernel $k(x, y) = \exp\left(-(x - y)^2 / (2\theta^2)\right)$ are identified by Rodrigues' formula for Hermite polynomials

$$H_n(z) = (-1)^n\, e^{z^2} \frac{d^n}{dz^n} e^{-z^2} \qquad (2)$$

By substituting $z = (x - y)/(\sqrt{2}\,\theta)$ in (2), and keeping track of the additional factors of $(\sqrt{2}\,\theta)^{-1}$ introduced by the chain rule, we arrive at the convenient compact form of the derivatives of Gaussian kernels

$$\frac{\partial^n}{\partial x^n} k(x, y) = \left(\frac{-1}{\sqrt{2}\,\theta}\right)^{\!n} H_n\!\left(\frac{x - y}{\sqrt{2}\,\theta}\right) k(x, y).$$
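As an illustration, the compact Hermite form can be evaluated directly with `numpy.polynomial.hermite`, which implements the physicists' Hermite polynomials. The sketch below assumes the one-dimensional Gaussian kernel $\exp(-(x-y)^2/(2\theta^2))$ with width `theta`; the function names are illustrative, not taken from the paper's implementation.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def gauss_kernel(x, y, theta=1.0):
    """1-D Gaussian kernel exp(-(x - y)^2 / (2 theta^2))."""
    return np.exp(-(x - y) ** 2 / (2 * theta ** 2))

def gauss_kernel_deriv(x, y, n, theta=1.0):
    """n-th derivative of the Gaussian kernel w.r.t. x, via Hermite polynomials."""
    u = (x - y) / (np.sqrt(2) * theta)
    c = np.zeros(n + 1); c[n] = 1.0   # coefficient vector selecting H_n alone
    return (-1) ** n * (np.sqrt(2) * theta) ** (-n) * hermval(u, c) * np.exp(-u ** 2)
```

A quick sanity check is to compare, e.g., `gauss_kernel_deriv(x, y, 2)` against a central finite difference of `gauss_kernel`.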

Derivatives of the polynomial kernel $k(x, y) = (x y + c)^p$ are found to be

$$\frac{\partial^n}{\partial x^n} k(x, y) = \frac{p!}{(p-n)!}\; y^n\, (x y + c)^{p-n},$$

which is valid in this instantiation as long as $n \le p$.
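Under the same one-dimensional reading of the kernel, the polynomial-kernel derivative is a falling-factorial expression. The following sketch (illustrative names; assumed kernel $(xy + c)^p$) can again be validated against finite differences.

```python
from math import factorial

def poly_kernel(x, y, p=3, c=1.0):
    """Polynomial kernel (x*y + c)^p for scalar inputs."""
    return (x * y + c) ** p

def poly_kernel_deriv(x, y, n, p=3, c=1.0):
    """n-th derivative of (x*y + c)^p w.r.t. x; zero once n exceeds p."""
    if n > p:
        return 0.0
    falling = factorial(p) // factorial(p - n)   # p (p-1) ... (p-n+1)
    return falling * y ** n * (x * y + c) ** (p - n)
```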

We use a kernel matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ for inputs $x_i, x_j$, and a design matrix $\Phi = [\mathbf{1}\; K]$ that consists of the combination of a column of ones (accounting for the bias $b$) and a kernel matrix. For example, in a regularisation setting, applying a roughness penalty can be accomplished by penalising the summed curvature of the second order derivatives

$$\min_w\; \|t - \Phi w\|^2 + \lambda\, \|\Phi^{(2)} w\|^2 \qquad (3)$$

with $\Phi^{(2)}$ being a design matrix built from a second order derivative kernel. For $\lambda = 0$, this imposes an ordinary least squares problem, and for $\lambda \to \infty$ the result will approximate a straight line. Besides a penalty on the second order derivative as shown in (3), more generally, a linear differential operator

$$Lf = \sum_k \beta_k\, D^k f \qquad (4)$$

can be defined [7], and used to form a weighted sum of penalties on the derivatives of $f$. If applied in penalised regularisation, the objective takes the shape

$$\|t - \Phi w\|^2 + \lambda\, w^\top R\, w \qquad (5)$$

which is solved by $w = (\Phi^\top \Phi + \lambda R)^{-1} \Phi^\top t$, with $R$ the roughness-penalty matrix induced by the operator $L$. Despite reasonable conditions on the matrices $\Phi$ and $R$, in practice this system of equations is already singular in the case of duplicate inputs. In the field of functional data analysis, this is addressed by pre-processing, such as starting with the dominant frequencies [8]. Moreover, the selection of the parameter $\lambda$ is done manually, by methods such as cross-validation.
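The closed-form solution of a roughness-penalised fit is straightforward to sketch in NumPy. The example below assumes a 1-D Gaussian kernel, a sinc target on the interval $[-10, 10]$, a penalty on the second derivative only, and a hand-picked $\lambda$; all of these choices are illustrative, not the paper's experimental setup.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def gk(x, y, n=0, theta=1.0):
    """n-th derivative (w.r.t. x) of a 1-D Gaussian kernel, via Hermite polynomials."""
    u = np.subtract.outer(x, y) / (np.sqrt(2) * theta)
    c = np.zeros(n + 1); c[n] = 1.0
    return (-1) ** n * (np.sqrt(2) * theta) ** (-n) * hermval(u, c) * np.exp(-u ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, 50)
t = np.sinc(x / np.pi) + rng.normal(0.0, 0.1, 50)   # np.sinc(z) = sin(pi z)/(pi z)

Phi  = np.column_stack([np.ones(50), gk(x, x, 0)])   # design matrix [1, K]
Phi2 = np.column_stack([np.zeros(50), gk(x, x, 2)])  # second-derivative design; bias column vanishes
lam  = 1e-3                                          # hand-picked roughness-penalty weight
A = Phi.T @ Phi + lam * Phi2.T @ Phi2
A += 1e-8 * np.eye(A.shape[0])                       # tiny jitter for numerical stability
w = np.linalg.solve(A, Phi.T @ t)
```

Note that duplicate entries in `x` would make the unjittered system singular, as remarked above; random draws avoid this here.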

Although the resulting functions are smooth, regularisation as in (5) does not promote sparseness. We would like to have a representation that is both smooth and sparse, and in addition, an automatic choice of the parameter $\lambda$.

We combine the idea of regularisation by a roughness penalty with a Bayesian sparseness-inducing prior. We presume that the outputs are corrupted by Gaussian noise, $t_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. To promote both smoothness and sparseness, we propose a zero-mean Gaussian prior over the weights $w$, whose precision (inverse covariance) matrix consists of two terms

$$P = \lambda R + A \qquad (6)$$

The first term is the smoothness promoting part, formed by a roughness penalty matrix $R$ defined by a linear differential operator as in (4). The second term is a diagonal matrix $A = \operatorname{diag}(\alpha_1, \ldots, \alpha_M)$ containing one hyper-parameter $\alpha_i$ per weight $w_i$, as in automatic relevance determination (ARD) [9], as applied in the relevance vector machine (RVM) [10].

By applying Bayes' rule we obtain the posterior distribution over the weights, $p(w \mid t, \alpha, \lambda, \sigma^2) = \mathcal{N}(\mu, \Sigma)$,

with $\Sigma = \left(\sigma^{-2} \Phi^\top \Phi + \lambda R + A\right)^{-1}$, and $\mu = \sigma^{-2}\, \Sigma\, \Phi^\top t$.
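The posterior moments can be computed in a few lines. The helper below is a sketch under the assumptions of this section (Gaussian noise with variance $\sigma^2$, prior precision $\lambda R + A$); the function name is ours.

```python
import numpy as np

def posterior(Phi, t, R, alpha, lam, sigma2):
    """Posterior N(mu, Sigma) over the weights for the prior precision lam*R + diag(alpha)."""
    P = lam * R + np.diag(alpha)                      # the two terms of the prior precision
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + P)   # posterior covariance
    mu = Sigma @ (Phi.T @ t) / sigma2                 # posterior mean
    return mu, Sigma
```

As a sanity check, with $R = 0$ and all $\alpha_i = a$ the posterior mean reduces to the ridge estimate $(\Phi^\top \Phi + a \sigma^2 I)^{-1} \Phi^\top t$.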

To obtain (locally) optimal values for $\alpha$, $\lambda$ and $\sigma^2$, we follow a type-II maximum likelihood approach similar to that of the RVM [10]. The update equations for the automatic relevance determination hyper-parameters are given by

$$\alpha_i^{\text{new}} = \frac{\gamma_i}{\mu_i^2}, \qquad \gamma_i = \alpha_i \left( (P^{-1})_{ii} - \Sigma_{ii} \right) \qquad (7)$$

Note that in the case of $\lambda = 0$ we have $P^{-1} = A^{-1}$, so that $\gamma_i = 1 - \alpha_i \Sigma_{ii}$, making (7) identical to the update equation of the original RVM. The roughness-penalty parameter $\lambda$ is updated by

$$\lambda^{\text{new}} = \lambda\, \frac{\operatorname{tr}(P^{-1} R)}{\mu^\top R\, \mu + \operatorname{tr}(\Sigma R)} \qquad (8)$$

and the variance estimate by

$$(\sigma^2)^{\text{new}} = \frac{\|t - \Phi \mu\|^2}{N - M + \operatorname{tr}(P \Sigma)} \qquad (9)$$

where $N$ is the number of data points and $M$ the number of weights.

In practice, we obtain the diagonals of $P^{-1}$ and $\Sigma$ while evaluating (7). We fully compute $P^{-1}$, obtain the other trace in (8) as $\operatorname{tr}(\Sigma R)$, and that of (9) by using $\operatorname{tr}(P \Sigma)$.
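Putting the pieces together, the updates can be iterated until convergence. The following is a minimal sketch, assuming a 1-D Gaussian kernel, a sinc data set, a roughness matrix built from the second-derivative design, EM-style fixed-point updates consistent with the type-II approach of this section, and ad-hoc numerical safeguards (the clipping bounds and the $10^{-12}$ jitter); it illustrates the scheme and is not the paper's implementation.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def gk(x, y, n=0, theta=1.0):
    """n-th derivative (w.r.t. x) of a 1-D Gaussian kernel, via Hermite polynomials."""
    u = np.subtract.outer(x, y) / (np.sqrt(2) * theta)
    c = np.zeros(n + 1); c[n] = 1.0
    return (-1) ** n * (np.sqrt(2) * theta) ** (-n) * hermval(u, c) * np.exp(-u ** 2)

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-10.0, 10.0, N)
t = np.sinc(x / np.pi) + rng.normal(0.0, 0.1, N)    # sinc target plus Gaussian noise

Phi = np.column_stack([np.ones(N), gk(x, x, 0)])    # design matrix [1, K]
D2  = np.column_stack([np.zeros(N), gk(x, x, 2)])   # second-derivative design
R   = D2.T @ D2                                     # roughness-penalty matrix
M   = Phi.shape[1]

alpha, lam, sigma2 = np.full(M, 1.0), 1.0, 0.1
for _ in range(30):
    P     = lam * R + np.diag(alpha)                # prior precision, eq. (6)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + P)
    mu    = Sigma @ (Phi.T @ t) / sigma2
    Pinv  = np.linalg.inv(P)
    gamma = alpha * (np.diag(Pinv) - np.diag(Sigma))           # well-determinedness factors
    alpha = np.clip(gamma / (mu ** 2 + 1e-12), 1e-8, 1e8)      # ARD update, cf. (7)
    lam   = float(np.clip(lam * np.trace(Pinv @ R)
                          / (mu @ R @ mu + np.trace(Sigma @ R) + 1e-12),
                          1e-8, 1e8))                          # roughness update, cf. (8)
    sigma2 = max(np.sum((t - Phi @ mu) ** 2)
                 / (N - M + np.trace(P @ Sigma)), 1e-8)        # noise update, cf. (9)
```

In a production implementation one would additionally prune basis functions whose $\alpha_i$ reach the upper clipping bound, as is common RVM practice.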

To measure the effectiveness of the proposed Bayesian model, we performed a series of experiments. Because the derivatives of the underlying function must be available in order to measure and compare performance, we chose the well-known sinc benchmark.

The data for one experiment consist of 50 points, with inputs $x_i$ uniformly distributed over a fixed interval, and targets $t_i = \operatorname{sinc}(x_i)$ plus additive Gaussian noise. We trained a series of kernel machines using these data with our implementation4. For our method, we chose to penalise the 10th order derivative only, which should effectuate smoothness up to the 8th order [8]. A Gaussian kernel with the same width parameter is used by all methods. We repeated this experiment 1000 times. Table 4 shows the mean and standard deviation of the results: the number of basis vectors, and the root-mean-square errors of the zeroth, second and fourth order derivatives.

| Method | BVs | RMSE $f$ | RMSE $f''$ | RMSE $f^{(4)}$ |
|---|---|---|---|---|
| RVM (old) [10] | 6.5±1.0 | 0.047±0.010 | 0.054±0.014 | 0.106±0.030 |
| RVM (new) [11] | 6.0±1.1 | 0.049±0.010 | 0.056±0.013 | 0.108±0.026 |
| Figueiredo [12] | 6.3±1.3 | 0.052±0.012 | 0.067±0.022 | 0.136±0.051 |
| KRLS [13] | 17.0±0.0 | 0.054±0.012 | 0.077±0.017 | 0.242±0.066 |
| SOG-SVR [14] | 17.0±0.0 | 0.062±0.010 | 0.050±0.008 | 0.062±0.012 |
| AO-SVR [15] | 20.7±3.2 | 0.057±0.011 | 0.112±0.029 | 0.374±0.115 |
| SRVM [this paper] | 16.3±3.4 | 0.039±0.009 | 0.030±0.009 | 0.033±0.013 |

As shown in table 4, smoothness requires notably more basis vectors (BVs) than common sparse Bayesian models. On the other hand, the basis vectors seem to be spent wisely, because the error on all derivative orders is consistently lower than that of the compared methods. Figure 4 illustrates a function that is smooth in output space. Although at first sight the fit of the function itself is not dramatically different from that of the others (left), the gain in quality of fit is clearly visible at a higher order derivative (right).

In this paper we introduced a concept from functional data analysis to promote smoothness in output space. We established derivative kernels, kernel roughness penalties, and a Bayesian model that combines kernel roughness penalties with automatic relevance determination. Experiments showed that smoothness in output space increases the quality of fit on all measured derivative orders. The required number of basis vectors is not the lowest, but the quality of the approximation to the underlying function is empirically better.

Further studies should include the automatic selection of the several weighting coefficients in (4), i.e., a penalty on more than one derivative order. An interesting open theoretical issue is the maximum derivative order that a signal may be expected to possess, given a data set.

1. Aizerman, M., Braverman, E., Rozonoèr, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control **25** (1964) 821–837
2. Schölkopf, B., Smola, A.: Learning with Kernels. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, Massachusetts, USA (2002)
3. Vapnik, V.: The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer-Verlag, New York (1995)
4. Lee, Y., Mangasarian, O.: SSVM: A smooth support vector machine for classification. Computational Optimization and Applications **20**(1) (2001) 5–22
5. Wahba, G.: Spline Models for Observational Data. Volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia (1990)
6. Green, P., Silverman, B.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London (1993)
7. Ramsay, J., Silverman, B.: Applied Functional Data Analysis. Springer-Verlag, New York (2002)
8. Ramsay, J., Silverman, B.: Functional Data Analysis. Springer Series in Statistics. Springer-Verlag, New York (1997)
9. MacKay, D.: A practical Bayesian framework for backpropagation networks. Neural Computation **4**(3) (1992) 448–472
10. Tipping, M.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research **1** (2001) 211–244
11. Tipping, M., Faul, A.: Fast marginal likelihood maximisation for sparse Bayesian models. In Bishop, C., Frey, B., eds.: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, January 3–6 2003, Key West, Florida (2003)
12. Figueiredo, M.: Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence **25**(9) (2003) 1150–1159
13. Engel, Y., Mannor, S., Meir, R.: The kernel recursive least squares algorithm. Technical Report ICNC03-001, Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel (2003)
14. Engel, Y., Mannor, S., Meir, R.: Sparse online greedy support vector regression. In Elomaa, T., Mannila, H., Toivonen, H., eds.: Machine Learning: ECML 2002. Volume 2430 of Lecture Notes in Computer Science. Springer-Verlag, Berlin (2002)
15. Ma, J., Theiler, J., Perkins, S.: Accurate on-line support vector regression. Neural Computation **15**(11) (2003) 2683–2704

- 3 The smooth support vector machine [4] entails a reformulation of the quadratic program of the support vector machine.
- 4 The Kernel-Machine Library is available at http://www.terborg.net/research/kml/.