In tasks such as time series modelling and system control engineering, representing observed data per se does not draw the primary interest, but rather how and how rapidly a system responds to certain events, i.e. the behaviour of derivatives of functions describing such a system.
Kernel machines have become a popular tool to model measured data with. Emerged by the combination of several disciplines, they have in common that they combine the kernel trick [1] and the principle of parsimony [2, 3]. The latter is brought forth by obtaining a sparse model that utilises a small subset of the data to represent a function with. However, as yet no special attention has been paid to the smoothness of that function3, i.e., it should have continuous derivatives up to some order. In case of regression, we want the resulting function to be smooth, and in case of a classification problem, the resulting decision boundary should be smooth.
The remainder of this paper is organised as follows. In section 2, we introduce derivative kernels, and kernel roughness penalties. We note that through penalised regularisation, roughness penalties do not lead to sparseness. In section 3, we introduce a novel Bayesian prior, its related model, and update equations. Section 4 shows experimental results, and section 5 concludes the paper.
In supervised learning, we consider a data set
containing
input-output pairs
, with
typically containing multidimensional vectors in
, and
representing either classes in case of classification, or scalars in
in case of regression. Wahba [5] shows that we can represent our data
using a linear model of the form
with bias
, and parameters
, and a kernel function
which, under certain conditions, defines an inner product in feature space,
.
In the field of functional data analysis, smoothness is favoured explicitly by applying a roughness penalty [6], a penalty on the degree of curvature of one or more derivatives of
. Fortunately, as
in (1) is linear, we can get to its derivatives by establishing the derivatives of the kernel functions
.
For the
-th power of a vector
, we define
for
even, and
for
odd. Using this definition, the dimension of the
-th order derivative of a kernel
is either
for
is even, or
for
being an odd number, where
is the dimension of underlying vectors
and
. Because a kernel is a mapping
, derivative kernels exist for
even or for
(or both). We discuss derivatives of the commonly used Gaussian kernel and derivatives of the polynomial kernel.
Derivatives of the Gaussian kernel
are identified by Rodrigues’ formula for Hermite polynomials
By substituting
for
in (2), and keeping track of additional terms of
, we arrive at the convenient compact form of the derivatives of Gaussian kernels

Derivatives of the polynomial kernel
are found to be

which is valid in this instantiation as long as
.
We use a kernel matrix
with entries
for inputs
, and a design matrix
that consists of the combination of
and a kernel matrix. For example, in a regularisation setting, applying a roughness penalty can be accomplished by penalising the summed curvature of the second order derivatives
with
being a design matrix of a second order derivative kernel. For
, this imposes an ordinary least squares, and for
the result will approximate a straight line. Besides a penalty on the second order derivative as shown in (3), more generally, a linear differential operator
can be defined [7], and used to form a weighted sum of penalties on derivatives of
. If applied in penalised regularisation,
takes the shape
which is solved by
. Despite reasonable conditions of matrices
and
, in practice, this system of equations is already singular in case of duplicate inputs. In the field of functional data analysis, this is addressed by pre-processing, such as starting with the dominant frequencies [8]. Also, the selection of parameter
is done manually by methods such as cross-validation.
Although the resulting functions are smooth, regularisation as in (5) does not promote sparseness. We would like to have a representation that is both smooth and sparse, and in addition, an automatic choice of parameter
.
We combine the idea of regularisation by a roughness penalty with a Bayesian sparseness inducing prior. We presume that the outputs are corrupted by Gaussian noise
. To promote both smoothness and sparseness, we propose a zero-mean Gaussian prior over the weights, and a covariance matrix consisting of two terms
The first term
is the smoothness promoting part, formed by a roughness penalty matrix defined by a linear differential operator
as in (4). The second term is a diagonal matrix
containing one hyper-parameter
per weight
, as in automatic relevance determination (ARD) [9], as applied in the relevance vector machine (RVM) [10].
By applying Bayes’ rule we obtain the posterior distributions

with
,
and
.
To obtain (locally) optimal values for
,
and
, we will follow a type-II maximum likelihood approach similar to that of the RVM [10]. The update equations for the automatic relevance determination hyper-parameters
are given by
Note that in the case of
then
, making (7) identical to the update equation of the original RVM. The roughness-penalty parameter
is updated by
and the variance estimate by
In practice, we obtain traces of
and
by evaluating (7). We fully compute
, and obtain the other trace in (8) by
, and that of (9) by using
.
To measure the effectiveness of the proposed Bayesian model, we have performed a series of experiments. Because we need availability of the derivatives of the underlying function in order to be able to measure and compare performance, we have chosen the well-known sinc benchmark.
The data for one experiment consist of 50 points, with
uniformly distributed over
, and
. We trained a series of kernel machines using these data with our implementation4. For our method, we have chosen to penalise the 10th order derivative only, which should effectuate smoothness up to the 8th order [8]. A Gaussian kernel is used by all methods, parametrised by
. We repeated this experiment 1000 times. Table 4 shows the mean and standard deviation of the results, subsequently of the number of basis vectors, and of the root-mean-square of errors of zeroth, second and fourth order derivatives.
Table 1: Comparative performance.
| Method | BVs | ![]() |
![]() |
![]() |
| RVM (old) [10] | 6.5±1.0 | 0.047±0.010 | 0.054±0.014 | 0.106±0.030 |
| RVM (new) [11] | 6.0±1.1 | 0.049±0.010 | 0.056±0.013 | 0.108±0.026 |
| Figueiredo [12] | 6.3±1.3 | 0.052±0.012 | 0.067±0.022 | 0.136±0.051 |
| KRLS [13] | 17.0±0.0 | 0.054±0.012 | 0.077±0.017 | 0.242±0.066 |
| SOG-SVR [14] | 17.0±0.0 | 0.062±0.010 | 0.050±0.008 | 0.062±0.012 |
| AO-SVR [15] | 20.7±3.2 | 0.057±0.011 | 0.112±0.029 | 0.374±0.115 |
| SRVM [this paper] | 16.3±3.4 | 0.039±0.009 | 0.030±0.009 | 0.033±0.013 |
As shown in table 4, smoothness requires notably more basis vectors (BVs) than common sparse Bayesian models. On the other hand, the basis vectors seem to be spent wisely, because the error on all derivative orders is consistently lower than that of the compared methods. Figure 4 illustrates a function that is smooth in output space. Although at first sight the fit of the function itself is not dramatically different from others (left), the gain in quality of fit is clearly visible at a higher order derivative (right).

Fig. 1: Input data to a single experiment, the underlying
function, and the resulting RVM, SRVM, and AO-SVR (left). The 4th order derivative of
, RVM, SRVM, and AO-SVR (right).
In this paper we introduced a concept from functional data analysis to promote smoothness in output space. We successfully established derivative kernels, kernel roughness penalties, and a Bayesian model to combine kernel roughness penalties with automatic relevance determination. Experiments elicited that smoothness in output space increased the quality of fit on all measured derivative orders. The required number of basis vectors is not the lowest, but the quality of the approximation to the underlying function is empirically better.
Further studies should include the handling of an automatic selection of several
in (4), i.e., a penalty on more than one derivative order. An interesting open theoretical issue is the maximum derivative order on which a signal may be expected, given a data set.
Aizerman, M., Braverman, E., Rozonoèr, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25 (1964) 821–837
Schölkopf, B., Smola, A.: Learning with Kernels. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, Massachusetts, USA (2002)
Vapnik, V.: The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer-Verlag, New York (1995)
Lee, Y., Mangasarian, O.: SSVM: A smooth support vector machine for classification. Computational Optimization and Applications 20(1) (2001) 5–22
Wahba, G.: Spline models for observational data. Journal of the Royal Statistical Society. Series B 59 (1990) 133–150
Green, P., Silverman, B.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall (1993)
Ramsay, J., Silverman, B.: Applied Functional Data Analysis. Springer-Verlag (2002)
Ramsay, J., Silverman, B.: Functional Data Analysis. Springer Series in Statistics. Springer-Verlag, New York (1997)
MacKay, D.: A practical bayesian framework for backprop networks. Neural Computation 4(3) (1992) 448–472
Tipping, M.: Sparse bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1 (2001) 211–244
Tipping, M., Faul, A.: Fast marginal likelihood maximisation for sparse bayesian models. In Bishop, C., Frey, B., eds.: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, January 3–6 2003, Key West, Florida. (2003)
Figueiredo, M.: Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9) (2003) 1150–1159
Engel, Y., Mannor, S., Meir, R.: The kernel recursive least squares algorithm. ICNC03 001, Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel (2003)
Engel, Y., Mannor, S., Meir, R.: Sparse online greedy support vector regression. In Elomaa, T., Mannila, H., Toivonen, H., eds.: Machine Learning: ECML 2002. Volume 2430 of Lecture Notes in Computer Science., Berlin, Springer-Verlag (2002)
Ma, J., Theiler, J., Perkins, S.: Accurate on-line support vector regression. Neural Computation 15(11) (2003) 2683–2704