
Smooth Bayesian Kernel Machines

Rutger W. ter Borg1 and Léon J.M. Rothkrantz2

1Nuon NV, Applied Research & Technology

Spaklerweg 20, 1096 BA Amsterdam, the Netherlands


2Delft University of Technology

Mekelweg 4, 2628 CD Delft, the Netherlands


Abstract. In this paper, we consider the possibility of obtaining a kernel machine that is sparse in feature space and smooth in output space. Smoothness in output space means that the underlying function is assumed to have continuous derivatives up to some order. Smoothness is achieved by applying a roughness penalty, a concept from the area of functional data analysis. Sparseness is obtained through automatic relevance determination. Both are combined in a Bayesian model, which has been implemented and tested. Test results are presented in the paper.

1 Introduction

In tasks such as time series modelling and control engineering, the primary interest is often not in representing the observed data per se, but in how, and how rapidly, a system responds to certain events, i.e. in the behaviour of the derivatives of the functions describing that system.

Kernel machines have become a popular tool for modelling measured data. Having emerged from the combination of several disciplines, they have in common that they combine the kernel trick [1] and the principle of parsimony [2, 3]. The latter is brought about by obtaining a sparse model that utilises a small subset of the data to represent a function. However, as yet no special attention has been paid to the smoothness of that function³, i.e., whether it has continuous derivatives up to some order. In case of regression, we want the resulting function to be smooth, and in case of a classification problem, the resulting decision boundary should be smooth.

The remainder of this paper is organised as follows. In section 2, we introduce derivative kernels and kernel roughness penalties, and we note that roughness penalties applied through penalised regularisation do not lead to sparseness. In section 3, we introduce a novel Bayesian prior, its related model, and update equations. Section 4 shows experimental results, and section 5 concludes the paper.

2 Smooth Functional Representations

In supervised learning, we consider a data set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ containing $N$ input-output pairs $(\mathbf{x}_i, y_i)$, with $\mathbf{x}_i$ typically a multidimensional vector in $\mathbb{R}^d$, and $y_i$ representing either a class in case of classification, or a scalar in $\mathbb{R}$ in case of regression. Wahba [5] shows that we can represent our data $\mathcal{D}$ using a linear model of the form

(1)
$$f(\mathbf{x}) = b + \sum_{i=1}^{N} w_i\, k(\mathbf{x}, \mathbf{x}_i)$$

with bias $b$, parameters $\mathbf{w} = (w_1, \ldots, w_N)^T$, and a kernel function $k(\cdot,\cdot)$ which, under certain conditions, defines an inner product in feature space, $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$.
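
To make the notation concrete, here is a minimal NumPy sketch of evaluating the expansion in (1) with a Gaussian kernel; the helper names and the choice of kernel are ours, for illustration only.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d = np.atleast_1d(x) - np.atleast_1d(x_prime)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def predict(x, X_train, w, b, sigma=1.0):
    """Evaluate f(x) = b + sum_i w_i k(x, x_i), the linear model in (1)."""
    return b + sum(w_i * gaussian_kernel(x, x_i, sigma)
                   for w_i, x_i in zip(w, X_train))
```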

In the field of functional data analysis, smoothness is favoured explicitly by applying a roughness penalty [6], a penalty on the degree of curvature of one or more derivatives of $f$. Fortunately, as $f$ in (1) is linear in the kernel functions, we can get to its derivatives by establishing the derivatives of the kernel functions $k(\mathbf{x}, \mathbf{x}_i)$.

2.1 Derivative Kernels

For the $n$-th power of a vector $\mathbf{x}$, we define $\mathbf{x}^n = \|\mathbf{x}\|^n$ for $n$ even, and $\mathbf{x}^n = \mathbf{x}\,\|\mathbf{x}\|^{n-1}$ for $n$ odd. Using this definition, the dimension of the $n$-th order derivative of a kernel $k^{(n)}(\mathbf{x}, \mathbf{x}')$ is either $1$ for $n$ even, or $d$ for $n$ odd, where $d$ is the dimension of the underlying vectors $\mathbf{x}$ and $\mathbf{x}'$. Because a kernel is a mapping $\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$, derivative kernels exist for $n$ even or for $d = 1$ (or both). We discuss derivatives of the commonly used Gaussian kernel and derivatives of the polynomial kernel.

Derivatives of the Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2)\right)$ are identified by Rodrigues' formula for Hermite polynomials

(2)
$$H_n(x) = (-1)^n\, e^{x^2}\, \frac{d^n}{dx^n}\, e^{-x^2}$$

By substituting $(\mathbf{x} - \mathbf{x}')/(\sqrt{2}\,\sigma)$ for $x$ in (2), and keeping track of the additional factors of $1/(\sqrt{2}\,\sigma)$ introduced by the chain rule, we arrive at the convenient compact form of the derivatives of Gaussian kernels

$$k^{(n)}(\mathbf{x}, \mathbf{x}') = \left(\frac{-1}{\sqrt{2}\,\sigma}\right)^{\!n} H_n\!\left(\frac{\mathbf{x} - \mathbf{x}'}{\sqrt{2}\,\sigma}\right) \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$
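
A small sketch of this derivative kernel in the one-dimensional case, assuming the compact Hermite form given above (our reconstruction of the formula); it uses NumPy's physicists' Hermite polynomials and checks the first derivative against a finite difference.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def gaussian_kernel_derivative(x, x_prime, n, sigma=1.0):
    """n-th derivative w.r.t. x (scalar inputs) of the Gaussian kernel, written
    as (-1/(sqrt(2) sigma))^n H_n(u) exp(-u^2) with u = (x - x')/(sqrt(2) sigma)."""
    u = (x - x_prime) / (np.sqrt(2.0) * sigma)
    hermite_n = hermval(u, [0.0] * n + [1.0])        # evaluates H_n(u)
    return (-1.0 / (np.sqrt(2.0) * sigma)) ** n * hermite_n * np.exp(-u ** 2)

# quick finite-difference check of the first derivative
h, x, xp, s = 1e-5, 0.7, -0.2, 1.3
fd = (np.exp(-((x + h - xp) ** 2) / (2 * s ** 2))
      - np.exp(-((x - h - xp) ** 2) / (2 * s ** 2))) / (2 * h)
assert np.isclose(fd, gaussian_kernel_derivative(x, xp, 1, s), atol=1e-6)
```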

Derivatives of the polynomial kernel $k(\mathbf{x}, \mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + c\right)^p$ are found to be

$$k^{(n)}(\mathbf{x}, \mathbf{x}') = \frac{p!}{(p-n)!}\; \mathbf{x}'^{\,n} \left(\mathbf{x}^T\mathbf{x}' + c\right)^{p-n}$$

which is valid in this instantiation as long as $n \le p$.
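
For completeness, a scalar-input sketch of this polynomial derivative kernel under the same assumptions; the degree and offset used here are illustrative values, not taken from the paper.

```python
from math import factorial

def poly_kernel_derivative(x, x_prime, n, degree=3, c=1.0):
    """n-th derivative w.r.t. x (scalar inputs) of the polynomial kernel
    k(x, x') = (x * x' + c)^degree; valid only while n <= degree."""
    if n > degree:
        raise ValueError("derivative order exceeds the polynomial degree")
    falling = factorial(degree) // factorial(degree - n)   # p (p-1) ... (p-n+1)
    return falling * x_prime ** n * (x * x_prime + c) ** (degree - n)
```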

2.2 Kernel Roughness Penalties

We use a kernel matrix $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ for inputs $\mathbf{x}_1, \ldots, \mathbf{x}_N$, and a design matrix $\mathbf{\Phi} = [\mathbf{1}\;\;\mathbf{K}]$ that consists of the combination of a column of ones (for the bias) and a kernel matrix. For example, in a regularisation setting, applying a roughness penalty can be accomplished by penalising the summed curvature of the second order derivatives

(3)
$$\min_{\mathbf{w}}\; \|\mathbf{y} - \mathbf{\Phi}\mathbf{w}\|^2 + \lambda\, \|\mathbf{\Phi}^{(2)}\mathbf{w}\|^2$$

with $\mathbf{\Phi}^{(2)}$ being a design matrix of a second order derivative kernel. For $\lambda \to 0$, this imposes an ordinary least squares fit, and for $\lambda \to \infty$ the result will approximate a straight line.
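
The following sketch spells out such a second-order roughness penalty for one-dimensional inputs: it assembles the design matrix, a second-derivative design matrix, and solves the penalised least-squares problem of (3). The concrete matrix layout (a leading column of ones, with zero curvature for the bias) is our modelling choice, not prescribed by the paper.

```python
import numpy as np

def gauss_k(X, Z, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-(X[i]-Z[j])^2 / (2 sigma^2)), 1-D inputs."""
    D = X[:, None] - Z[None, :]
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

def gauss_k2(X, Z, sigma):
    """Second derivative of the Gaussian kernel w.r.t. its first argument."""
    D = X[:, None] - Z[None, :]
    return (D ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * np.exp(-D ** 2 / (2.0 * sigma ** 2))

def penalised_fit(X, y, lam, sigma=1.0):
    """Solve min_w ||y - Phi w||^2 + lam ||Phi2 w||^2, the roughness-penalised fit of (3)."""
    ones = np.ones((len(X), 1))
    Phi  = np.hstack([ones, gauss_k(X, X, sigma)])                    # design matrix [1  K]
    Phi2 = np.hstack([np.zeros((len(X), 1)), gauss_k2(X, X, sigma)])  # bias has zero curvature
    A = Phi.T @ Phi + lam * (Phi2.T @ Phi2)
    return np.linalg.solve(A, Phi.T @ y)
```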

Besides a penalty on the second order derivative as shown in (3), more generally, a linear differential operator

(4)
$$L = \sum_{m=0}^{M} \beta_m D^m$$

can be defined [7], with $D^m$ denoting the $m$-th derivative and $\beta_m \ge 0$ its weight, and used to form a weighted sum of penalties on derivatives of $f$. If applied in penalised regularisation, the roughness penalty defined by $L$ takes the shape

(5)
$$\min_{\mathbf{w}}\; \|\mathbf{y} - \mathbf{\Phi}\mathbf{w}\|^2 + \lambda\, \mathbf{w}^T\mathbf{P}\,\mathbf{w}, \qquad \mathbf{P} = \sum_{m=0}^{M} \beta_m\, \mathbf{\Phi}^{(m)T}\mathbf{\Phi}^{(m)},$$

which is solved by $\mathbf{w} = \left(\mathbf{\Phi}^T\mathbf{\Phi} + \lambda\mathbf{P}\right)^{-1}\mathbf{\Phi}^T\mathbf{y}$. Even under otherwise reasonable conditions on the matrices $\mathbf{\Phi}$ and $\mathbf{P}$, in practice this system of equations already becomes singular in the case of duplicate inputs. In the field of functional data analysis, this is addressed by pre-processing, such as starting with the dominant frequencies [8]. Also, the selection of the parameter $\lambda$ is done manually, by methods such as cross-validation.
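
To make the singularity remark concrete, the short check below (self-contained, with a pure second-derivative penalty standing in for a general $L$) shows how a single duplicated input makes the system matrix of (5) rank deficient.

```python
import numpy as np

def system_matrix(X, lam, sigma=1.0):
    """Phi^T Phi + lam * P for a pure second-derivative penalty (a special case of (4))."""
    D = X[:, None] - X[None, :]
    K  = np.exp(-D ** 2 / (2.0 * sigma ** 2))              # kernel matrix
    K2 = (D ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * K      # its second derivative
    ones, zeros = np.ones((len(X), 1)), np.zeros((len(X), 1))
    Phi, Phi2 = np.hstack([ones, K]), np.hstack([zeros, K2])
    return Phi.T @ Phi + lam * (Phi2.T @ Phi2)

X = np.linspace(-3.0, 3.0, 20)
X_dup = np.append(X, X[7])                                 # one duplicated input
print(np.linalg.cond(system_matrix(X, lam=0.1)))           # large but finite
print(np.linalg.cond(system_matrix(X_dup, lam=0.1)))       # blows up: the system is singular
```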

Although the resulting functions are smooth, regularisation as in (5) does not promote sparseness. We would like to have a representation that is both smooth and sparse, and in addition, an automatic choice of the parameter $\lambda$.

3 Smooth Relevance Vector Machine

We combine the idea of regularisation by a roughness penalty with a Bayesian sparseness-inducing prior. We presume that the outputs are corrupted by Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. To promote both smoothness and sparseness, we propose a zero-mean Gaussian prior over the weights, with a covariance matrix built from two terms

(6)
$$p(\mathbf{w} \mid \boldsymbol{\alpha}, \lambda) = \mathcal{N}\!\left(\mathbf{w} \,\middle|\, \mathbf{0},\, \left(\lambda\mathbf{P} + \mathbf{A}\right)^{-1}\right)$$

The first term, $\lambda\mathbf{P}$, is the smoothness promoting part, formed by a roughness penalty matrix defined by a linear differential operator $L$ as in (4). The second term is a diagonal matrix $\mathbf{A} = \operatorname{diag}(\alpha_1, \ldots, \alpha_N)$ containing one hyper-parameter $\alpha_i$ per weight $w_i$, as in automatic relevance determination (ARD) [9], as applied in the relevance vector machine (RVM) [10].

By applying Bayes' rule we obtain the posterior distribution over the weights

$$p(\mathbf{w} \mid \mathbf{y}, \boldsymbol{\alpha}, \lambda, \sigma^2) = \mathcal{N}\!\left(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)$$

with $\boldsymbol{\Sigma} = \left(\sigma^{-2}\mathbf{\Phi}^T\mathbf{\Phi} + \lambda\mathbf{P} + \mathbf{A}\right)^{-1}$, $\boldsymbol{\mu} = \sigma^{-2}\,\boldsymbol{\Sigma}\mathbf{\Phi}^T\mathbf{y}$ and $\mathbf{A} = \operatorname{diag}(\alpha_1, \ldots, \alpha_N)$.
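
A compact sketch of these posterior moments, assuming the precision form $\lambda\mathbf{P} + \mathbf{A}$ reconstructed above; argument names and shapes are illustrative.

```python
import numpy as np

def posterior_moments(Phi, y, P, alpha, lam, sigma2):
    """Posterior of the weights under the assumed prior precision lam * P + diag(alpha):
    Sigma = (Phi^T Phi / sigma2 + lam P + diag(alpha))^-1,  mu = Sigma Phi^T y / sigma2."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + lam * P + np.diag(alpha))
    mu = Sigma @ Phi.T @ y / sigma2
    return mu, Sigma
```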

3.1 Update Rules

To obtain (locally) optimal values for $\boldsymbol{\alpha}$, $\lambda$ and $\sigma^2$, we will follow a type-II maximum likelihood approach similar to that of the RVM [10]. The update equations for the automatic relevance determination hyper-parameters $\alpha_i$ are given by

(7)
$$\alpha_i^{\text{new}} = \frac{\alpha_i \left[\left(\left(\lambda\mathbf{P} + \mathbf{A}\right)^{-1}\right)_{ii} - \Sigma_{ii}\right]}{\mu_i^2}$$

Note that in the case of $\lambda = 0$, $\left(\left(\lambda\mathbf{P} + \mathbf{A}\right)^{-1}\right)_{ii} = 1/\alpha_i$, making (7) identical to the update equation of the original RVM. The roughness-penalty parameter $\lambda$ is updated by

(8)
$$\lambda^{\text{new}} = \frac{\lambda \left[\operatorname{tr}\!\left(\left(\lambda\mathbf{P} + \mathbf{A}\right)^{-1}\mathbf{P}\right) - \operatorname{tr}\!\left(\boldsymbol{\Sigma}\mathbf{P}\right)\right]}{\boldsymbol{\mu}^T\mathbf{P}\boldsymbol{\mu}}$$

and the variance estimate by

(9)
$$\left(\sigma^2\right)^{\text{new}} = \frac{\|\mathbf{y} - \mathbf{\Phi}\boldsymbol{\mu}\|^2}{N - M + \operatorname{tr}\!\left(\boldsymbol{\Sigma}\left(\lambda\mathbf{P} + \mathbf{A}\right)\right)}$$

In practice, the quantities needed in (7)–(9) come at little extra cost: the diagonal elements of $\left(\lambda\mathbf{P} + \mathbf{A}\right)^{-1}$ and $\boldsymbol{\Sigma}$ are already available from evaluating (7), and the remaining traces in (8) and (9) follow from standard trace identities involving $\boldsymbol{\Sigma}$, $\mathbf{\Phi}$ and $\lambda\mathbf{P} + \mathbf{A}$. Here $M$ denotes the number of weights.
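
Putting the pieces together, the following sketch iterates the re-estimation rules in the form given above. Since the exact expressions in (7)–(9) are our reconstruction, this is a sketch of the general type-II maximum likelihood scheme rather than a faithful reproduction of the paper's implementation.

```python
import numpy as np

def srvm_reestimate(Phi, y, P, alpha, lam, sigma2, n_iter=100):
    """Type-II maximum likelihood loop with the update forms sketched in (7)-(9).
    These forms are our reconstruction; the paper's exact expressions may differ."""
    N, M = Phi.shape
    for _ in range(n_iter):
        B = lam * P + np.diag(alpha)                       # prior precision
        B_inv = np.linalg.inv(B)
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + B)    # posterior covariance
        mu = Sigma @ Phi.T @ y / sigma2                    # posterior mean
        # (7): ARD hyper-parameters; reduces to the RVM rule when lam == 0
        alpha = alpha * (np.diag(B_inv) - np.diag(Sigma)) / (mu ** 2 + 1e-12)
        # (8): roughness-penalty parameter
        lam = lam * (np.trace(B_inv @ P) - np.trace(Sigma @ P)) / (mu @ P @ mu + 1e-12)
        # (9): noise variance
        resid = y - Phi @ mu
        sigma2 = (resid @ resid) / (N - M + np.trace(Sigma @ B))
    return mu, Sigma, alpha, lam, sigma2
```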

4 Experimental Results

To measure the effectiveness of the proposed Bayesian model, we have performed a series of experiments. Because the derivatives of the underlying function must be available in order to measure and compare performance, we have chosen the well-known sinc benchmark.

The data for one experiment consist of 50 points, with inputs $x_i$ uniformly distributed over a fixed interval, and targets $y_i = \operatorname{sinc}(x_i) + \epsilon_i$ with additive Gaussian noise. We trained a series of kernel machines on these data with our implementation⁴. For our method, we have chosen to penalise the 10th order derivative only, which should enforce smoothness up to the 8th order [8]. A Gaussian kernel with the same fixed width $\sigma$ is used by all methods. We repeated this experiment 1000 times. Table 1 shows the mean and standard deviation of the results: the number of basis vectors, and the root-mean-square error of the zeroth, second and fourth order derivatives.
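
A data-generation sketch for one such experiment; the input interval and the noise level below are assumptions on our part, since the exact values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-10.0, 10.0, size=N)     # interval assumed, not taken from the paper
noise = rng.normal(0.0, 0.1, size=N)     # noise level assumed as well
y = np.sinc(x / np.pi) + noise           # np.sinc(t) = sin(pi t)/(pi t), so this is sin(x)/x
```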

Table 1: Comparative performance (mean ± standard deviation over 1000 experiments).

Method              BVs         RMSE (0th deriv.)  RMSE (2nd deriv.)  RMSE (4th deriv.)
RVM (old) [10]      6.5±1.0     0.047±0.010        0.054±0.014        0.106±0.030
RVM (new) [11]      6.0±1.1     0.049±0.010        0.056±0.013        0.108±0.026
Figueiredo [12]     6.3±1.3     0.052±0.012        0.067±0.022        0.136±0.051
KRLS [13]           17.0±0.0    0.054±0.012        0.077±0.017        0.242±0.066
SOG-SVR [14]        17.0±0.0    0.062±0.010        0.050±0.008        0.062±0.012
AO-SVR [15]         20.7±3.2    0.057±0.011        0.112±0.029        0.374±0.115
SRVM [this paper]   16.3±3.4    0.039±0.009        0.030±0.009        0.033±0.013

As shown in Table 1, smoothness requires notably more basis vectors (BVs) than common sparse Bayesian models. On the other hand, the basis vectors seem to be spent wisely, because the error on all derivative orders is consistently lower than that of the compared methods. Figure 1 illustrates a function that is smooth in output space. Although at first sight the fit of the function itself is not dramatically different from the others (left), the gain in quality of fit is clearly visible at a higher order derivative (right).


Fig. 1: Input data to a single experiment, the underlying sinc function, and the resulting RVM, SRVM, and AO-SVR fits (left). The 4th order derivative of the sinc function, RVM, SRVM, and AO-SVR (right).

5 Conclusions

In this paper we introduced a concept from functional data analysis to promote smoothness in output space. We established derivative kernels, kernel roughness penalties, and a Bayesian model that combines kernel roughness penalties with automatic relevance determination. Experiments showed that smoothness in output space increased the quality of fit on all measured derivative orders. The required number of basis vectors is not the lowest, but the quality of the approximation to the underlying function is empirically better.

Further studies should include the automatic selection of the several weights $\beta_m$ in (4), i.e., a penalty on more than one derivative order. An interesting open theoretical issue is the maximum derivative order at which a signal may be expected, given a data set.

References

1. Aizerman, M., Braverman, E., Rozonoèr, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25 (1964) 821–837
2. Schölkopf, B., Smola, A.: Learning with Kernels. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, Massachusetts, USA (2002)
3. Vapnik, V.: The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer-Verlag, New York (1995)
4. Lee, Y., Mangasarian, O.: SSVM: A smooth support vector machine for classification. Computational Optimization and Applications 20(1) (2001) 5–22
5. Wahba, G.: Spline Models for Observational Data. Volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia (1990)
6. Green, P., Silverman, B.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London (1993)
7. Ramsay, J., Silverman, B.: Applied Functional Data Analysis. Springer-Verlag, New York (2002)
8. Ramsay, J., Silverman, B.: Functional Data Analysis. Springer Series in Statistics. Springer-Verlag, New York (1997)
9. MacKay, D.: A practical Bayesian framework for backpropagation networks. Neural Computation 4(3) (1992) 448–472
10. Tipping, M.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1 (2001) 211–244
11. Tipping, M., Faul, A.: Fast marginal likelihood maximisation for sparse Bayesian models. In Bishop, C., Frey, B., eds.: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, January 3–6, 2003, Key West, Florida (2003)
12. Figueiredo, M.: Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9) (2003) 1150–1159
13. Engel, Y., Mannor, S., Meir, R.: The kernel recursive least squares algorithm. Technical report ICNC03 001, Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel (2003)
14. Engel, Y., Mannor, S., Meir, R.: Sparse online greedy support vector regression. In Elomaa, T., Mannila, H., Toivonen, H., eds.: Machine Learning: ECML 2002. Volume 2430 of Lecture Notes in Computer Science. Springer-Verlag, Berlin (2002)
15. Ma, J., Theiler, J., Perkins, S.: Accurate on-line support vector regression. Neural Computation 15(11) (2003) 2683–2704

³ The smooth support vector machine [4] entails a reformulation of the quadratic program of the support vector machine.
⁴ The Kernel-Machine Library is available at http://www.terborg.net/research/kml/.