
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ml/plot_kernel_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_ml_plot_kernel_regression.py>`
        to download the full example code or to run this example in your
        browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ml_plot_kernel_regression.py:


Kernel Regression
=================

In this example we compare the performance of different kernel regression
methods.

.. GENERATED FROM PYTHON SOURCE LINES 8-12

.. code-block:: Python


    # Author: Elena Petrunina
    # License: MIT








.. GENERATED FROM PYTHON SOURCE LINES 13-17

For this example, we will use the
:func:`tecator <skfda.datasets.fetch_tecator>` dataset. This data set
contains 215 samples. For each sample, the data consists of a spectrum of
absorbances and the contents of water, fat and protein.

.. GENERATED FROM PYTHON SOURCE LINES 17-24

.. code-block:: Python

    from skfda.datasets import fetch_tecator

    X_df, y_df = fetch_tecator(return_X_y=True, as_frame=True)
    # The first column holds the functional data (the absorbance spectra).
    X = X_df.iloc[:, 0].array
    fat = y_df["fat"].to_numpy()








.. GENERATED FROM PYTHON SOURCE LINES 30-33

Fat percentages will be estimated from the spectrum.
All curves are shown in the image below. The color of each curve depends on
the amount of fat, from lowest (yellow) to highest (red).

.. GENERATED FROM PYTHON SOURCE LINES 33-39

.. code-block:: Python

    import matplotlib.pyplot as plt

    X.plot(gradient_criteria=fat, legend=True)
    plt.show()




.. image-sg:: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_001.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 40-42

The data set is split into train and test sets with 80% and 20% of the
samples respectively.

.. GENERATED FROM PYTHON SOURCE LINES 42-52

.. code-block:: Python

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        fat,
        test_size=0.2,
        random_state=1,
    )








.. GENERATED FROM PYTHON SOURCE LINES 53-56

The KNN hat matrix will be tried first. We will use the default kernel
function, i.e. the uniform kernel. To find the most suitable number of
neighbours, GridSearchCV will be used, testing every value from 1 to 99.

.. GENERATED FROM PYTHON SOURCE LINES 56-70

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import GridSearchCV

    from skfda.misc.hat_matrix import KNeighborsHatMatrix
    from skfda.ml.regression import KernelRegression

    # Candidate numbers of neighbours: 1, 2, ..., 99.
    n_neighbors = np.arange(1, 100)
    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={"kernel_estimator__n_neighbors": n_neighbors},
    )








.. GENERATED FROM PYTHON SOURCE LINES 71-73

The best cross-validated performance on the train set is obtained with the
following number of neighbours.

.. GENERATED FROM PYTHON SOURCE LINES 73-80

.. code-block:: Python

    knn.fit(X_train, y_train)
    print(
        "KNN bandwidth:",
        knn.best_params_["kernel_estimator__n_neighbors"],
    )





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    KNN bandwidth: 3




.. GENERATED FROM PYTHON SOURCE LINES 81-83

The accuracy of the estimation on the test set, measured with r2_score, is
shown below.

.. GENERATED FROM PYTHON SOURCE LINES 83-91

.. code-block:: Python

    from sklearn.metrics import r2_score

    y_pred = knn.predict(X_test)
    knn_res = r2_score(y_pred, y_test)
    print("Score KNN:", knn_res)






.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Score KNN: 0.3500795818805428




.. GENERATED FROM PYTHON SOURCE LINES 92-94

Following a similar procedure for Nadaraya-Watson, the optimal bandwidth is
chosen from 100 logarithmically spaced values in the interval [0.01, 1].

.. GENERATED FROM PYTHON SOURCE LINES 94-103

.. code-block:: Python

    from skfda.misc.hat_matrix import NadarayaWatsonHatMatrix

    # Candidate bandwidths: 100 log-spaced values between 10**-2 and 10**0.
    bandwidth = np.logspace(-2, 0, num=100)
    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )








.. GENERATED FROM PYTHON SOURCE LINES 104-105

The best performance is obtained with the following bandwidth.

.. GENERATED FROM PYTHON SOURCE LINES 105-112

.. code-block:: Python

    nw.fit(X_train, y_train)
    print(
        "Nadaraya-Watson bandwidth:",
        nw.best_params_["kernel_estimator__bandwidth"],
    )





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Nadaraya-Watson bandwidth: 0.37649358067924693




.. GENERATED FROM PYTHON SOURCE LINES 113-115

The accuracy of the estimation is shown below and should be similar to that
obtained with the KNN method.

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. code-block:: Python

    y_pred = nw.predict(X_test)
    nw_res = r2_score(y_pred, y_test)
    print("Score NW:", nw_res)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Score NW: 0.3127155617541538




.. GENERATED FROM PYTHON SOURCE LINES 121-128

For Local Linear Regression, the data must be given in a basis
representation (FDataBasis), whereas the previous methods accept either
FDataGrid or FDataBasis. Here a Fourier basis with 10 elements has been
selected. Note that the number of functions in the basis affects the
estimation result and should ideally also be chosen using cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 128-150

.. code-block:: Python

    from skfda.misc.hat_matrix import LocalLinearRegressionHatMatrix
    from skfda.representation.basis import FourierBasis

    fourier = FourierBasis(n_basis=10)

    X_basis = X.to_basis(basis=fourier)
    # Same test_size and random_state as before, so the split is unchanged.
    X_basis_train, X_basis_test, y_train, y_test = train_test_split(
        X_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )


    bandwidth = np.logspace(0.3, 1, num=100)

    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )








.. GENERATED FROM PYTHON SOURCE LINES 151-152

The bandwidth obtained by cross-validation is indicated below.

.. GENERATED FROM PYTHON SOURCE LINES 152-158

.. code-block:: Python

    llr.fit(X_basis_train, y_train)
    print(
        "LLR bandwidth:",
        llr.best_params_["kernel_estimator__bandwidth"],
    )





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LLR bandwidth: 4.728762199830451




.. GENERATED FROM PYTHON SOURCE LINES 159-161

Although it is a slower method, the result obtained in this example should
be better than in the case of Nadaraya-Watson and KNN.

.. GENERATED FROM PYTHON SOURCE LINES 161-166

.. code-block:: Python

    y_pred = llr.predict(X_basis_test)
    llr_res = r2_score(y_pred, y_test)
    print("Score LLR:", llr_res)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Score LLR: 0.9731955244187162




.. GENERATED FROM PYTHON SOURCE LINES 167-171

For this data set, using the derivative of the curves should give better
performance.

The plot of all the derivatives is shown below. The same color scheme as
before is followed: yellow means less fat, red means more.

.. GENERATED FROM PYTHON SOURCE LINES 171-183

.. code-block:: Python

    Xd = X.derivative()
    Xd.plot(gradient_criteria=fat, legend=True)
    plt.show()

    Xd_train, Xd_test, y_train, y_test = train_test_split(
        Xd,
        fat,
        test_size=0.2,
        random_state=1,
    )




.. image-sg:: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_002.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 184-186

Exactly the same operations are repeated, but now with the derivatives of the
functions.

.. GENERATED FROM PYTHON SOURCE LINES 188-189

K-Nearest Neighbours

.. GENERATED FROM PYTHON SOURCE LINES 189-205

.. code-block:: Python

    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={"kernel_estimator__n_neighbors": n_neighbors},
    )

    knn.fit(Xd_train, y_train)
    print(
        "KNN bandwidth:",
        knn.best_params_["kernel_estimator__n_neighbors"],
    )

    y_pred = knn.predict(Xd_test)
    dknn_res = r2_score(y_pred, y_test)
    print("Score KNN:", dknn_res)






.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    KNN bandwidth: 4
    Score KNN: 0.9428247359478524




.. GENERATED FROM PYTHON SOURCE LINES 206-207

Nadaraya-Watson

.. GENERATED FROM PYTHON SOURCE LINES 207-223

.. code-block:: Python

    # The derivatives take smaller values, so smaller bandwidths are searched.
    bandwidth = np.logspace(-3, -1, num=100)
    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )

    nw.fit(Xd_train, y_train)
    print(
        "Nadaraya-Watson bandwidth:",
        nw.best_params_["kernel_estimator__bandwidth"],
    )

    y_pred = nw.predict(Xd_test)
    dnw_res = r2_score(y_pred, y_test)
    print("Score NW:", dnw_res)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Nadaraya-Watson bandwidth: 0.006135907273413175
    Score NW: 0.9491787548158307




.. GENERATED FROM PYTHON SOURCE LINES 224-226

For both Nadaraya-Watson and KNN the accuracy has improved significantly and
is now higher than 0.9.

.. GENERATED FROM PYTHON SOURCE LINES 228-229

Local Linear Regression

.. GENERATED FROM PYTHON SOURCE LINES 229-253

.. code-block:: Python

    Xd_basis = Xd.to_basis(basis=fourier)
    Xd_basis_train, Xd_basis_test, y_train, y_test = train_test_split(
        Xd_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )

    bandwidth = np.logspace(-2, 1, num=100)
    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )

    llr.fit(Xd_basis_train, y_train)
    print(
        "LLR bandwidth:",
        llr.best_params_["kernel_estimator__bandwidth"],
    )

    y_pred = llr.predict(Xd_basis_test)
    dllr_res = r2_score(y_pred, y_test)
    print("Score LLR:", dllr_res)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LLR bandwidth: 0.010722672220103232
    Score LLR: 0.9949460304758446




.. GENERATED FROM PYTHON SOURCE LINES 254-256

LLR accuracy has also improved, but with derivatives its advantage over
Nadaraya-Watson and KNN is much smaller than in the previous case.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.265 seconds)


.. _sphx_glr_download_auto_examples_ml_plot_kernel_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/GAA-UAM/scikit-fda/develop?filepath=examples/ml/plot_kernel_regression.py
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_kernel_regression.ipynb <plot_kernel_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_kernel_regression.py <plot_kernel_regression.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_kernel_regression.zip <plot_kernel_regression.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_

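
As a rough illustration of what the bandwidth tuned in this example controls,
the following is a minimal sketch of a Nadaraya-Watson estimate for scalar
inputs. It assumes a Gaussian kernel and the absolute difference as distance.
The helper ``nadaraya_watson_predict`` and the toy data are hypothetical and
are not part of scikit-fda, whose ``NadarayaWatsonHatMatrix`` works on
functional observations.

.. code-block:: Python

    import math


    def nadaraya_watson_predict(x_train, y_train, x_new, bandwidth):
        """Return the kernel-weighted average of y_train at x_new.

        Hypothetical helper: Gaussian kernel on scalar inputs.
        """
        # Each training point is weighted by how close it is to x_new,
        # with the distance measured in units of the bandwidth.
        weights = [
            math.exp(-0.5 * ((x_new - x_i) / bandwidth) ** 2)
            for x_i in x_train
        ]
        total = sum(weights)
        return sum(w * y_i for w, y_i in zip(weights, y_train)) / total


    # Toy data (assumption, for illustration only).
    x_train = [0.0, 1.0, 2.0, 3.0]
    y_train = [0.0, 1.0, 4.0, 9.0]

    # Small bandwidth: only the nearest training point matters.
    small_h = nadaraya_watson_predict(x_train, y_train, 1.0, bandwidth=0.1)
    # Large bandwidth: all training points get almost the same weight.
    large_h = nadaraya_watson_predict(x_train, y_train, 1.0, bandwidth=100.0)

    print("Small bandwidth prediction:", small_h)
    print("Large bandwidth prediction:", large_h)

With a small bandwidth the prediction follows the response of the nearest
training point, while a very large bandwidth approaches the plain mean of all
responses. Choosing a value between these extremes is what the grid search
over the bandwidth does by cross-validation.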