.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ml/plot_kernel_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_ml_plot_kernel_regression.py>`
        to download the full example code, or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ml_plot_kernel_regression.py:


Kernel Regression
=================

This example compares the performance of several kernel regression methods:
K-nearest neighbours, Nadaraya-Watson and local linear regression.

.. GENERATED FROM PYTHON SOURCE LINES 8-12

.. code-block:: Python


    # Author: Elena Petrunina
    # License: MIT


.. GENERATED FROM PYTHON SOURCE LINES 13-17

This example uses the
:func:`tecator <skfda.datasets.fetch_tecator>` data set, which contains
215 samples. Each sample consists of a spectrum of absorbances together with
the water, fat and protein content.

.. GENERATED FROM PYTHON SOURCE LINES 17-24

.. code-block:: Python


    from skfda.datasets import fetch_tecator

    X_df, y_df = fetch_tecator(return_X_y=True, as_frame=True)
    X = X_df.iloc[:, 0].array
    fat = y_df["fat"].to_numpy()


.. GENERATED FROM PYTHON SOURCE LINES 30-33

The fat percentage will be estimated from the spectrum.
All curves are shown in the image below. Each curve is coloured according to
its fat content, from lowest (yellow) to highest (red).

.. GENERATED FROM PYTHON SOURCE LINES 33-39

.. code-block:: Python


    import matplotlib.pyplot as plt

    X.plot(gradient_criteria=fat, legend=True)
    plt.show()


.. image-sg:: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_001.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 40-42

The data set is split into train and test sets with 80% and 20% of the
samples respectively.

.. GENERATED FROM PYTHON SOURCE LINES 42-52

.. code-block:: Python


    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        fat,
        test_size=0.2,
        random_state=1,
    )


.. GENERATED FROM PYTHON SOURCE LINES 53-56

The KNN hat matrix is tried first, with the default kernel function, i.e. the
uniform kernel. To find the most suitable number of neighbours, GridSearchCV
is used to test every value from 1 to 99.

.. GENERATED FROM PYTHON SOURCE LINES 56-70

.. code-block:: Python


    import numpy as np
    from sklearn.model_selection import GridSearchCV

    from skfda.misc.hat_matrix import KNeighborsHatMatrix
    from skfda.ml.regression import KernelRegression

    n_neighbors = np.arange(1, 100)
    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={"kernel_estimator__n_neighbors": n_neighbors},
    )
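
The uniform-kernel KNN estimate can be illustrated with a minimal NumPy sketch
on scalar toy data. It is not scikit-fda's implementation: the helper
``knn_hat_matrix`` and the toy arrays are hypothetical, and ties between
equidistant neighbours are ignored. Each row of the hat matrix puts equal
weight on the ``n_neighbors`` closest training samples, so the prediction is
the mean of their responses.

.. code-block:: Python

    import numpy as np


    def knn_hat_matrix(distances, n_neighbors):
        """Uniform KNN weights from an (n_new, n_train) distance matrix."""
        # Indices of the n_neighbors closest training samples in each row.
        nearest = np.argsort(distances, axis=1)[:, :n_neighbors]
        weights = np.zeros_like(distances, dtype=float)
        # Equal weight for each selected neighbour, zero elsewhere.
        np.put_along_axis(weights, nearest, 1.0 / n_neighbors, axis=1)
        return weights


    # Hypothetical scalar toy data (not the tecator spectra).
    x_train = np.array([0.0, 1.0, 2.0, 10.0])
    y_train = np.array([1.0, 2.0, 3.0, 40.0])
    x_new = np.array([0.9, 9.0])

    distances = np.abs(x_new[:, None] - x_train[None, :])
    hat = knn_hat_matrix(distances, n_neighbors=2)
    y_hat = hat @ y_train  # mean response of the two nearest neighbours
    print(y_hat)

In the functional setting the absolute differences would be replaced by a
distance between curves, such as the L2 distance.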


.. GENERATED FROM PYTHON SOURCE LINES 71-73

The best cross-validated performance on the train set is obtained with the
following number of neighbours:

.. GENERATED FROM PYTHON SOURCE LINES 73-80

.. code-block:: Python


    knn.fit(X_train, y_train)
    print(
        "KNN bandwidth:",
        knn.best_params_["kernel_estimator__n_neighbors"],
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    KNN bandwidth: 3


.. GENERATED FROM PYTHON SOURCE LINES 81-83

The accuracy of the estimate on the test set, measured with r2_score, is
shown below.

.. GENERATED FROM PYTHON SOURCE LINES 83-91

.. code-block:: Python


    from sklearn.metrics import r2_score

    y_pred = knn.predict(X_test)
    knn_res = r2_score(y_pred, y_test)
    print("Score KNN:", knn_res)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Score KNN: 0.3500795818805428
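
When reading these scores, note that scikit-learn documents the signature as
``r2_score(y_true, y_pred)``, while the calls in this example pass the
predictions first. The score is not symmetric, because the denominator is the
spread of the first argument around its own mean. The sketch below shows this
with a hypothetical NumPy helper ``r2`` that follows the standard formula, and
with toy values that are unrelated to the tecator data.

.. code-block:: Python

    import numpy as np


    def r2(y_true, y_pred):
        """Coefficient of determination: 1 - SS_res / SS_tot."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        # The total sum of squares uses the mean of the first argument.
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot


    # Hypothetical toy values.
    observed = [1.0, 2.0, 3.0]
    estimated = [1.0, 2.0, 4.0]

    score_true_first = r2(observed, estimated)
    score_pred_first = r2(estimated, observed)
    print(score_true_first, score_pred_first)

The two orderings give different values, so scores computed with the
arguments swapped are not directly comparable to conventionally computed ones.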


.. GENERATED FROM PYTHON SOURCE LINES 92-94

Following a similar procedure for Nadaraya-Watson, the optimal bandwidth is
chosen from 100 logarithmically spaced values between 0.01 and 1.

.. GENERATED FROM PYTHON SOURCE LINES 94-103

.. code-block:: Python


    from skfda.misc.hat_matrix import NadarayaWatsonHatMatrix

    bandwidth = np.logspace(-2, 0, num=100)
    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )
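
As a rough illustration of what the bandwidth controls, the following sketch
computes a Nadaraya-Watson estimate on scalar toy data. It assumes a Gaussian
kernel and is not scikit-fda's implementation; the helper
``nadaraya_watson_hat_matrix`` and the toy arrays are hypothetical. Each
weight is the kernel evaluated at the distance divided by the bandwidth, and
every row is normalised to sum to one.

.. code-block:: Python

    import numpy as np


    def nadaraya_watson_hat_matrix(distances, bandwidth):
        """Row-normalised Gaussian kernel weights (assumed kernel)."""
        kernel_values = np.exp(-0.5 * (distances / bandwidth) ** 2)
        return kernel_values / kernel_values.sum(axis=1, keepdims=True)


    # Hypothetical scalar toy data.
    x_train = np.array([0.0, 1.0, 2.0])
    y_train = np.array([0.0, 1.0, 2.0])
    x_new = np.array([0.2, 1.0])
    distances = np.abs(x_new[:, None] - x_train[None, :])

    # A small bandwidth follows the nearest sample closely.
    narrow = nadaraya_watson_hat_matrix(distances, bandwidth=0.1) @ y_train
    # A very large bandwidth approaches the global mean of the responses.
    wide = nadaraya_watson_hat_matrix(distances, bandwidth=1e6) @ y_train
    print(narrow, wide)

    # The grid searched in this example spans two decades of bandwidths.
    grid = np.logspace(-2, 0, num=100)
    print(grid[0], grid[-1])

This is why the bandwidth is tuned by cross-validation: too small a value
overfits to individual samples, too large a value ignores local structure.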


.. GENERATED FROM PYTHON SOURCE LINES 104-105

The best performance is obtained with the following bandwidth:

.. GENERATED FROM PYTHON SOURCE LINES 105-112

.. code-block:: Python


    nw.fit(X_train, y_train)
    print(
        "Nadaraya-Watson bandwidth:",
        nw.best_params_["kernel_estimator__bandwidth"],
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Nadaraya-Watson bandwidth: 0.37649358067924693


.. GENERATED FROM PYTHON SOURCE LINES 113-115

The accuracy of the estimation is shown below and should be similar to that
obtained with the KNN method.

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. code-block:: Python


    y_pred = nw.predict(X_test)
    nw_res = r2_score(y_pred, y_test)
    print("Score NW:", nw_res)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Score NW: 0.3127155617541538


.. GENERATED FROM PYTHON SOURCE LINES 121-128

Local Linear Regression requires the data in FDataBasis representation (the
previous methods accept either FDataGrid or FDataBasis).

A Fourier basis with 10 elements has been selected here. Note that the
number of functions in the basis affects the estimation result and should
ideally also be chosen using cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 128-150

.. code-block:: Python


    from skfda.misc.hat_matrix import LocalLinearRegressionHatMatrix
    from skfda.representation.basis import FourierBasis

    fourier = FourierBasis(n_basis=10)

    X_basis = X.to_basis(basis=fourier)
    X_basis_train, X_basis_test, y_train, y_test = train_test_split(
        X_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )


    bandwidth = np.logspace(0.3, 1, num=100)

    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )
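
The idea behind local linear regression can be sketched for a scalar
covariate: at each query point a straight line is fitted by weighted least
squares, with kernel weights centred at that point, and the fitted intercept
is the prediction. The sketch assumes a Gaussian kernel and hypothetical toy
data, and the helper ``local_linear_predict`` is not part of scikit-fda. In
the functional case the covariates would presumably be the basis coefficients,
which is an assumption made here for illustration.

.. code-block:: Python

    import numpy as np


    def local_linear_predict(x_train, y_train, x0, bandwidth):
        """Intercept of a kernel-weighted linear fit centred at x0."""
        weights = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
        # Columns: intercept and slope in the centred variable (x - x0).
        design = np.column_stack([np.ones_like(x_train), x_train - x0])
        weighted_design = design * weights[:, None]
        # Normal equations of weighted least squares.
        beta = np.linalg.solve(
            design.T @ weighted_design,
            weighted_design.T @ y_train,
        )
        return beta[0]


    # Hypothetical toy data lying exactly on the line y = 2x + 1.
    x_train = np.linspace(0.0, 1.0, 11)
    y_train = 2.0 * x_train + 1.0

    inner = local_linear_predict(x_train, y_train, x0=0.37, bandwidth=0.2)
    boundary = local_linear_predict(x_train, y_train, x0=0.0, bandwidth=0.2)

    # Kernel-weighted average (Nadaraya-Watson) at the boundary, for contrast.
    w = np.exp(-0.5 * (x_train / 0.2) ** 2)
    nw_boundary = np.sum(w * y_train) / np.sum(w)
    print(inner, boundary, nw_boundary)

On this toy data the local linear fit recovers the line exactly, even at the
boundary, whereas the weighted average is biased upwards there.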


.. GENERATED FROM PYTHON SOURCE LINES 151-152

The bandwidth obtained by cross-validation is indicated below.

.. GENERATED FROM PYTHON SOURCE LINES 152-158

.. code-block:: Python

    llr.fit(X_basis_train, y_train)
    print(
        "LLR bandwidth:",
        llr.best_params_["kernel_estimator__bandwidth"],
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LLR bandwidth: 4.728762199830451


.. GENERATED FROM PYTHON SOURCE LINES 159-161

Although it is a slower method, the result obtained in this example should be
better than in the case of Nadaraya-Watson and KNN.

.. GENERATED FROM PYTHON SOURCE LINES 161-166

.. code-block:: Python


    y_pred = llr.predict(X_basis_test)
    llr_res = r2_score(y_pred, y_test)
    print("Score LLR:", llr_res)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Score LLR: 0.9731955244187162


.. GENERATED FROM PYTHON SOURCE LINES 167-171

For this data set, using the derivative should give better performance.

The plot of all the derivatives is shown below. The same colour scheme as
before is used: yellow means less fat, red means more.

.. GENERATED FROM PYTHON SOURCE LINES 171-183

.. code-block:: Python


    Xd = X.derivative()
    Xd.plot(gradient_criteria=fat, legend=True)
    plt.show()

    Xd_train, Xd_test, y_train, y_test = train_test_split(
        Xd,
        fat,
        test_size=0.2,
        random_state=1,
    )


.. image-sg:: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_002.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/ml/images/sphx_glr_plot_kernel_regression_002.png
   :class: sphx-glr-single-img
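
One possible reason why derivatives help, offered here as an assumption rather
than something established by this example, is that differentiation removes
constant vertical offsets between curves, so distances between derivatives
reflect differences in shape only. The toy curves below are hypothetical and
use ``np.gradient`` as a simple finite-difference derivative.

.. code-block:: Python

    import numpy as np

    grid = np.linspace(0.0, 1.0, 101)
    shape = np.sin(2 * np.pi * grid)

    # Two hypothetical curves with the same shape but different offsets.
    curve_a = shape + 0.5
    curve_b = shape + 3.0

    # Largest pointwise gap between the curves and between their derivatives.
    gap_curves = np.max(np.abs(curve_a - curve_b))
    gap_derivatives = np.max(
        np.abs(np.gradient(curve_a, grid) - np.gradient(curve_b, grid)),
    )
    print(gap_curves, gap_derivatives)

The curves are far apart, while their derivatives coincide up to rounding
error.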


.. GENERATED FROM PYTHON SOURCE LINES 184-186

Exactly the same operations are repeated, but now with the derivatives of the
functions.

.. GENERATED FROM PYTHON SOURCE LINES 188-189

K-Nearest Neighbours

.. GENERATED FROM PYTHON SOURCE LINES 189-205

.. code-block:: Python

    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={"kernel_estimator__n_neighbors": n_neighbors},
    )

    knn.fit(Xd_train, y_train)
    print(
        "KNN bandwidth:",
        knn.best_params_["kernel_estimator__n_neighbors"],
    )

    y_pred = knn.predict(Xd_test)
    dknn_res = r2_score(y_pred, y_test)
    print("Score KNN:", dknn_res)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    KNN bandwidth: 4
    Score KNN: 0.9428247359478524


.. GENERATED FROM PYTHON SOURCE LINES 206-207

Nadaraya-Watson

.. GENERATED FROM PYTHON SOURCE LINES 207-223

.. code-block:: Python

    bandwidth = np.logspace(-3, -1, num=100)
    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )

    nw.fit(Xd_train, y_train)
    print(
        "Nadaraya-Watson bandwidth:",
        nw.best_params_["kernel_estimator__bandwidth"],
    )

    y_pred = nw.predict(Xd_test)
    dnw_res = r2_score(y_pred, y_test)
    print("Score NW:", dnw_res)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Nadaraya-Watson bandwidth: 0.006135907273413175
    Score NW: 0.9491787548158307


.. GENERATED FROM PYTHON SOURCE LINES 224-226

For both Nadaraya-Watson and KNN the accuracy has improved significantly
and should be higher than 0.9.

.. GENERATED FROM PYTHON SOURCE LINES 228-229

Local Linear Regression

.. GENERATED FROM PYTHON SOURCE LINES 229-253

.. code-block:: Python

    Xd_basis = Xd.to_basis(basis=fourier)
    Xd_basis_train, Xd_basis_test, y_train, y_test = train_test_split(
        Xd_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )

    bandwidth = np.logspace(-2, 1, 100)
    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={"kernel_estimator__bandwidth": bandwidth},
    )

    llr.fit(Xd_basis_train, y_train)
    print(
        "LLR bandwidth:",
        llr.best_params_["kernel_estimator__bandwidth"],
    )

    y_pred = llr.predict(Xd_basis_test)
    dllr_res = r2_score(y_pred, y_test)
    print("Score LLR:", dllr_res)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LLR bandwidth: 0.010722672220103232
    Score LLR: 0.9949460304758446


.. GENERATED FROM PYTHON SOURCE LINES 254-256

LLR accuracy has also improved, but with derivatives its advantage over
Nadaraya-Watson and KNN is much smaller than with the original curves.
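
The scores printed in this example can be gathered into a small comparison.
The values below are copied from the outputs shown above; the dictionary
layout and the variable names are only one hypothetical way of summarising
them.

.. code-block:: Python

    # (score on original curves, score on derivatives), copied from above.
    scores = {
        "KNN": (0.3500795818805428, 0.9428247359478524),
        "NW": (0.3127155617541538, 0.9491787548158307),
        "LLR": (0.9731955244187162, 0.9949460304758446),
    }

    # Improvement obtained by switching to derivatives, per method.
    gains = {
        name: derivative - original
        for name, (original, derivative) in scores.items()
    }

    for name, (original, derivative) in scores.items():
        print(f"{name}: {original:.3f} -> {derivative:.3f}")

    best_on_derivatives = max(scores, key=lambda name: scores[name][1])
    print("Best on derivatives:", best_on_derivatives)

LLR gains the least from the derivatives because it already scored highly on
the original curves, yet it remains the best of the three methods.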


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.265 seconds)


.. _sphx_glr_download_auto_examples_ml_plot_kernel_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/GAA-UAM/scikit-fda/develop?filepath=examples/ml/plot_kernel_regression.py
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_kernel_regression.ipynb <plot_kernel_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_kernel_regression.py <plot_kernel_regression.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_kernel_regression.zip <plot_kernel_regression.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_