{ "metadata": { "anaconda-cloud": {}, "kernelspec": { "name": "python", "display_name": "Pyolite", "language": "python" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8" }, "metadata": { "interpreter": { "hash": "ac2eaa0ea0ebeafcc7822e65e46aa9d4f966f30b695406963e145ea4a91cd4fc" } } }, "nbformat_minor": 4, "nbformat": 4, "cells": [ { "cell_type": "markdown", "source": "
| | Unnamed: 0 | Unnamed: 0.1 | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | ... | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 122 | 88.6 | 0.811148 | 0.890278 | 48.8 | 2548 | 130 | ... | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | 0 | 1 |
| 1 | 1 | 1 | 3 | 122 | 88.6 | 0.811148 | 0.890278 | 48.8 | 2548 | 130 | ... | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | 0 | 1 |
| 2 | 2 | 2 | 1 | 122 | 94.5 | 0.822681 | 0.909722 | 52.4 | 2823 | 152 | ... | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | 0 | 1 |
| 3 | 3 | 3 | 2 | 164 | 99.8 | 0.848630 | 0.919444 | 54.3 | 2337 | 109 | ... | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | 0 | 1 |
| 4 | 4 | 4 | 2 | 164 | 99.4 | 0.848630 | 0.922222 | 54.3 | 2824 | 136 | ... | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | 0 | 1 |

5 rows × 21 columns
\nAn important step in testing your model is to split your data into training and testing sets. We will place the target variable, price, in a separate pandas Series, y_data:
\n", "metadata": {} }, { "cell_type": "code", "source": "y_data = df['price']", "metadata": { "trusted": true }, "execution_count": 13, "outputs": [] }, { "cell_type": "markdown", "source": "Drop price data in dataframe **x_data**:\n", "metadata": {} }, { "cell_type": "code", "source": "x_data=df.drop('price',axis=1)", "metadata": { "trusted": true }, "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "source": "Now, we randomly split our data into training and testing data using the function train_test_split.\n", "metadata": {} }, { "cell_type": "code", "source": "from sklearn.model_selection import train_test_split\n\n\nx_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)\n\n\nprint(\"number of test samples :\", x_test.shape[0])\nprint(\"number of training samples:\",x_train.shape[0])\n", "metadata": { "trusted": true }, "execution_count": 15, "outputs": [ { "name": "stdout", "text": "number of test samples : 21\nnumber of training samples: 180\n", "output_type": "stream" } ] }, { "cell_type": "markdown", "source": "The test_size parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 10% of the total dataset.\n", "metadata": {} }, { "cell_type": "markdown", "source": "It turns out that the test data, sometimes referred to as the \"out of sample data\", is a much better measure of how well your model performs in the real world. One reason for this is overfitting.\n\nLet's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.
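
The gap between in-sample and out-of-sample performance can be put into numbers by computing R^2 on each split. The block below is only a minimal sketch under stated assumptions: it uses synthetic data rather than the car dataset, a single feature, and standard-library helpers (`r_squared` and `fit_line` are hypothetical names introduced here, not part of this lab). With a fitted scikit-learn model, the `score` method of `LinearRegression` reports the same R^2 quantity.

```python
import random


def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    # Ordinary least squares with one feature; returns (intercept, slope).
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data (an assumption for illustration): linear trend plus noise.
rng = random.Random(0)
xs = [float(i) for i in range(40)]
ys = [3.0 * xi + 10.0 + rng.gauss(0.0, 8.0) for xi in xs]

# Shuffle the row indices and hold out 10 of the 40 rows (25%) for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
test_idx, train_idx = indices[:10], indices[10:]
xs_train = [xs[i] for i in train_idx]
ys_train = [ys[i] for i in train_idx]
xs_test = [xs[i] for i in test_idx]
ys_test = [ys[i] for i in test_idx]

# Fit on the training rows only, then score both splits with the same model.
intercept, slope = fit_line(xs_train, ys_train)
r2_train = r_squared(ys_train, [intercept + slope * xi for xi in xs_train])
r2_test = r_squared(ys_test, [intercept + slope * xi for xi in xs_test])
print('in-sample R^2:', round(r2_train, 3))
print('out-of-sample R^2:', round(r2_test, 3))
```

The model never sees the held-out rows during fitting, so the second number is the more honest estimate of how it would behave on new data.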
\n", "metadata": {} }, { "cell_type": "markdown", "source": "Let's create Multiple Linear Regression objects and train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features.\n", "metadata": {} }, { "cell_type": "code", "source": "lr = LinearRegression()\nlr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)", "metadata": { "trusted": true }, "execution_count": 33, "outputs": [ { "execution_count": 33, "output_type": "execute_result", "data": { "text/plain": "LinearRegression()" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Prediction using training data:\n", "metadata": {} }, { "cell_type": "code", "source": "yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])\nyhat_train[0:5]", "metadata": { "trusted": true }, "execution_count": 34, "outputs": [ { "execution_count": 34, "output_type": "execute_result", "data": { "text/plain": "array([ 7426.6731551 , 28323.75090803, 14213.38819709, 4052.34146983,\n 34500.19124244])" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Prediction using test data:\n", "metadata": {} }, { "cell_type": "code", "source": "yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])\nyhat_test[0:5]", "metadata": { "trusted": true }, "execution_count": 35, "outputs": [ { "execution_count": 35, "output_type": "execute_result", "data": { "text/plain": "array([11349.35089149, 5884.11059106, 11208.6928275 , 6641.07786278,\n 15565.79920282])" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Let's perform some model evaluation using our training and testing data separately. 
First, we import the seaborn and matplotlib libraries for plotting.\n", "metadata": {} }, { "cell_type": "code", "source": "import matplotlib.pyplot as plt\n%matplotlib inline\nimport seaborn as sns", "metadata": { "trusted": true }, "execution_count": 36, "outputs": [] }, { "cell_type": "markdown", "source": "Let's examine the distribution of the predicted values of the training data.\n", "metadata": {} }, { "cell_type": "code", "source": "Title = 'Distribution Plot of Predicted Value Using Training Data vs Training Data Distribution'\nDistributionPlot(y_train, yhat_train, \"Actual Values (Train)\", \"Predicted Values (Train)\", Title)", "metadata": { "trusted": true }, "execution_count": 37, "outputs": [ { "name": "stderr", "text": "/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).\n warnings.warn(msg, FutureWarning)\n/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).\n warnings.warn(msg, FutureWarning)\n", "output_type": "stream" }, { "output_type": "display_data", "data": { "text/plain": "Comparing Figure 1 and Figure 2, it is evident that the predicted distribution in Figure 1 (training data) fits the actual data much better than the one in Figure 2 (test data). The difference in Figure 2 is most apparent in the range of 5,000 to 15,000, where the shapes of the two distributions differ sharply. Let's see if polynomial regression also exhibits a drop in prediction accuracy when analysing the test dataset.
\n", "metadata": {} }, { "cell_type": "code", "source": "from sklearn.preprocessing import PolynomialFeatures", "metadata": { "trusted": true }, "execution_count": 39, "outputs": [] }, { "cell_type": "markdown", "source": "Overfitting occurs when the model fits the noise, but not the underlying process. Therefore, when testing your model using the test set, your model does not perform as well since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.
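
To make the phrase 'degree 5 polynomial' concrete, the sketch below expands a single feature value into the columns 1, x, x^2, ..., x^5. This mirrors the column layout that `PolynomialFeatures(degree=5)` produces for one input column with its default bias term, but `polynomial_expand` is a hypothetical helper written for illustration, not the library's code, and the sample values are made up.

```python
def polynomial_expand(x, degree):
    # Return [x**0, x**1, ..., x**degree] for a single feature value.
    return [x ** k for k in range(degree + 1)]


# One small value, so the powers are easy to check by hand.
print(polynomial_expand(2.0, 5))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# Expanding a whole column: one row of six terms per sample.
sample_values = [1.0, 2.0, 3.0]  # made-up feature values
expanded = [polynomial_expand(v, 5) for v in sample_values]
print(len(expanded), 'rows x', len(expanded[0]), 'columns')
```

A linear regression fitted on these six columns is linear in its coefficients but a fifth-degree polynomial in the original feature, which gives it enough flexibility to chase noise when training data is scarce.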
\n", "metadata": {} }, { "cell_type": "markdown", "source": "Let's use 55 percent of the data for training and the rest for testing:\n", "metadata": {} }, { "cell_type": "code", "source": "x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)", "metadata": { "trusted": true }, "execution_count": 40, "outputs": [] }, { "cell_type": "markdown", "source": "We will perform a degree 5 polynomial transformation on the feature 'horsepower'.\n", "metadata": {} }, { "cell_type": "code", "source": "pr = PolynomialFeatures(degree=5)\nx_train_pr = pr.fit_transform(x_train[['horsepower']])\nx_test_pr = pr.transform(x_test[['horsepower']])\npr", "metadata": { "trusted": true }, "execution_count": 41, "outputs": [ { "execution_count": 41, "output_type": "execute_result", "data": { "text/plain": "PolynomialFeatures(degree=5)" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Now, let's create a Linear Regression model \"poly\" and train it.\n", "metadata": {} }, { "cell_type": "code", "source": "poly = LinearRegression()\npoly.fit(x_train_pr, y_train)", "metadata": { "trusted": true }, "execution_count": 42, "outputs": [ { "execution_count": 42, "output_type": "execute_result", "data": { "text/plain": "LinearRegression()" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "We can see the output of our model using the method \"predict.\" We assign the values to \"yhat\".\n", "metadata": {} }, { "cell_type": "code", "source": "yhat = poly.predict(x_test_pr)\nyhat[0:5]", "metadata": { "trusted": true }, "execution_count": 43, "outputs": [ { "execution_count": 43, "output_type": "execute_result", "data": { "text/plain": "array([ 6728.58641321, 7307.91998787, 12213.73753589, 18893.37919224,\n 19996.10612156])" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Let's take the first four predicted values and compare them to the actual targets.\n", "metadata": {} }, { "cell_type": "code", "source": 
"print(\"Predicted values:\", yhat[0:4])\nprint(\"True values:\", y_test[0:4].values)", "metadata": { "trusted": true }, "execution_count": 44, "outputs": [ { "name": "stdout", "text": "Predicted values: [ 6728.58641321 7307.91998787 12213.73753589 18893.37919224]\nTrue values: [ 6295. 10698. 13860. 13499.]\n", "output_type": "stream" } ] }, { "cell_type": "markdown", "source": "We will use the function \"PollyPlot\" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.\n", "metadata": {} }, { "cell_type": "code", "source": "PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)", "metadata": { "trusted": true }, "execution_count": 45, "outputs": [ { "output_type": "display_data", "data": { "text/plain": "