Split Data into Training and Testing Set

About this lesson

As we begin to set up our linear regression model, we must define testing and training splits.

Quick reference

Split Data into Training and Testing Set

When to use

Before running any linear regression, you'll need to designate an X, a y, and a Train/Test Split.

Instructions

First, we need to import a couple of things into Jupyter Notebook:

   from sklearn.linear_model import LinearRegression
   from sklearn.model_selection import train_test_split

Next we need to designate our X and our y:

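   # Features: every column of the bost DataFrame built earlier in the course
   X = bost[bost.columns]
   # Target: the house prices in boston.target, as a one-column DataFrame
   y = pd.DataFrame(boston.target, columns=['Price'])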
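
For readers without the course notebook, here is a minimal sketch of the same X/y step. It assumes a small made-up DataFrame in place of `bost` and a made-up array in place of `boston.target`; the column names and values are invented for illustration only. (`load_boston` was removed from scikit-learn in version 1.2, so the original dataset may not load on newer installs.)

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's `bost` feature DataFrame.
bost = pd.DataFrame({
    "RM": [6.5, 6.4, 7.2, 7.0, 7.1],
    "AGE": [65.2, 78.9, 61.1, 45.8, 54.2],
})
# Hypothetical stand-in for `boston.target` (prices in thousands).
target = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

X = bost[bost.columns]                       # every column is a feature
y = pd.DataFrame(target, columns=["Price"])  # one-column target frame

print(X.shape, y.shape)
```

X and y must have the same number of rows, since each row of features is paired with one price.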

Finally, we need to designate which of our data will be test data and which will be training data for our model:

   # Hold out 40% of the rows for testing; random_state=10 makes the split repeatable
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10)
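
To see what the split returns, the sketch below runs the same call on randomly generated data with 506 rows, the same row count as the lesson's dataset. The feature names `f1` to `f3` and the values are assumptions made up for this example, not the course data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 506 rows; the values are random, not real house prices.
rng = np.random.default_rng(0)
n_rows = 506
X = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["f1", "f2", "f3"])
y = pd.DataFrame(rng.normal(size=n_rows), columns=["Price"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=10
)

# Roughly 60% of the rows land in the training set and 40% in the test set.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# X and y are shuffled together, so the row labels still line up.
print(X_train.index.equals(y_train.index))
```

The four returned pieces always come back in the order train/test for X, then train/test for y, which is why the left-hand side of the assignment is written in that order.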

Hints & tips

  • Import LinearRegression
  • Import the train_test_split function
  • Set our X and y
  • Designate what we want to train and test
  • 00:05 Okay, so in this video, we want to start to set up our linear regression model and
  • 00:09 we want to set up the testing and training split.
  • 00:11 And I'll talk about what that means in just a second.
  • 00:13 So first things first, we need to add a couple more imports up here.
  • 00:17 So from sklearn.linear_model, we want to import LinearRegression.
  • 00:22 This is just scikit-learn's linear regression class, which allows us to do
  • 00:26 all of our linear regression stuff.
  • 00:28 Then we also want to, from sklearn.model_selection,
  • 00:33 import train_test_split.
  • 00:35 And I'll talk about train test split in just a second.
  • 00:38 So first things first, we need to set up some data,
  • 00:40 we need to define the data that we're going to be using in our model.
  • 00:44 And remember when we talked about linear regression,
  • 00:47 I showed you that it looks like a scatterplot.
  • 00:49 It has an x and a y axis.
  • 00:50 So we need to define our x and our y.
  • 00:53 And x is usually designated with linear regression as capital X,
  • 00:58 and then y is usually lowercase y.
  • 01:01 So for X, we need to determine what features we want to train on.
  • 01:06 And the y is the target variable.
  • 01:08 So our y is going to be our house prices.
  • 01:10 We want to determine what prices are going to be in the future based on
  • 01:15 certain things.
  • 01:16 The X is the certain things, right?
  • 01:18 So, remember our headers, all those feature names?
  • 01:22 We're going to ask: if one of those features increases,
  • 01:26 should we expect the price to increase or decrease because of that?
  • 01:30 So our X is just going to be our bost data, but
  • 01:34 we want to designate all of the columns.
  • 01:39 So it's just bost, indexed by bost.columns.
  • 01:42 And if you're interested, if you don't remember what bost.columns are,
  • 01:46 we could just run this, and it's just our headers, right?
  • 01:49 So these are the features we're going to test against.
  • 01:55 Our y is going to be our price data.
  • 01:57 Remember that target stuff.
  • 01:58 So let's just create a data frame out of this.
  • 02:00 So let's go pd.DataFrame,
  • 02:04 and pass in our boston.target.
  • 02:07 We haven't actually done anything with it yet.
  • 02:08 So we'll just leave it as boston.target.
  • 02:10 And for columns, we can designate anything we want as a column header.
  • 02:15 Let's just call it price because that's what it is.
  • 02:18 We can actually just run our y here to see these are our prices in thousands.
  • 02:24 And that's just this boston.target that we looked at earlier.
  • 02:27 So now we have our X and our y.
  • 02:29 Now we need to set up our training data and our testing data.
  • 02:32 And that's where this train test split comes into play.
  • 02:35 Our test set serves sort of as a proxy for new data.
  • 02:39 And then our train data is the data on which we're going to apply the linear
  • 02:44 regression algorithm, right?
  • 02:46 So we have to designate what parts of our data are testing parts and
  • 02:50 what are training parts.
  • 02:52 So we're going to train our model, and we're going to test against it.
  • 02:55 So remember when we looked at this earlier, there are 500 and
  • 02:59 something records.
  • 03:00 We have to designate which of those we want to be testing and
  • 03:03 which of those we want to be training against.
  • 03:05 So to do that, we use our X and our y data.
  • 03:09 And I'm going to paste this in.
  • 03:11 It's X_train, this is the data we're going to train, and
  • 03:15 X_test, this is the X data we're going to test,
  • 03:19 versus the y data we're going to train and the y data we're going to test.
  • 03:24 And those are going to equal, and this is just tuple unpacking,
  • 03:27 train_test_split, which is this thing we imported.
  • 03:29 We're going to pass it (X, y),
  • 03:33 because these are the data that we designated here, and then our test size.
  • 03:39 Now the test size, this is going to tell us how many of those 506 rows
  • 03:44 are going to be testing data and how much of them are going to be training data.
  • 03:50 So I've put 0.4, so 40% of them are going to be testing.
  • 03:54 Sometimes people put 30%; for our purposes either one is fine.
  • 03:57 I'm just going to put 40%.
  • 03:59 And then this random_state seeds the random shuffle that decides which of those rows
  • 04:03 are going to be in the 40% test set and
  • 04:05 which of them are going to be in the 60% train set.
  • 04:08 So I'm going to put 10.
  • 04:09 You might want to put 10 if you want your data to look like mine.
  • 04:12 You can put anything here.
  • 04:13 You can put 101 if you want.
  • 04:15 It's basically a seed for the random number generator.
  • 04:18 We're pretty much ready to go.
  • 04:19 We've now defined how we want to train and test this thing.
  • 04:22 So in the next video, we'll train our linear regression model.
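
The point about random_state can be checked with a short sketch. It assumes a tiny made-up array rather than the lesson's data, and shows that the same seed reproduces the same split; a different seed, such as 101, will usually give a different one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed select the same test rows.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=10)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=10)
same_seed_matches = np.array_equal(a_test, b_test)
print(same_seed_matches)

# A different seed still holds out 40% of the rows,
# but which rows are held out will usually change.
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=101)
print(a_test.shape, c_test.shape)
```

This is why using the same random_state as the video should give you the same split, as long as your data and library version also match.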
