WEBVTT
NOTE Copyright (c) GoSkills Ltd, 2013 - 2019
00:00:04.823 --> 00:00:06.510
Hello, I'm Ray Sheen.
00:00:06.510 --> 00:00:07.730
Sometimes the dependent or
00:00:07.730 --> 00:00:11.530
response variable in the process depends upon more than one factor.
00:00:11.530 --> 00:00:15.380
When that happens, you need to do a multiple linear regression analysis.
00:00:17.170 --> 00:00:20.930
Once again, I'll start with our decision tree for hypothesis testing.
00:00:20.930 --> 00:00:24.240
When we have continuous variables for both the process response and
00:00:24.240 --> 00:00:27.580
the process independent variables, we turn to regression.
00:00:27.580 --> 00:00:30.940
And when we want to analyze multiple variables at the same time,
00:00:30.940 --> 00:00:33.480
we use the multiple regression technique.
00:00:33.480 --> 00:00:37.672
Let's take a few minutes to explain what we mean by multiple regression analysis.
00:00:37.672 --> 00:00:43.130
Recall that regression analysis determines the relationship between process variables.
00:00:43.130 --> 00:00:46.750
And it is no surprise that multiple regression considers multiple
00:00:46.750 --> 00:00:49.100
independent variables instead of just one.
00:00:49.100 --> 00:00:53.844
The analysis determines the impact that each of the independent variables has on
00:00:53.844 --> 00:00:55.389
the dependent variable.
00:00:55.389 --> 00:00:59.294
It will also determine the relative significance of the factors compared to
00:00:59.294 --> 00:01:02.201
each other, in addition to their effect on the dependent variable.
00:01:02.201 --> 00:01:06.857
The form of the equation is: the dependent variable is equal to a constant,
00:01:06.857 --> 00:01:10.560
which is referred to in this equation as beta zero.
00:01:10.560 --> 00:01:14.790
Then each of the terms is added with its appropriate scaling factor.
00:01:14.790 --> 00:01:20.360
So we see beta one times variable one, beta two times variable two and so on.
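NOTE Written out, the equation described above might look like the following. The symbols y (dependent variable), x_1 through x_k (independent variables), and k (number of factors) are notation assumed here for illustration; only beta zero, beta one, and beta two are named in the lesson.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k
```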
00:01:21.870 --> 00:01:24.780
Multiple regression analysis is particularly useful for
00:01:24.780 --> 00:01:27.410
predicting process performance.
00:01:27.410 --> 00:01:31.040
The multiple regression analysis will result in an equation that relates
00:01:31.040 --> 00:01:35.190
all of the independent variables to the dependent or response variable.
00:01:35.190 --> 00:01:39.080
This equation is incredibly helpful when you're designing a solution for
00:01:39.080 --> 00:01:41.340
a problem in a Lean Six Sigma project.
00:01:41.340 --> 00:01:44.180
The equation predicts the dependent variable performance
00:01:44.180 --> 00:01:48.010
based upon whatever values have been selected for the independent variables.
00:01:48.010 --> 00:01:52.760
So when designing the solution, determine the ideal process performance.
00:01:52.760 --> 00:01:55.810
Then determine what independent variable settings are needed
00:01:55.810 --> 00:01:57.830
to achieve that performance.
00:01:57.830 --> 00:02:00.690
Based upon the scaling constant for each of the factors,
00:02:00.690 --> 00:02:05.290
you can also decide which factors will be the primary control for the process.
00:02:05.290 --> 00:02:09.250
I prefer to use one easily controlled independent factor to control the overall
00:02:09.250 --> 00:02:10.410
process.
00:02:10.410 --> 00:02:13.370
And if possible, set the other factors in zones
00:02:13.370 --> 00:02:16.560
that are very easy to lock in to a standard setting.
00:02:16.560 --> 00:02:17.970
You can't always do that,
00:02:17.970 --> 00:02:21.320
but it does make the process control much easier when you can.
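NOTE A minimal sketch of using a fitted equation to design a solution, assuming the regression has already been run. The factor names, coefficient values, and the helper functions predict and solve_for_control are all hypothetical, made up for illustration. The idea is to lock the other factors at standard settings and solve for the one control factor that gives the target performance.

```python
# Hypothetical fitted coefficients: beta_0 plus one beta per factor.
# These numbers are invented for illustration, not taken from the lesson.
BETA_0 = 5.0
BETAS = {"temperature": 0.8, "pressure": -1.5, "speed": 0.25}


def predict(settings: dict) -> float:
    """Predicted response: beta_0 + sum(beta_i * x_i)."""
    return BETA_0 + sum(BETAS[name] * value for name, value in settings.items())


def solve_for_control(target: float, control: str, locked: dict) -> float:
    """Setting of the control factor that hits `target` with the other factors locked."""
    contribution = BETA_0 + sum(BETAS[name] * value for name, value in locked.items())
    return (target - contribution) / BETAS[control]


# Lock two factors at standard settings, then solve for the third.
locked = {"pressure": 2.0, "speed": 40.0}
temperature = solve_for_control(target=20.0, control="temperature", locked=locked)

print(temperature)
print(predict({"temperature": temperature, **locked}))
```

A factor with a larger beta, relative to the range over which it can be adjusted, moves the response more per unit of change, which is one reason to pick it as the primary control.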
00:02:22.320 --> 00:02:26.210
So let's look at how we conduct a multiple regression analysis.
00:02:26.210 --> 00:02:27.940
Excel can do multiple linear regression through the LINEST function or
00:02:27.940 --> 00:02:30.680
the Analysis ToolPak, but it is less convenient than a statistics package.
00:02:30.680 --> 00:02:33.310
So I'll just focus on the Minitab approach.
00:02:33.310 --> 00:02:37.480
In Minitab, go to the Stat pull-down menu, select Regression.
00:02:37.480 --> 00:02:39.170
Select Regression again, and
00:02:39.170 --> 00:02:43.840
then select Fit Regression Model, as shown here.
00:02:43.840 --> 00:02:45.520
That will bring up this panel.
00:02:45.520 --> 00:02:49.930
Place your cursor in the response window to activate the list of data columns
00:02:49.930 --> 00:02:51.460
in the window on the left.
00:02:51.460 --> 00:02:55.820
Then select the dependent variable, often referred to as the y factor, and
00:02:55.820 --> 00:02:57.760
click on the select button.
00:02:57.760 --> 00:03:00.880
That column name should now move to the response window.
00:03:00.880 --> 00:03:04.110
Now, place your cursor in the Continuous predictors window and
00:03:04.110 --> 00:03:06.140
then select the appropriate columns.
00:03:06.140 --> 00:03:09.960
You can also use categorical or discrete factors
00:03:09.960 --> 00:03:14.070
if you have them. However, if using this type of factor, I recommend that you
00:03:14.070 --> 00:03:18.530
always use factors that are binary, such as true/false,
00:03:18.530 --> 00:03:22.620
and set one of those values to one and the other to zero.
00:03:22.620 --> 00:03:27.120
And one more point, you can get the residual plots by selecting the graphs
00:03:27.120 --> 00:03:31.020
button and then choosing the Four in one residual plot option.
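NOTE For readers without Minitab, the same kind of fit can be sketched in Python with NumPy. This is not the Minitab procedure, only an ordinary least squares illustration of it. The data is synthetic and noise-free, generated from known coefficients so the result can be checked, and the variable names (x1, x2, shift) are assumptions. The two-level factor is coded as 0 and 1, as recommended above.

```python
import numpy as np

# Synthetic, noise-free data generated from known coefficients (illustration only).
rng = np.random.default_rng(0)
n = 40
x1 = rng.uniform(0.0, 10.0, n)      # continuous predictor
x2 = rng.uniform(50.0, 100.0, n)    # continuous predictor
shift = np.arange(n) % 2            # two-level factor coded 0 / 1
y = 3.0 + 2.0 * x1 - 0.5 * x2 + 4.0 * shift

# Design matrix: a column of ones for beta_0, then one column per factor.
X = np.column_stack([np.ones(n), x1, x2, shift])

# Ordinary least squares fit; betas[0] is the constant beta_0.
betas, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are what the residual plots display (observed minus fitted).
residuals = y - X @ betas

print(np.round(betas, 3))
```

With real process data the residuals will not be zero, and plotting them against the fitted values is the usual way to check the straight-line assumption.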
00:03:31.020 --> 00:03:34.400
Let's finish off this topic with a few warnings about some pitfalls
00:03:34.400 --> 00:03:36.980
when doing multiple regression analysis.
00:03:36.980 --> 00:03:40.270
This analysis still assumes linear effects,
00:03:40.270 --> 00:03:43.820
which means straight line effects for each of the independent variables.
00:03:43.820 --> 00:03:46.750
We'll look at interactive effects when we talk about non-linear
00:03:46.750 --> 00:03:49.220
regression in another class.
00:03:49.220 --> 00:03:52.925
Adding lots of independent variables can increase uncertainty.
00:03:52.925 --> 00:03:55.675
If you find that some factor has virtually no effect,
00:03:55.675 --> 00:03:59.475
I would remove it from the analysis just to simplify things.
00:03:59.475 --> 00:04:04.445
A lack of effect will be indicated by a very small beta value for that factor, relative to its units.
00:04:04.445 --> 00:04:08.395
Too many factors create too many potential interactions, and it becomes
00:04:08.395 --> 00:04:13.050
difficult to statistically validate the effect of each independent variable.
00:04:13.050 --> 00:04:17.840
A good rule of thumb is that your dataset size should be at least 10 times
00:04:17.840 --> 00:04:20.600
the number of independent factors being analyzed.
00:04:20.600 --> 00:04:23.220
So if you want to analyze four factors at once,
00:04:23.220 --> 00:04:26.580
the dataset needs to have a minimum of 40 points.
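NOTE The rule of thumb above is simple arithmetic, sketched here as a quick check. The function names min_dataset_size and has_enough_data are hypothetical, and the factor of 10 is the guideline from this lesson, not a statistical guarantee.

```python
def min_dataset_size(num_factors: int, points_per_factor: int = 10) -> int:
    """Minimum number of data points suggested by the 10-per-factor rule of thumb."""
    return num_factors * points_per_factor


def has_enough_data(num_points: int, num_factors: int) -> bool:
    """True when the dataset meets the rule-of-thumb minimum."""
    return num_points >= min_dataset_size(num_factors)


print(min_dataset_size(4))
print(has_enough_data(35, 4))
```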
00:04:26.580 --> 00:04:29.770
Also, when there are many independent factors in the analysis,
00:04:29.770 --> 00:04:33.080
the regression formula becomes much more sensitive to outliers.
00:04:34.770 --> 00:04:35.550
In many cases,
00:04:35.550 --> 00:04:40.590
the multiple linear regression analysis is just what you need to understand
00:04:40.590 --> 00:04:46.220
the handful of independent variables that are affecting the overall process output.
00:04:46.220 --> 00:04:50.960
The formula that's created is also very helpful when designing the solution
00:04:50.960 --> 00:04:51.620
for your problem.