en_Simple Linear Regression.srt
0
00:00:00,510 --> 00:00:05,280
Hello, and welcome! In this video, we’ll be covering linear regression.
1
00:00:05,280 --> 00:00:12,080
You don’t need to know any linear algebra to understand topics in linear regression.
2
00:00:12,080 --> 00:00:17,000
This high-level introduction will give you enough background information on linear regression
3
00:00:17,000 --> 00:00:20,380
to be able to use it effectively on your own problems.
4
00:00:20,380 --> 00:00:22,900
So, let’s get started.
5
00:00:22,900 --> 00:00:28,250
Let’s take a look at this dataset. It’s related to the CO2 emission of different
6
00:00:28,250 --> 00:00:33,210
cars. It includes Engine size, Cylinders, Fuel Consumption
7
00:00:33,210 --> 00:00:40,739
and CO2 emissions for various car models. The question is: Given this dataset, can we
8
00:00:40,739 --> 00:00:46,039
predict the CO2 emission of a car, using another field, such as Engine size?
9
00:00:46,039 --> 00:00:50,289
Quite simply, yes! We can use linear regression to predict a
10
00:00:50,289 --> 00:00:56,760
continuous value, such as CO2 emission, by using other variables.
11
00:00:56,760 --> 00:01:01,989
Linear regression is the approximation of a linear model used to describe the relationship
12
00:01:01,989 --> 00:01:07,830
between two or more variables. In simple linear regression, there are two
13
00:01:07,830 --> 00:01:13,350
variables: a dependent variable and an independent variable.
14
00:01:13,350 --> 00:01:18,480
The key point in linear regression is that our dependent variable must be continuous
15
00:01:18,480 --> 00:01:23,840
and cannot be a discrete value. However, the independent variable(s) can be
16
00:01:23,840 --> 00:01:28,810
measured on either a categorical or continuous measurement scale.
17
00:01:28,810 --> 00:01:36,420
There are two types of linear regression models. They are: simple regression and multiple regression.
18
00:01:36,420 --> 00:01:41,580
Simple linear regression is when one independent variable is used to estimate
19
00:01:41,580 --> 00:01:46,370
a dependent variable. For example, predicting CO2 emission using
20
00:01:46,370 --> 00:01:51,480
the EngineSize variable. When more than one independent variable is
21
00:01:51,480 --> 00:01:55,140
present, the process is called multiple linear regression.
22
00:01:55,140 --> 00:02:01,440
For example, predicting CO2 emission using EngineSize and Cylinders of cars.
23
00:02:01,440 --> 00:02:05,090
Our focus in this video is on simple linear regression.
24
00:02:05,090 --> 00:02:13,400
Now, let’s see how linear regression works. OK, so let’s look at our dataset again.
25
00:02:13,400 --> 00:02:18,019
To understand linear regression, we can plot our variables here.
26
00:02:18,019 --> 00:02:23,900
We show Engine size as an independent variable, and Emission as the target value that we would
27
00:02:23,900 --> 00:02:28,200
like to predict. A scatterplot clearly shows the relation between
28
00:02:28,200 --> 00:02:35,939
variables where changes in one variable "explain" or possibly "cause" changes in the other variable.
29
00:02:35,939 --> 00:02:41,330
Also, it indicates that these variables are linearly related.
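(As a quick aside: a minimal Python sketch for reproducing this kind of scatterplot might look like the following. The file and column names are assumptions for illustration; the video doesn’t specify them.)

import matplotlib.pyplot as plt
import pandas as pd

# Load the cars dataset (hypothetical file and column names).
df = pd.read_csv("FuelConsumption.csv")

# Plot engine size against CO2 emission to eyeball a linear relationship.
plt.scatter(df["ENGINESIZE"], df["CO2EMISSIONS"], alpha=0.5)
plt.xlabel("Engine size")
plt.ylabel("CO2 emission")
plt.show()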
30
00:02:41,330 --> 00:02:45,440
With linear regression, you can fit a line through the data.
31
00:02:45,440 --> 00:02:50,849
For instance, as the EngineSize increases, so do the emissions.
32
00:02:50,849 --> 00:02:54,840
With linear regression, you can model the relationship of these variables.
33
00:02:54,840 --> 00:03:01,290
A good model can be used to predict the approximate emission of each car.
34
00:03:01,290 --> 00:03:07,060
How do we use this line for prediction now? Let us assume, for a moment, that the line
35
00:03:07,060 --> 00:03:11,290
is a good fit of the data. We can use it to predict the emission of an
36
00:03:11,290 --> 00:03:15,940
unknown car. For example, for a sample car, with engine
37
00:03:15,940 --> 00:03:20,640
size 2.4, you can find the emission is 214.
38
00:03:20,640 --> 00:03:26,269
Now, let’s talk about what this fitting line actually is.
39
00:03:26,269 --> 00:03:30,239
We’re going to predict the target value, y.
40
00:03:30,239 --> 00:03:38,160
In our case, using the independent variable, "Engine Size," represented by x1.
41
00:03:38,160 --> 00:03:46,189
The fit line is traditionally shown as a polynomial. In a simple regression problem (a single x),
42
00:03:46,189 --> 00:03:55,799
the form of the model would be ŷ = θ0 + θ1x1. In this equation, ŷ is the dependent variable
43
00:03:55,799 --> 00:04:05,699
or the predicted value, and x1 is the independent variable; θ0 and θ1 are the parameters of
44
00:04:05,699 --> 00:04:11,730
the line that we must adjust. θ1 is known as the "slope" or "gradient"
45
00:04:11,730 --> 00:04:17,370
of the fitting line and θ0 is known as the "intercept."
46
00:04:17,370 --> 00:04:23,540
θ0 and θ1 are also called the coefficients of the linear equation.
47
00:04:23,540 --> 00:04:31,100
You can interpret this equation as ŷ being a function of x1, or ŷ being dependent on x1.
48
00:04:31,100 --> 00:04:35,700
Now the questions are: "How would you draw
49
00:04:35,700 --> 00:04:41,500
a line through the points?" And, "How do you determine which line ‘fits
50
00:04:41,500 --> 00:04:42,660
best’?"
51
00:04:42,660 --> 00:04:46,600
Linear regression estimates the coefficients of the line.
52
00:04:46,600 --> 00:04:54,060
This means we must calculate θ0 and θ1 to find the best line to ‘fit’ the data.
53
00:04:54,060 --> 00:04:59,600
This line would best estimate the emission of the unknown data points.
54
00:04:59,600 --> 00:05:05,220
Let’s see how we can find this line, or to be more precise, how we can adjust the
55
00:05:05,220 --> 00:05:09,230
parameters to make the line the best fit for the data.
56
00:05:09,230 --> 00:05:15,340
For a moment, let’s assume we’ve already found the best fit line for our data.
57
00:05:15,340 --> 00:05:21,660
Now, let’s go through all the points and check how well they align with this line.
58
00:05:21,660 --> 00:05:30,100
Best fit, here, means that if we have, for instance, a car with engine size x1=5.4, and
59
00:05:30,100 --> 00:05:41,990
actual CO2=250, its CO2 should be predicted very close to the actual value, which is y=250,
60
00:05:41,990 --> 00:05:44,160
based on historical data.
61
00:05:44,160 --> 00:05:51,410
But if we use the fit line, or more precisely, our polynomial with known parameters,
62
00:05:51,410 --> 00:05:57,650
to predict the CO2 emission, it will return ŷ = 340.
63
00:05:57,650 --> 00:06:04,440
Now, if you compare the actual value of the emission of the car with what we predicted
64
00:06:04,440 --> 00:06:10,030
using our model, you will find out that we have a 90-unit error.
65
00:06:10,030 --> 00:06:17,290
This means our prediction line is not accurate. This error is also called the residual error.
66
00:06:17,290 --> 00:06:24,920
So, we can say the error is the distance from the data point to the fitted regression line.
67
00:06:24,920 --> 00:06:31,190
The mean of all residual errors shows how poorly the line fits the whole dataset.
68
00:06:31,190 --> 00:06:38,880
Mathematically, it can be expressed by the mean squared error equation, known as MSE.
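(The equation itself appears on screen rather than in the narration; the standard form it refers to is MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where n is the number of data points, yᵢ is the actual value, and ŷᵢ is the value the fit line predicts.)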
69
00:06:38,880 --> 00:06:43,980
Our objective is to find a line where the mean of all these errors is minimized.
70
00:06:43,980 --> 00:06:49,680
In other words, the mean error of the prediction using the fit line should be minimized.
71
00:06:49,680 --> 00:06:56,330
Let’s re-word it more technically. The objective of linear regression is to minimize
72
00:06:56,330 --> 00:07:04,240
this MSE equation, and to minimize it, we should find the best parameters, θ0 and θ1.
73
00:07:04,240 --> 00:07:13,640
Now, the question is: how do we find θ0 and θ1 in such a way that this error is minimized?
74
00:07:13,640 --> 00:07:19,580
How can we find such a perfect line? Or, said another way, how should we find the
75
00:07:19,580 --> 00:07:25,250
best parameters for our line? Should we move the line around randomly and
76
00:07:25,250 --> 00:07:29,650
calculate the MSE value every time, and choose the minimum one?
77
00:07:29,650 --> 00:07:34,430
Not really! Actually, we have two options here:
78
00:07:34,430 --> 00:07:40,430
Option 1 - We can use a mathematical approach. Or, Option 2 - We can use an optimization
79
00:07:40,430 --> 00:07:41,490
approach.
80
00:07:41,490 --> 00:07:49,420
Let’s see how we can easily use a mathematical formula to find θ0 and θ1.
81
00:07:49,420 --> 00:07:56,750
As mentioned before, θ0 and θ1, in simple linear regression, are the coefficients of
82
00:07:56,750 --> 00:08:01,460
the fit line. We can use a simple equation to estimate these
83
00:08:01,460 --> 00:08:04,770
coefficients. That is, given that it’s a simple linear
84
00:08:04,770 --> 00:08:12,320
regression, with only 2 parameters, and knowing that θ0 and θ1 are the intercept and slope
85
00:08:12,320 --> 00:08:17,490
of the line, we can estimate them directly from our data.
86
00:08:17,490 --> 00:08:23,590
It requires that we calculate the mean of the independent and dependent or target columns,
87
00:08:23,590 --> 00:08:28,080
from the dataset. Notice that all of the data must be available
88
00:08:28,080 --> 00:08:34,560
to traverse and calculate the parameters. It can be shown that the intercept and slope
89
00:08:34,559 --> 00:08:40,729
can be calculated using these equations. We can start off by estimating the value for θ1.
90
00:08:40,729 --> 00:08:44,510
This is how you can find the slope of a line
91
00:08:44,510 --> 00:08:50,180
based on the data. x̄ is the average value for the engine size
92
00:08:50,180 --> 00:08:55,990
in our dataset. Please consider that we have 9 rows here,
93
00:08:55,990 --> 00:09:01,420
rows 0 to 8. First, we calculate the average of x1 and the
94
00:09:01,420 --> 00:09:06,490
average of y. Then we plug them into the slope equation to
95
00:09:06,490 --> 00:09:12,890
find θ1. The xi and yi in the equation refer to the
96
00:09:12,890 --> 00:09:20,070
fact that we need to repeat these calculations across all values in our dataset, and i refers
97
00:09:20,070 --> 00:09:24,860
to the i’th value of x or y.
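(The equations are shown on screen rather than spoken; the standard least-squares estimates they refer to are θ1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² and θ0 = ȳ − θ1x̄. A minimal Python sketch of this calculation, using made-up engine sizes and CO2 emissions rather than the actual course data:)

# Hypothetical engine-size and CO2-emission values for 9 cars (rows 0 to 8).
x = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
y = [196, 221, 136, 255, 244, 230, 232, 255, 267]

n = len(x)
x_bar = sum(x) / n  # average engine size
y_bar = sum(y) / n  # average CO2 emission

# Least-squares estimates for the slope (theta1) and intercept (theta0).
theta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
theta0 = y_bar - theta1 * x_bar

print("theta1 (slope):", theta1)
print("theta0 (intercept):", theta0)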
98
00:09:24,860 --> 00:09:32,090
Applying all values, we find θ1=39; it is our second parameter.
99
00:09:32,090 --> 00:09:37,140
It is used to calculate the first parameter, which is the intercept of the line.
100
00:09:37,140 --> 00:09:43,640
Now, we can plug θ1 into the line equation to find θ0.
101
00:09:43,640 --> 00:09:54,210
It is easily calculated that θ0=125.74. So, these are the two parameters for the line,
102
00:09:54,210 --> 00:10:02,529
where θ0 is also called the bias coefficient and θ1 is the coefficient for the Engine Size
103
00:10:02,529 --> 00:10:06,690
column. As a side note, you really don’t need to
104
00:10:06,690 --> 00:10:11,810
remember the formula for calculating these parameters, as most of the libraries used
105
00:10:11,810 --> 00:10:18,770
for machine learning in Python, R, and Scala can easily find these parameters for you.
106
00:10:18,770 --> 00:10:22,680
But it’s always good to understand how it works.
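(For instance, here is a minimal scikit-learn sketch, again with made-up data; scikit-learn is one such Python library, though the video doesn’t name it explicitly:)

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical engine sizes (as a column vector) and CO2 emissions.
X = np.array([[2.0], [2.4], [1.5], [3.5], [3.7]])
y = np.array([196, 221, 136, 255, 267])

# Fit the simple linear regression model; the library estimates
# the intercept (theta0) and slope (theta1) for us.
model = LinearRegression()
model.fit(X, y)

print("theta0 (intercept):", model.intercept_)
print("theta1 (slope):", model.coef_[0])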
107
00:10:22,680 --> 00:10:27,020
Now, we can write down the polynomial of the line.
108
00:10:27,020 --> 00:10:32,320
So, we know how to find the best fit for our data, and its equation.
109
00:10:32,320 --> 00:10:38,150
Now the question is: "How can we use it to predict the emission of a new car based on
110
00:10:38,150 --> 00:10:40,580
its engine size?"
111
00:10:40,580 --> 00:10:45,700
Once we’ve found the parameters of the linear equation, making predictions is as simple
112
00:10:45,700 --> 00:10:50,000
as solving the equation for a specific set of inputs.
113
00:10:50,000 --> 00:10:57,750
Imagine we are predicting CO2 emission (y) from EngineSize (x) for the automobile in record
114
00:10:57,750 --> 00:11:01,970
number 9. Our linear regression model representation
115
00:11:01,970 --> 00:11:09,640
for this problem would be: ŷ = θ0 + θ1x1.
116
00:11:09,640 --> 00:11:19,600
Or if we map it to our dataset, it would be CO2Emission = θ0 + θ1 × EngineSize.
117
00:11:19,600 --> 00:11:26,210
As we saw, we can find θ0 and θ1 using the equations that we just talked about.
118
00:11:26,210 --> 00:11:31,080
Once found, we can plug them into the equation of the linear model.
119
00:11:31,080 --> 00:11:44,480
For example, let’s use θ0=125 (rounding 125.74) and θ1=39. So, we can rewrite the linear model as CO2Emission = 125 + 39 × EngineSize.
120
00:11:44,480 --> 00:11:55,310
Now, let’s plug in the 9th row of our dataset and calculate the CO2 emission for a car with
121
00:11:55,310 --> 00:12:05,770
an EngineSize of 2.4. So CO2Emission = 125 + 39 × 2.4.
122
00:12:05,770 --> 00:12:14,020
Therefore, we can predict that the CO2 emission for this specific car would be 218.6.
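(A quick check of that arithmetic in Python, using the rounded parameters from above:)

theta0, theta1 = 125, 39

def predict_co2(engine_size):
    # y_hat = theta0 + theta1 * x1
    return theta0 + theta1 * engine_size

print(predict_co2(2.4))  # ≈ 218.6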
123
00:12:14,020 --> 00:12:20,130
Let’s talk a bit about why Linear Regression is so useful.
124
00:12:20,130 --> 00:12:25,320
Quite simply, it is the most basic regression to use and understand.
125
00:12:25,320 --> 00:12:30,730
In fact, one reason why Linear Regression is so useful is that it’s fast!
126
00:12:30,730 --> 00:12:36,350
It also doesn’t require tuning of parameters. So, something like tuning the K parameter
127
00:12:36,350 --> 00:12:41,990
in K-Nearest Neighbors or the learning rate in Neural Networks isn’t something to worry
128
00:12:41,990 --> 00:12:45,860
about. Linear Regression is also easy to understand
129
00:12:45,860 --> 00:12:48,460
and highly interpretable.
130
00:12:48,460 --> 00:12:50,220
Thanks for watching this video.