ESLII Chapter 3: Linear Methods for Regression

General Tips for Chapter 3:

Tip #1: Understanding Equations

When looking at equations, it helps to think simple to get an intuitive sense of what's going on. I like to take $N = 1$ unless otherwise stated (or unless there is a reason it can't be), so that you only have one data point. Also, know what assumptions you can make, and how they translate into math, when working through the equations yourself. There are many subtleties in the wording and the equations.

For example, if you have $XX^T$ vs $X^TX$, why do we use one or the other? Assuming that $X$ is zero-centered and is a vector, both represent variance, but in different forms. $XX^T$ is the outer product: a matrix describing how much each pair of features co-varies (summed over data points, this gives the covariance matrix scaled by $N$). The diagonal of this matrix, read as a vector, is the elementwise product of the $X$ vector with itself. $X^TX$, on the other hand, is the inner product: the sum of the total variance across all features. In other words, it is the sum of the diagonal of the $XX^T$ matrix. You can see how each of these differs in a practical sense: one gives a detailed breakdown of the variances (and covariances) across features, and the other gives an overall view of the variance in the system.
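
A minimal NumPy sketch of the $XX^T$ vs $X^TX$ distinction for a single zero-centered vector (the numbers and variable names are my own, just for illustration):

```python
import numpy as np

# A single zero-centered feature vector x (one data point, p = 3 features).
x = np.array([1.0, -2.0, 1.0])
x = x - x.mean()                      # center, so the products read as (co)variances

outer = np.outer(x, x)               # x x^T : p x p matrix of how features co-vary
inner = x @ x                        # x^T x : scalar, total squared magnitude

print(outer)           # diagonal is the elementwise product x * x
print(np.diag(outer))  # same as x * x
print(inner)           # same as np.trace(outer) = sum of the diagonal
```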

Make sure that you know the purpose behind each equation, and each operation. Looking at equations as just numbers, variables and functions is like reading a book by paying attention to each individual word instead of taking in the overall story. Instead, keep in mind what each part of an equation represents, and what it means to apply a certain operation. Later on, this will help you translate theory into math, and then into efficient algorithms. This doesn't mean you should neglect proving the results yourself; working through the proofs will get you familiar with the tools you can use.

One thing that helps with these representations is to read an equation vector-wise. In an $N \times p$ matrix $X$, $N$ counts the data points (rows) and $p$ counts the features (columns). Be careful about which part is multiplying which; sometimes one side matters more than the other.

Tip #2: Other General Tips:

Chapter 3.1

What does “E(Y|X) linear in the inputs” mean?

Assuming that Y and X have a linear relationship is a direct way to test whether they are proportional/correlated. Since this method is simple, and gets the job done to a sufficient degree for many tasks, it is used often.

Since we are making such heavy structural assumptions, which leave few degrees of freedom for the data to tune, this model needs less data than nonlinear models.

Chapter 3.2 Linear Regression Models and Least Squares

Introduction

X might not be purely raw data. It can also contain transformations of your data.

Why is (3.2) reasonable only on the assumption that the data consists of independent random draws?

Then why is the criterion still valid if the data is not random and independent? Why do the $y_i$'s only have to be conditionally independent given the inputs $x_i$?

Here is a very interesting observation. Notice Figure 3.2, and how it makes sense. We are using $X$ to construct $\hat{Y}$: $\hat{Y} = X\hat{\beta}$, so $\hat{Y}$ is literally built out of the columns of $X$. But how can we think of this as a projection?
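
A quick numerical check of the projection picture in Figure 3.2 (the data here is made up for illustration): $\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty$ is the orthogonal projection of $y$ onto the column space of $X$, so the residual is orthogonal to every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p features
y = rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: projects onto the column space of X
y_hat = H @ y                          # same as X @ beta_hat
residual = y - y_hat

print(np.allclose(X.T @ residual, 0))  # True: residual is orthogonal to X's columns
print(np.allclose(H @ H, H))           # True: projecting twice changes nothing
```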

This is an interesting problem that will be addressed later, so keep it in mind. If the features are not all linearly independent of one another, then $\hat{\beta}$ is not unique: more than one coefficient vector gives the same fit on the training data. This is a problem because some solutions might generalize better than others, and we would like to have the best one.
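
A small sketch of the non-uniqueness issue (made-up data): if one column of $X$ is an exact copy of another, different coefficient vectors give exactly the same fitted values, so least squares has no single "best" $\hat{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=10)
X = np.column_stack([x1, x1])        # second feature duplicates the first: rank 1
y = 3.0 * x1 + rng.normal(scale=0.1, size=10)

# Two different coefficient vectors ...
b1 = np.array([3.0, 0.0])
b2 = np.array([0.0, 3.0])

# ... produce identical predictions, so the data cannot tell them apart.
print(np.allclose(X @ b1, X @ b2))                     # True

# lstsq still returns an answer (the minimum-norm one), but it is only one of many.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                        # roughly [1.5, 1.5]
```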

When deriving some of these equations, we will assume linearity. This is done by writing $y = X\beta + \epsilon$, where the errors $\epsilon$ have mean zero and variance $\sigma^2$.

To get the variance under (3.8):
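
Writing it out (a standard derivation, using $\mathrm{Var}(y) = \sigma^2 I$ and treating $X$ as fixed):

$$\hat{\beta} = (X^TX)^{-1}X^Ty \;\Rightarrow\; \mathrm{Var}(\hat{\beta}) = (X^TX)^{-1}X^T\,\mathrm{Var}(y)\,X(X^TX)^{-1} = (X^TX)^{-1}\sigma^2,$$

and $\sigma^2$ itself is estimated by $\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$.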

What is the point of finding these? Remember that we are trying to work out the properties of $\hat{\beta}$. We needed to estimate $\sigma^2$ since it appears in our expression for the variance of $\hat{\beta}$.

$(N-p-1)\hat{\sigma}^2$ is distributed as $\sigma^2$ times a chi-squared variable with $N-p-1$ degrees of freedom, because we are effectively adding up the squared components of $y-\hat{y}$ along the $N-p-1$ directions orthogonal to the column space of $X$, and each of those components is normally distributed. (A chi-squared variable is a sum of squares of independent standard normals.)

Why is $\hat{\beta}$ statistically independent of $\hat{\sigma}^2$? $\hat{\beta}$ lives in the column space of $X$, which has $p+1$ dimensions, while $\hat{\sigma}^2$ is built from the residuals, which live in the orthogonal complement of $N-p-1$ dimensions.

What is the point of finding the distribution of $\hat{\beta}$? We can now do hypothesis tests and confidence intervals to evaluate our $\hat{\beta_j}$.

We want to test whether the true $\beta_j$ is 0, using $\hat{\beta_j}$. This way, we will know whether the corresponding input contributes to the output.

For (3.12), $v_j$ is the $j$th diagonal element of $(X^TX)^{-1}$, so $\hat{\sigma}\sqrt{v_j}$ is the estimated standard error of $\hat{\beta_j}$. In other words, (3.12) just uses the variance we calculated above, one coefficient at a time.
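
A sketch of the z-score computation in (3.12), on made-up data and assuming $X$ already contains the intercept column (in practice you would compare each $z_j$ against a $t_{N-p-1}$ or standard normal reference):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p features
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))   # estimate of sigma

v = np.diag(XtX_inv)                               # v_j: jth diagonal of (X^T X)^{-1}
z = beta_hat / (sigma_hat * np.sqrt(v))            # (3.12): one z-score per coefficient
print(z)   # the coefficient whose true value is 0 should have a small |z|
```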

3.2.1 Example: Prostate Cancer

3.2.2 The Gauss-Markov Theorem

Remember that least squares, the linear regression procedure we have been using, is unbiased: the expected value of $\hat{\beta}$ equals the true $\beta$ (equivalently, the expected value of $\hat{y}$ equals the true mean of $y$). The bias we are talking about is a mathematical bias. Here we assume that $E(y) = X\beta$, which matches the model we fit; notice, though, that this only places an assumption on the structure of the data. There are many other ways to build in assumptions. For example, we can also assume/bias towards certain kinds of weights, or we can make assumptions about the size of our models.

Why would we want to do this? We want to make up for our lack of training data through assumptions; in other words, we would like to decrease the variance of our model. Methods for doing this are explained further down. For now, we will see that even though least squares is the unbiased linear estimator with the smallest variance, this variance might still be too high, and we may need to introduce biased methods to compensate.

The Gauss–Markov theorem shows that least squares has the smallest variance among all unbiased linear estimators.

For (3.20), they use the bias–variance decomposition for estimators. Since we are estimating a fixed quantity, we can treat the true $\theta$ as a constant; the randomness comes only from the data we estimate it from.
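
Written out, the decomposition behind (3.20) is (with $\theta$ fixed and the expectation taken over the data used to build the estimator $\tilde{\theta}$):

$$\mathrm{MSE}(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2 = \mathrm{Var}(\tilde{\theta}) + \big[E(\tilde{\theta}) - \theta\big]^2.$$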

For (3.22), we are now dealing with a predictor, which means we also have to factor in the irreducible error of the true model (the noise around the true regression function).
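
Concretely, for a new response $Y_0 = f(x_0) + \epsilon_0$ at input $x_0$ and any estimate $\tilde{f}(x_0)$ built from the training data, the expected prediction error splits as

$$E\big(Y_0 - \tilde{f}(x_0)\big)^2 = \sigma^2 + E\big(\tilde{f}(x_0) - f(x_0)\big)^2 = \sigma^2 + \mathrm{MSE}\big(\tilde{f}(x_0)\big),$$

so the irreducible noise $\sigma^2$ sits on top of the estimation error from (3.20).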

3.2.3 Multiple Regression from Simple Univariate Regression

TBD