Philipp Burckhardt

On Statistics, Programming and the Social Sciences

Linear Regression

Most introductory textbooks on econometrics, e.g. "Introduction to Econometrics" by Stock & Watson, start with the univariate regression case, in which a dependent variable $y$ is explained by a single explanatory variable $x$. More precisely, one investigates the following relationship:

\begin{equation}
{Y_i} = \beta_0 + \beta_1 \cdot {X_i} + {\varepsilon _i}
\end{equation}

This was exactly the situation we faced in a previous post, in which we asked how women’s weight is connected to their body height. Additionally, strict exogeneity must hold, that is $E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right] = 0$.
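
As a concrete (and entirely made-up) example in Python, one can simulate data from this model; the parameter values below are hypothetical and only meant to mimic the height/weight setting from that post.

```python
import numpy as np

# Simulate data from the model Y_i = beta_0 + beta_1 * X_i + eps_i.
# The parameter values are hypothetical, chosen to mimic a height/weight setting.
rng = np.random.default_rng(42)

n = 200
beta_0, beta_1 = -100.0, 1.0
x = rng.normal(loc=165, scale=7, size=n)     # body height in cm
eps = rng.normal(loc=0, scale=5, size=n)     # error term, E[eps | X] = 0 by construction
y = beta_0 + beta_1 * x + eps                # body weight in kg
```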

For the estimation of $\beta_1$, the following rather obscure formula is usually displayed:
\begin{equation}
{\hat \beta _1} = \frac{{\frac{1}{{n - 1}}\sum\nolimits_{i = 1}^n {\left( {{X_i} - \bar X} \right)\left( {{Y_i} - \bar Y} \right)} }}{{\frac{1}{{n - 1}}\sum\nolimits_{i = 1}^n {{{\left( {{X_i} - \bar X} \right)}^2}} }}
\end{equation}
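
To make the formula a little less opaque, here is a direct Python translation of Equation (2); the small height/weight data set is made up purely for illustration.

```python
import numpy as np

# Direct translation of Equation (2): sample covariance over sample variance.
# The toy height/weight numbers below are made up for illustration.
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])   # heights in cm
y = np.array([55.0, 58.0, 63.0, 66.0, 70.0])        # weights in kg

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

numerator = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)
denominator = np.sum((x - x_bar) ** 2) / (n - 1)
beta_1_hat = numerator / denominator
print(beta_1_hat)
```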

Since I am not good at remembering things, I would like to present a simple derivation of this result, sparing myself the burden of keeping more and more formulas in mind. Let us write down $Cov\left( {{Y_i},{X_i}} \right)$ and insert the above definition of $Y_i$. Using the rules for covariances, this expression then simplifies to:

\begin{align*}
Cov\left( {{Y_i},{X_i}} \right) &= Cov\left( {\beta_0 + \beta _1 \cdot {X_i} + {\varepsilon _i},{X_i}} \right) \\\\
&= Cov\left( {\beta _1 \cdot {X_i},{X_i}} \right) + Cov\left( {{\varepsilon _i},{X_i}} \right) \\\\
&= \beta _1 \cdot Cov\left( {{X_i},{X_i}} \right) \\\\
&={\beta _1}\cdot Var\left( {{X_i}} \right)
\end{align*}

Solving for $\beta_1$, we arrive at:
\[ \to \beta _1 = \frac{{Cov\left( {{Y_i},{X_i}} \right)}}{{Var\left( {{X_i}} \right)}} \]

In this derivation, $Cov\left( {{\varepsilon _i},{X_i}} \right) = 0$ follows from the strict exogeneity assumption.
Proof:
1. Write down the definition of covariance:
\begin{equation}
Cov\left( {{\varepsilon _i},{X_i}} \right) = E\left[ {{\varepsilon _i}{X_i}} \right] - E\left[ {{\varepsilon _i}} \right]E\left[ {{X_i}} \right]
\end{equation}
2. By the Law of Total Expectation, we have:
\[E\left[ {{\varepsilon _i}} \right] = E\left[ {E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right]} \right] = 0\]
3. Similarly, we have
\[E\left[ {{\varepsilon _i}{X_i}} \right] = E\left[ {E\left[ {{\varepsilon _i}{X_i}\left| {{X_i}} \right.} \right]} \right] = E\left[ {{X_i}E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right]} \right] = 0\]
4. It follows that the error term and our regressor are uncorrelated, that is $ Cov\left( {{\varepsilon _i},{X_i}} \right) = 0 $ because both terms on the right-hand side of Equation (3) vanish.
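
For a quick numerical sanity check of this proof, one can simulate an error term that is drawn independently of $X_i$, so that $E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right] = 0$ holds by construction, and verify that the sample covariance is close to zero; all numbers below are made up.

```python
import numpy as np

# Sanity check: if eps is drawn independently of x (so E[eps | X] = 0),
# their sample covariance should be close to zero.
rng = np.random.default_rng(0)
x = rng.normal(165, 7, size=100_000)
eps = rng.normal(0, 5, size=100_000)

print(np.cov(eps, x)[0, 1])   # approximately 0, up to sampling noise
```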

Finally, to estimate $\beta_1$, we replace $ Var\left( {{X_i}} \right) $ and $ Cov\left( {{Y_i},{X_i}} \right) $ with their sample equivalents, which results in Equation (2) for $ {\hat \beta _1} $.
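
In code, this plug-in step is short. The following sketch, assuming simulated data with made-up parameter values, computes $ {\hat \beta _1} $ from numpy's sample covariance and variance and checks that it agrees with the explicit sums of Equation (2).

```python
import numpy as np

# beta_1_hat as the ratio of sample covariance to sample variance,
# on simulated data (the true slope and intercept below are hypothetical).
rng = np.random.default_rng(42)
n = 200
x = rng.normal(165, 7, size=n)                   # heights in cm
y = -100.0 + 1.0 * x + rng.normal(0, 5, size=n)  # weights in kg, plus noise

# np.cov / np.var with ddof=1 use the 1/(n-1) normalization of Equation (2);
# the factor cancels in the ratio anyway.
beta_1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Same number via the explicit sums of Equation (2):
check = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(beta_1_hat, check)   # both close to the true slope of 1.0
```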

Turning our attention to estimating $ \beta_0 $, we take the expectation of (1):
\[E\left[ {{Y_i}} \right] = E\left[ {{\beta _0} + {\beta _1}\cdot{X_i} + {\varepsilon _i}} \right]\]
Simplifying, we reach:

\begin{align*}
E\left[ {{Y_i}} \right] &= E\left[ {{\beta _0} + {\beta _1}\cdot {X_i} + {\varepsilon _i}} \right] \\\\
& = {\beta _0} + {\beta _1}\cdot E\left[ {{X_i}} \right] + E\left[ {{\varepsilon _i}} \right] \quad
\text{(by Linearity of Expectations)} \\\\
&= {\beta _0} + {\beta _1}\cdot E\left[ {{X_i}} \right]
\end{align*}

Replacing all elements with their sample equivalents, we get an estimate for $ \beta_0 $:
\[{{\hat \beta }_0} = \bar Y - {{\hat \beta }_1} \cdot \bar X\]
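
As a final sketch, the two estimates can be checked against numpy's built-in least-squares fit; again, the simulated data and parameter values are made up.

```python
import numpy as np

# Compute beta_0_hat = y_bar - beta_1_hat * x_bar and compare both estimates
# with numpy's least-squares fit (simulated data, hypothetical true values).
rng = np.random.default_rng(42)
n = 200
x = rng.normal(165, 7, size=n)
y = -100.0 + 1.0 * x + rng.normal(0, 5, size=n)

beta_1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0_hat = y.mean() - beta_1_hat * x.mean()

slope, intercept = np.polyfit(x, y, deg=1)   # ordinary least squares for comparison
print(beta_0_hat, intercept)                  # intercepts agree
print(beta_1_hat, slope)                      # slopes agree
```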

Reading the book "Mostly Harmless Econometrics" by Angrist and Pischke motivated me to write this post. They approach regression analysis from a different angle than many other textbooks, which focus heavily on linear algebra. Reading it has benefited me hugely, giving me an entirely different view on many topics.

I think the book is most suitable for readers who already have a working knowledge of regression and want to broaden their perspective and dig deeper, as it is written in a rather sketchy style and not every formula is explained in detail or properly derived. But it is a fun and interesting read, and there is even a chapter on why regression is called regression, something I had wondered about for quite some time. If you want to find out, check it out.