Linear Regression

Most introductory textbooks on econometrics, e.g. "Introduction to Econometrics"
by Stock & Watson, start with the univariate regression case, in which the dependent
variable y is explained by a single explanatory variable x. Precisely, one
investigates the following relationship:

\begin{equation}
{Y_i} = \beta_0 + \beta_1 \cdot {X_i} + {\varepsilon _i}
\end{equation}

This was exactly the situation we faced in a previous post, in which we asked
how women’s weight is connected to their body height. Additionally, strict
exogeneity must hold, that is $E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right] = 0$.

For the estimation of $\beta_1$, most of the time the following rather obscure
formula is displayed:
\begin{equation}
{\hat \beta _1} = \frac{{\frac{1}{{n - 1}}\sum\nolimits_{i = 1}^n {\left( {{X_i} - \bar X} \right)\left( {{Y_i} - \bar Y} \right)} }}{{\frac{1}{{n - 1}}\sum\nolimits_{i = 1}^n {{{\left( {{X_i} - \bar X} \right)}^2}} }}
\end{equation}

Since I am not good at remembering things, I would like to present a simple
derivation of this result, sparing myself the burden of keeping more and more
formulas in mind. Let us write $Cov\left( {{Y_i},{X_i}} \right)$ and then
insert the above definition of $Y_i$. Using the rules for covariances,
this expression simplifies to:

\begin{align*}
Cov\left( {{Y_i},{X_i}} \right) &= Cov\left( {\beta_0 + \beta_1 \cdot {X_i} + {\varepsilon _i},{X_i}} \right) \\
&= Cov\left( {\beta_1 \cdot {X_i},{X_i}} \right) + Cov\left( {{\varepsilon _i},{X_i}} \right) \\
&= \beta_1 \cdot Cov\left( {{X_i},{X_i}} \right) \\
&= \beta_1 \cdot Var\left( {{X_i}} \right)
\end{align*}

Solving for $\beta_1$, we arrive at:
$\beta _1 = \frac{{Cov\left( {{Y_i},{X_i}} \right)}}{{Var\left( {{X_i}} \right)}}$

In this derivation, $Cov\left( {{\varepsilon _i},{X_i}} \right) = 0$ follows
from the strict exogeneity assumption.
Proof:

1. Write down the definition of covariance:
\begin{equation}
Cov\left( {{\varepsilon _i},{X_i}} \right) = E\left[ {{\varepsilon _i}{X_i}}
\right] - E\left[ {{\varepsilon _i}} \right]E\left[ {{X_i}} \right]
\end{equation}
2. By the Law of Total Expectation, we have:
$E\left[ {{\varepsilon _i}} \right] = E\left[ {E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right]} \right] = 0$
3. Similarly, we have
$E\left[ {{\varepsilon _i}{X_i}} \right] = E\left[ {E\left[ {{\varepsilon _i}{X_i}\left| {{X_i}} \right.} \right]} \right] = E\left[ {{X_i}E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right]} \right] = 0$
4. It follows that the error term and our regressor are uncorrelated, that is $Cov\left( {{\varepsilon _i},{X_i}} \right) = 0$ because both terms on the
right-hand side of Equation (3) vanish.
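
The result of the proof can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes that $X_i$ and $\varepsilon_i$ are independent standard normal draws (which makes $E\left[ {{\varepsilon _i}\left| {{X_i}} \right.} \right] = 0$ hold by construction), and the helper `sample_cov`, the seed and the sample size are illustrative choices.

```python
import random


def sample_cov(a, b):
    """Sample covariance of two equal-length lists, using 1/(n - 1)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


random.seed(0)  # fixed seed so the run is reproducible
n = 20_000      # illustrative sample size

# Assumption: the regressor and the error are drawn independently,
# so strict exogeneity holds by construction.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
eps = [random.gauss(0.0, 1.0) for _ in range(n)]

cov_eps_x = sample_cov(eps, x)
print(f"sample Cov(eps, X) = {cov_eps_x:.4f}")  # expected to be close to zero
```

With a sample this large, the printed covariance should differ from zero only by sampling noise.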

Finally, to estimate $\beta_1$, we replace $Var\left( {{X_i}} \right)$ and $Cov\left( {{Y_i},{X_i}} \right)$ with their sample equivalents, which results
in Equation (2) for ${\hat \beta _1}$.
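
As a small illustration of this step, the sketch below computes the slope estimate as the ratio of the sample covariance to the sample variance. The function names and the five data points are made up for the example and do not come from the earlier post.

```python
def sample_cov(a, b):
    """Sample covariance of two equal-length lists, using 1/(n - 1)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def slope_estimate(x, y):
    """Estimate beta_1 as sample Cov(Y, X) divided by sample Var(X)."""
    # The sample variance of X is the sample covariance of X with itself.
    return sample_cov(y, x) / sample_cov(x, x)


# Hypothetical toy data, roughly following y = 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta1_hat = slope_estimate(x, y)
print(f"beta1_hat = {beta1_hat:.4f}")
```

Note that the two $\frac{1}{n - 1}$ factors cancel in the ratio, so the same estimate results from dividing the raw sums of cross-products and squares.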

Turning our attention to estimating $\beta_0$, we take the expectation of (1):

$E\left[ {{Y_i}} \right] = E\left[ {{\beta _0} + {\beta _1}\cdot{X_i} + {\varepsilon _i}} \right]$
Simplifying, we reach:

\begin{align*}
E\left[ {{Y_i}} \right] &= E\left[ {{\beta _0} + {\beta _1}\cdot {X_i} + {\varepsilon _i}} \right] \\
&= {\beta _0} + {\beta _1}\cdot E\left[ {{X_i}} \right] + E\left[ {{\varepsilon _i}} \right] \quad \text{(by Linearity of Expectations)} \\
&= {\beta _0} + {\beta _1}\cdot E\left[ {{X_i}} \right]
\end{align*}

Replacing all elements with their sample equivalents, we get an estimate for $\beta_0$:
${{\hat \beta }_0} = \bar Y - {{\hat \beta }_1} \cdot \bar X$
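
Both estimates can be put together in a few lines. The following is a sketch under the assumption of a small made-up data set; `fit_line` is a hypothetical helper name. It also checks a consequence of the intercept formula: the residuals of the fitted line sum to zero, because the line passes through the point of sample means.

```python
def fit_line(x, y):
    """Return (beta0_hat, beta1_hat) for a univariate regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares; the 1/(n - 1) factors cancel.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar  # intercept from the sample means
    return beta0_hat, beta1_hat


# Hypothetical toy data, roughly following y = 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = fit_line(x, y)
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}")  # zero up to rounding error
```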

Reading the book "Mostly Harmless Econometrics" by Angrist and Pischke
motivated me to write this post. They approach regression analysis from a
different angle than many other textbooks which focus heavily on the linear
algebra. The reading experience has hugely benefited me, giving me an entirely
different view on many topics.

I think that the book is most suitable for all those who have a working
knowledge of regression and want to broaden their perspective and dig deeper, as
it is written in a rather sketchy style, and not every formula is explained in detail or
properly derived. But it is a fun and interesting read, and there is even a
chapter on why regression is called regression, something I had asked myself for
quite some time. If you want to find out, check it out.