# Linear Regression in R

One of the nice things of R is the fact that it comes equipped with a large number of data sets to experiment with. A list of all datasets included in the installed packages can be obtained by calling:

data()


In this post, I would like to investigate the “women” dataset. To load the dataset into the current workspace, we may call:

data(women)


Having loaded the data set, it might be a good idea to take a closer look at it. By typing the name of an object in the command line, it gets printed:

women

##    height weight
## 1      58    115
## 2      59    117
## 3      60    120
## 4      61    123
## 5      62    126
## 6      63    129
## 7      64    132
## 8      65    135
## 9      66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164


Apparently, this object contains information for the height and weight of a sample of 15 women. However, this gets rather overwhelming with larger data frames. To get summary statistics for the height and weight variables, we may call

summary(women)

##      height         weight
##  Min.   :58.0   Min.   :115
##  1st Qu.:61.5   1st Qu.:124
##  Median :65.0   Median :135
##  Mean   :65.0   Mean   :137
##  3rd Qu.:68.5   3rd Qu.:148
##  Max.   :72.0   Max.   :164


A side note: The function summary(), like most functions of R, behaves differently depending on the class of the passed object. When a data.frame is passed as in our case, the function prints summary statistics for all the columns of said data.frame. The class of any given object can be extracted by calling the class() function. In the near future, I will write a blog entry on object-oriented programming in R, a topic I would like to study more deeply, the more so as R's implementation of object-orientated programming seems rather awkward to me at a first glance. However, for now we do not need to bother if we just use pre-implemented functions.

To examine the relationship between the two variables, it probably is a good idea to request a plot of our two variables in the xy-plane. As of today, R offers thousands of packages created by the community, and I would assume that even the most esoteric statistical methods are available. One of the coolest packages developed for the R environment is the ggplot2 package by Hadley Wickham which enables oneself to create gorgeous plots with ease, while at the same time offering a huge amount of flexibility because it is based on “The Grammar of Graphics” by Leland Wilkinson. After installing a package, it can be loaded by calling library().
So let's create a scatterplot:

library(ggplot2)
qplot(height, weight, data = women)


As can be seen, the relationship is almost perfectly close to a linear function. So let us fit a linear model with weight as the dependent variable and height as the single explanatory variable. In other statistics software like SPSS or EViews, one gets a huge amount of output when conducting a certain statistical analysis. In R, one normally creates a model object from which one then later extracts information whenever necessary.

To fit our model, we call:

mymodel <- lm(weight ~ height, data = women)


To inspect the results, type

summary(mymodel)

##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.733 -1.133 -0.383  0.742  3.117
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.5167     5.9369   -14.7  1.7e-09 ***
## height        3.4500     0.0911    37.9  1.1e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 13 degrees of freedom
## Multiple R-squared: 0.991,   Adjusted R-squared: 0.99
## F-statistic: 1.43e+03 on 1 and 13 DF,  p-value: 1.09e-14
##


Finally, we might want to include the fitted values in our plot.

mymodel.fitted <- fitted(mymodel)
qplot(height, weight, data = women) + geom_line(aes(height, mymodel.fitted))