The lm Function

lm(formula,
   data,
   subset,
   na.action)
formula - Specification of our regression model
data - The dataset containing the variables of the regression
subset - An option to subset the data
na.action - Option that specifies how to deal with missing values
The formula Argument

We can write our models using the following syntax:

model = formula(regressand ~ regressors)

where regressand is our dependent variable / response, usually denoted by \(y\), and regressors are our independent variables, e.g.:
happy_model = formula(happiness ~ age + income + n_children + married)
We can construct formulas with the following syntax:
+ adds a regressor:

formula(y ~ a + b)

: adds an interaction term:

formula(y ~ a + b + a:b)

* crosses two variables; a * b is equivalent to a + b + a:b:

formula(y ~ a + b + a:b) # and
formula(y ~ a*b)         # are equivalent

I() lets arithmetic expressions be evaluated literally:

formula(y ~ a + I(a^2)) # a quadratic term must be wrapped in I() to evaluate correctly
formula(y ~ log(a))     # log can stay by itself

. includes all other variables in the dataset:

formula(y ~ .)               # is equivalent to
formula(y ~ a + b + ... + z) # for a dataset with variables from a to z

- removes a variable:

formula(y ~ . - a)           # is equivalent to
formula(y ~ b + c + ... + z) # for a dataset with variables from a to z
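These operators can be combined freely in a single formula. A minimal sketch (with hypothetical variables y, a and b in a data frame dat):

f <- formula(y ~ a*b + I(a^2)) # main effects, interaction and a quadratic term
lm(f, data = dat)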
The subset Argument

lm(formula,
   data,
   subset = age < 30)

lm(formula,
   data,
   subset = age < 30 & height > 180)
Note that although this works, a best practice is to subset your data prior to the estimation, as sketched below. By keeping these steps distinct, your code will be much easier for someone else to understand.
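A minimal sketch of this best practice (assuming a hypothetical data frame dat with an age column):

young <- subset(dat, age < 30) # subset first ...
lm(formula, data = young)      # ... then estimate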
The na.action Argument

If the data contains missing values, lm automatically deletes the whole observation (the default is na.action = na.omit).

Set na.action = na.fail if you want an error when the data contains missing values.

Again, it is a best practice to look for missing values in your data prior to the estimation to keep your code transparent.
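For example, base R can count missing values per column (a sketch with a hypothetical data frame dat):

anyNA(dat)          # TRUE if any value is missing
colSums(is.na(dat)) # number of missing values per variable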
Use the missmap function from the Amelia package to get a nice visualisation of missing values in your data.
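A minimal sketch (assuming the Amelia package is installed and dat is a data frame):

library(Amelia)
missmap(dat) # map of missing vs. observed cells per variable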
lm with Wage Data

wage_data <- read.csv2("data/offline/wages2.csv")
head(wage_data)
## WAGE HOURS IQ SCORES EDUC EXPER TENURE AGE MARRIED BLACK SOUTH URBAN
## 1 769 40 93 35 12 11 2 31 1 0 0 1
## 2 808 50 119 41 18 11 16 37 1 0 0 1
## 3 825 40 108 46 14 11 9 33 1 0 0 1
## 4 650 40 96 32 12 13 7 32 1 0 0 1
## 5 562 40 74 27 11 14 5 34 1 0 0 1
## 6 1400 40 116 43 16 14 2 35 1 1 0 1
m1 <- formula(WAGE ~ EDUC + EXPER)
model <- lm(formula = m1,
            data = wage_data)
Output of lm

The lm function returns a list. Relevant components of this list are:

call - the function call that generated the output
coefficients - the OLS coefficients
residuals - the OLS residuals
fitted.values - the estimates for our dependent variable (WAGE)
model - the model frame (data) used for estimation

The full list of outputs can be looked up via ?lm.

Inspect the structure of the output with str(model), where model is our saved output from lm.

Access single components with the $ operator and tab completion, e.g. model$...
Let's look up our coefficients \(\beta\), fitted values \(\hat{y}\) and OLS residuals \(\hat{\varepsilon}\):
model$coefficients
## (Intercept) EDUC EXPER
## -272.52786 76.21639 17.63777
model$fitted.values[1:7] # first 7 fitted values
## 1 2 3 4 5 6 7
## 836.0843 1293.3826 988.5170 871.3598 812.7812 1193.8631 718.9270
model$residuals[1:7] # first 7 residuals
## 1 2 3 4 5 6
## -67.08427 -485.38260 -163.51705 -221.35981 -250.78119 206.13686
## 7
## -118.92703
We can visualise the results very simply with hist or plot:
hist(model$residuals, breaks = 30)
hist(model$fitted.values, breaks = 30)
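plot applied to the fitted lm object produces R's standard diagnostic plots (residuals vs. fitted values, normal Q-Q plot, etc.):

plot(model) # cycles through the diagnostic plots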
Summarising lm with the summary() function

summary(model)

##
## Call:
## lm(formula = m1, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -924.38 -252.74 -40.88 198.16 2165.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -272.528 107.263 -2.541 0.0112 *
## EDUC 76.216 6.297 12.104 < 2e-16 ***
## EXPER 17.638 3.162 5.578 3.18e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 376.3 on 932 degrees of freedom
## Multiple R-squared: 0.1359, Adjusted R-squared: 0.134
## F-statistic: 73.26 on 2 and 932 DF, p-value: < 2.2e-16
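The full coefficient table can also be extracted from the summary as a matrix for further processing:

summary(model)$coefficients # Estimate, Std. Error, t value, Pr(>|t|)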
Output with stargazer()
stargazer::stargazer(model, type = "text", style = "asr" )
##
## -------------------------------------------
## WAGE
## -------------------------------------------
## EDUC 76.216***
## EXPER 17.638***
## Constant -272.528*
## N 935
## R2 0.136
## Adjusted R2 0.134
## Residual Std. Error 376.295 (df = 932)
## F Statistic 73.260*** (df = 2; 932)
## -------------------------------------------
## *p < .05; **p < .01; ***p < .001
stargazer::stargazer(model,
                     type = "html",
                     out = "model.html")
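Analogously, a LaTeX table can be written to disk; type = "latex" is stargazer's default:

stargazer::stargazer(model,
                     type = "latex",
                     out = "model.tex")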
model2 <- lm(WAGE ~ EDUC + EXPER + IQ + SCORES, data = wage_data)
stargazer::stargazer(model, model2, type = "text", style = "asr")
##
## -------------------------------------------------------------------
## WAGE
## (1) (2)
## -------------------------------------------------------------------
## EDUC 76.216*** 47.472***
## EXPER 17.638*** 13.733***
## IQ 3.795***
## SCORES 8.736***
## Constant -272.528* -536.848***
## N 935 935
## R2 0.136 0.182
## Adjusted R2 0.134 0.179
## Residual Std. Error 376.295 (df = 932) 366.466 (df = 930)
## F Statistic 73.260*** (df = 2; 932) 51.788*** (df = 4; 930)
## -------------------------------------------------------------------
## *p < .05; **p < .01; ***p < .001
Specify the folder and file where your table should be saved as "path/name.type":

- .html: Open the file in your web browser and copy it into Word.
- .tex: Include it in your LaTeX document.

We can test for significance of single parameters with t-tests:

- Does EDUC have an influence on WAGE?
- Does the effect of EDUC change when more variables are added?

We can test for joint significance of a group of variables with F-tests:

- Are TENURE, EXPER and SCORES jointly significant, once we control for personal variables like IQ and EDUC?
model3 <- lm(WAGE ~ IQ + EDUC, data = wage_data)
model4 <- lm(WAGE ~ IQ + EDUC + TENURE + EXPER + SCORES, data = wage_data)
stargazer::stargazer(model3, model4, type = "text", style = "asr")
##
## -------------------------------------------------------------------
## WAGE
## (1) (2)
## -------------------------------------------------------------------
## IQ 5.138*** 3.697***
## EDUC 42.058*** 47.270***
## TENURE 6.247*
## EXPER 11.859***
## SCORES 8.270***
## Constant -128.890 -531.039***
## N 935 935
## R2 0.134 0.188
## Adjusted R2 0.132 0.183
## Residual Std. Error 376.730 (df = 932) 365.393 (df = 929)
## F Statistic 72.015*** (df = 2; 932) 42.967*** (df = 5; 929)
## -------------------------------------------------------------------
## *p < .05; **p < .01; ***p < .001
\(WAGE = \beta_0 + \beta_{1}IQ + \beta_{2}EDUC + \beta_{3}TENURE + \beta_{4}EXPER + \beta_{5}SCORES + \varepsilon\)

We test whether the joint effect of TENURE, EXPER and SCORES is zero:

\[ H_0: \beta_{3} = \beta_{4} = \beta_{5} = 0 \\ H_1: H_0 \text{ is not true} \]
Since OLS minimises the \(SSR\), the \(SSR\) always increases if we drop variables. The question is whether that increase is significant.
\[ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur} / (n-k-1)} \sim F_{q,\, n-k-1} \\ q = \text{number of restrictions} \\ k = \text{number of parameters in the unrestricted model} \]
If the \(H_0\) is true, our test statistic follows the \(F\)-distribution and we can calculate p-values for our test, as sketched below.
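For illustration, a minimal sketch of computing the statistic by hand from the restricted model (model3) and the unrestricted model (model4); it reproduces the anova() result below:

ssr_r  <- sum(residuals(model3)^2)    # SSR of the restricted model
ssr_ur <- sum(residuals(model4)^2)    # SSR of the unrestricted model
q  <- 3                               # number of restrictions
df <- df.residual(model4)             # n - k - 1
F_stat <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
pf(F_stat, q, df, lower.tail = FALSE) # p-value of the F-test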
anova(model3, model4)
## Analysis of Variance Table
##
## Model 1: WAGE ~ IQ + EDUC
## Model 2: WAGE ~ IQ + EDUC + TENURE + EXPER + SCORES
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 932 132274591
## 2 929 124032931 3 8241660 20.576 6.495e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1