lm Function

lm(formula,
   data,
   subset,
   na.action)
formula - Specification of our regression model
data - The dataset containing the variables of the regression
subset - An option to subset the data
na.action - Option that specifies how to deal with missing values
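Putting these together, a complete call might look like the following minimal sketch (the data frame df and the variables y, x1 and x2 are hypothetical):

lm(formula = y ~ x1 + x2,  # regression specification
   data = df,              # dataset containing y, x1, x2
   subset = x1 > 0,        # estimate only on observations with x1 > 0
   na.action = na.omit)    # drop observations with missing values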
formula Argument

We can write our models using the following syntax:
model <- formula(regressand ~ regressors)

where regressand is our dependent variable / response, usually denoted by \(y\), and regressors are our independent variables, e.g.:

happy_model <- formula(happiness ~ age + income + n_children + married)
We can construct formulas with the following syntax:

+ adds a regressor:
formula(y ~ a + b)

: adds an interaction between regressors:
formula(y ~ a + b + a:b)

* crosses regressors; a * b is equivalent to a + b + a:b:
formula(y ~ a + b + a:b) # and
formula(y ~ a*b)         # are equivalent

I() evaluates arithmetic inside the formula:
formula(y ~ a + I(a^2)) # quadratic term must be in I() to evaluate correctly
formula(y ~ log(a))     # log can stay by itself

. includes all other variables in the dataset:
formula(y ~ .)               # is equivalent to
formula(y ~ a + b + ... + z) # for a dataset with variables from a to z

- excludes a variable:
formula(y ~ . - a)           # is equivalent to
formula(y ~ b + c + ... + z) # for a dataset with variables from a to z
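These pieces can be combined freely within one formula; a small sketch with hypothetical variables:

f <- formula(y ~ a*b + I(a^2) + log(c))
# equivalent to: y ~ a + b + a:b + I(a^2) + log(c)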
subset Argument

lm(formula,
   data,
   subset = age < 30)

lm(formula,
   data,
   subset = age < 30 & height > 180)
Note that although this works, it is best practice to subset your data prior to estimation, as sketched below. By keeping these steps distinct, your code will be much easier for someone else to understand.
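A minimal sketch of that two-step approach, assuming a data frame df with columns age, height, y and x:

df_sub <- subset(df, age < 30 & height > 180)  # subset first ...
lm(y ~ x, data = df_sub)                       # ... then estimate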
na.action Argument

If the data contains missing values, lm automatically deletes the whole observation (the default is na.action = na.omit).

Set na.action = na.fail if you want an error when the data contains missing values. Again, it is best practice to look for missing values in your data prior to estimation to keep your code transparent.
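A minimal sketch of such a check, assuming a data frame df:

colSums(is.na(df))          # number of missing values per column
df_complete <- na.omit(df)  # explicitly drop incomplete observations
# with na.action = na.fail, lm(y ~ x, data = df) errors if df contains NAs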
You can also use the missmap function from the Amelia package to get a nice visualisation of missing values in your data.

lm with Wage Data

wage_data <- read.csv2("data/offline/wages2.csv")
head(wage_data)
##   WAGE HOURS  IQ SCORES EDUC EXPER TENURE AGE MARRIED BLACK SOUTH URBAN
## 1  769    40  93     35   12    11      2  31       1     0     0     1
## 2  808    50 119     41   18    11     16  37       1     0     0     1
## 3  825    40 108     46   14    11      9  33       1     0     0     1
## 4  650    40  96     32   12    13      7  32       1     0     0     1
## 5  562    40  74     27   11    14      5  34       1     0     0     1
## 6 1400    40 116     43   16    14      2  35       1     1     0     1
m1 <- formula(WAGE ~ EDUC + EXPER)
model <- lm(formula = m1,
            data = wage_data)
lm Output

The lm function returns a list. Relevant components of this list are:

call - the function call that generated the output
coefficients - the OLS coefficients
residuals - the OLS residuals
fitted.values - the estimates for our dependent variable (WAGE)
model - the model matrix used for estimation

The full list of outputs can be looked up via ?lm or with str(model), where model is our saved output from lm. Individual components are accessed with the $ operator and tab completion, e.g. model$...

Let's look at our coefficients \(\beta\), fitted values \(\hat{y}\) and OLS residuals \(\hat{\varepsilon}\):
model$coefficients

## (Intercept)        EDUC       EXPER
##  -272.52786    76.21639    17.63777

model$fitted.values[1:7] # first 7 fitted values

##         1         2         3         4         5         6         7
##  836.0843 1293.3826  988.5170  871.3598  812.7812 1193.8631  718.9270

model$residuals[1:7] # first 7 residuals

##          1          2          3          4          5          6
##  -67.08427 -485.38260 -163.51705 -221.35981 -250.78119  206.13686
##          7
## -118.92703
We can visualise the results very simply with hist or plot:
hist(model$residuals, breaks = 30)     # histogram of the residuals

hist(model$fitted.values, breaks = 30) # histogram of the fitted values

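Since plot is mentioned as well, one common diagnostic (our choice of example) is residuals against fitted values:

plot(model$fitted.values, model$residuals,
     xlab = "Fitted values", ylab = "Residuals")  # residuals vs. fitted
abline(h = 0, lty = 2)                            # reference line at zero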
lm with the summary() function

summary(model)

##
## Call:
## lm(formula = m1, data = wage_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -924.38 -252.74  -40.88  198.16 2165.70
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -272.528    107.263  -2.541   0.0112 *
## EDUC          76.216      6.297  12.104  < 2e-16 ***
## EXPER         17.638      3.162   5.578 3.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 376.3 on 932 degrees of freedom
## Multiple R-squared:  0.1359, Adjusted R-squared:  0.134
## F-statistic: 73.26 on 2 and 932 DF,  p-value: < 2.2e-16
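The object returned by summary() is itself a list, so single statistics can be pulled out directly; for example:

s <- summary(model)
s$coefficients    # coefficient table incl. std. errors, t and p values
s$r.squared       # 0.1359
s$adj.r.squared   # 0.134
s$sigma           # residual standard error: 376.3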
stargazer()

stargazer::stargazer(model, type = "text", style = "asr")
##
## -------------------------------------------
##                     WAGE
## -------------------------------------------
## EDUC                76.216***
## EXPER               17.638***
## Constant            -272.528*
## N                   935
## R2                  0.136
## Adjusted R2         0.134
## Residual Std. Error 376.295 (df = 932)
## F Statistic         73.260*** (df = 2; 932)
## -------------------------------------------
## *p < .05; **p < .01; ***p < .001
stargazer::stargazer(model,
                     type = "html",
                     out = "model.html")
model2 <- lm(WAGE ~ EDUC + EXPER + IQ + SCORES, data = wage_data)
stargazer::stargazer(model, model2, type = "text", style = "asr")
##
## -------------------------------------------------------------------
##                               WAGE
##                     (1)                     (2)
## -------------------------------------------------------------------
## EDUC                76.216***               47.472***
## EXPER               17.638***               13.733***
## IQ                                          3.795***
## SCORES                                      8.736***
## Constant            -272.528*               -536.848***
## N                   935                     935
## R2                  0.136                   0.182
## Adjusted R2         0.134                   0.179
## Residual Std. Error 376.295 (df = 932)      366.466 (df = 930)
## F Statistic         73.260*** (df = 2; 932) 51.788*** (df = 4; 930)
## -------------------------------------------------------------------
## *p < .05; **p < .01; ***p < .001
Specify the folder and file where your table should be saved as "path/name.type":

.html - open the file in your web browser and copy it into Word
.tex - include it in your LaTeX document (see the sketch below)
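For the .tex route, a sketch mirroring the html example above (the file name model.tex is our choice):

stargazer::stargazer(model,
                     type = "latex",
                     out = "model.tex")  # then \input{model.tex} in LaTeX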
We can test for significance of single parameters with t-tests:
- e.g. whether EDUC has an influence on WAGE
- note how the coefficient on EDUC changes when more variables are added

We can test for joint significance of a group of variables with F-tests:
- e.g. are TENURE, EXPER and SCORES jointly significant, once we control for personal variables like IQ and EDUC?

model3 <- lm(WAGE ~ IQ + EDUC, data = wage_data)
model4 <- lm(WAGE ~ IQ + EDUC + TENURE + EXPER + SCORES, data = wage_data)
stargazer::stargazer(model3, model4, type = "text", style = "asr")
##
## -------------------------------------------------------------------
##                               WAGE
##                     (1)                     (2)
## -------------------------------------------------------------------
## IQ                  5.138***                3.697***
## EDUC                42.058***               47.270***
## TENURE                                      6.247*
## EXPER                                       11.859***
## SCORES                                      8.270***
## Constant            -128.890                -531.039***
## N                   935                     935
## R2                  0.134                   0.188
## Adjusted R2         0.132                   0.183
## Residual Std. Error 376.730 (df = 932)      365.393 (df = 929)
## F Statistic         72.015*** (df = 2; 932) 42.967*** (df = 5; 929)
## -------------------------------------------------------------------
## *p < .05; **p < .01; ***p < .001
Consider the model \(WAGE = \beta_0 + \beta_{1}IQ + \beta_{2}EDUC + \beta_{3}TENURE + \beta_{4}EXPER + \beta_{5}SCORES + u\).

We test whether the joint effect of TENURE, EXPER and SCORES is zero: \[ H_0: \beta_{3} = \beta_{4} = \beta_{5} = 0 \\ H_1: H_0 \text{ is not true}\]
Since OLS minimises the \(SSR\), the \(SSR\) always increases if we drop variables. The question is whether that increase is significant.
\[ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur} / (n-k-1)} \sim F_{q,\, n-k-1} \\ q = \text{number of restrictions} \\ k = \text{number of parameters in the unrestricted model}\]
If \(H_0\) is true, our test statistic follows the \(F\)-distribution with \(q\) and \(n-k-1\) degrees of freedom, and we can calculate p-values for our test.
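As a check on the formula, the statistic can be computed by hand from model3 (restricted) and model4 (unrestricted) before letting anova do the work:

ssr_r  <- sum(residuals(model3)^2)  # restricted SSR
ssr_ur <- sum(residuals(model4)^2)  # unrestricted SSR
q      <- 3                         # restrictions: TENURE, EXPER, SCORES
df_ur  <- df.residual(model4)       # n - k - 1 = 929
F_stat <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)   # ~ 20.576
pf(F_stat, df1 = q, df2 = df_ur, lower.tail = FALSE)  # ~ 6.5e-13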
anova(model3, model4)
## Analysis of Variance Table
##
## Model 1: WAGE ~ IQ + EDUC
## Model 2: WAGE ~ IQ + EDUC + TENURE + EXPER + SCORES
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)
## 1    932 132274591
## 2    929 124032931  3   8241660 20.576 6.495e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1