This assignment explores linear regression modeling using the Boston housing dataset, focusing on assumption checks and model interpretation. Students will fit a regression model, check its assumptions, and interpret the results in a structured, easy-to-grade format.
The response variable is medv (median home value), and the predictors include rm (average rooms per house), lstat (percentage of lower-income population), and crim (crime rate per capita).
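The code that loads the data and fits this baseline model is not shown above; the following is a minimal sketch, assuming boston_data is the Boston housing data from the MASS package and that the fitted object is named model (both names are assumptions):

library(MASS)    # provides the Boston housing data (assumed source of boston_data)
library(dplyr)   # %>% and mutate() are used in the transformation step later
boston_data <- Boston   # assumed object name; columns include medv, rm, lstat, crim, dis, age
model <- lm(medv ~ rm + lstat + crim, data = boston_data)   # baseline model
summary(model)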
##
## Call:
## lm(formula = medv ~ rm + lstat + crim, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.925 -3.567 -1.157 1.906 29.024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.56225 3.16602 -0.809 0.41873
## rm 5.21695 0.44203 11.802 < 2e-16 ***
## lstat -0.57849 0.04767 -12.135 < 2e-16 ***
## crim -0.10294 0.03202 -3.215 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.49 on 502 degrees of freedom
## Multiple R-squared: 0.6459, Adjusted R-squared: 0.6437
## F-statistic: 305.2 on 3 and 502 DF, p-value: < 2.2e-16
- Residuals vs Fitted Plot: If a pattern exists, non-linearity may be present.
- Q-Q Plot: Checks whether the residuals are normally distributed.
- Scale-Location Plot: Detects heteroscedasticity.
- Residuals vs Leverage Plot: Identifies influential observations.
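These four plots can be generated directly from the fitted model; a minimal sketch, assuming the baseline fit is stored in an object named model as in the sketch above:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(model)            # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))   # reset the plotting layout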
If assumptions are violated, log-transforming variables can improve the model.
boston_data <- boston_data %>%
  mutate(
    log_medv  = log(medv),        # log-transform the response
    log_lstat = log(lstat + 1)    # +1 guards against taking the log of values at or near zero
  )
log_model <- lm(log_medv ~ rm + log_lstat + crim, data = boston_data)
summary(log_model)
##
## Call:
## lm(formula = log_medv ~ rm + log_lstat + crim, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70866 -0.11815 -0.01845 0.11886 0.89289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.587817 0.155975 23.003 < 2e-16 ***
## rm 0.101366 0.017589 5.763 1.44e-08 ***
## log_lstat -0.464013 0.024455 -18.974 < 2e-16 ***
## crim -0.011523 0.001178 -9.780 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2078 on 502 degrees of freedom
## Multiple R-squared: 0.743, Adjusted R-squared: 0.7415
## F-statistic: 483.8 on 3 and 502 DF, p-value: < 2.2e-16
library(caret)   # trainControl() and train() come from the caret package
set.seed(123)    # reproducible fold assignment
train_control <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
cv_model <- train(medv ~ rm + lstat + crim, data = boston_data, method = "lm", trControl = train_control)
cv_model
## Linear Regression
##
## 506 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 455, 456, 456, 456, 456, 456, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.487973 0.6455425 3.921115
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Answer the following questions and submit your answers in a well-organized R Markdown file.
What does the coefficient of rm mean in the original model? (1 point)

The coefficient of rm means that for each additional room in a house, the median home value increases by the size of the coefficient, holding the other variables constant. Here the coefficient is 5.2169, indicating that each additional room is associated with an average increase of about 5.22 units in median home value, holding the other predictors constant.

How does lstat impact median home value, and why? (1 point)

The coefficient of lstat is negative (-0.578), which means that the percentage of lower-income residents is negatively associated with median home value. Residents with lower incomes tend to live in neighborhoods with lower home values.

Is the crime rate (crim) statistically significant? Justify using the p-value. (1 point)

The p-value for crim is 0.00139, which is well below the 0.05 threshold, so crim is a statistically significant predictor of median home value.

The Residuals vs Fitted plot shows a slight pattern, indicating that the relationship between the predictors and the response may not be perfectly linear. An ideal plot would show no pattern, with the red trend line lying flat along zero.
The Q-Q plot shows that the residuals are not perfectly normally distributed, especially in the upper tail. Setting those outliers aside, however, the residuals are approximately normally distributed.
The Scale-Location plot shows that the residuals are generally homoscedastic, with a few outliers around fitted values of 20; the residuals are spread fairly evenly across the range of fitted values.
In summary, the model reasonably meets the homoscedasticity and normality-of-residuals assumptions, but there is a slight violation of the linearity assumption. The multicollinearity check will be conducted via the VIF test.
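A minimal sketch of that VIF check for the baseline model, assuming the car package is available and the fit is stored in an object named model (neither is shown in the original):

library(car)   # vif() comes from the car package
vif(model)     # values near 1 indicate little multicollinearity; values above 5-10 are a concern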
After log-transforming lstat, does model performance improve? Explain. (1 point)

Yes, after log-transforming lstat (and medv), model performance improves. The adjusted R² of the log-transformed model is 0.7415, higher than the original model's adjusted R² of 0.6437. This indicates that the log-transformed model explains 74.15% of the variance in log median home value, while the original model explains only 64.37% of the variance in median home value (a rough comparison, since the two responses are on different scales). The log-transformed model therefore fits the data better.
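The adjusted R² values quoted above can be extracted directly from the two fits; a small sketch, assuming the objects are named model and log_model as above:

# Side-by-side adjusted R-squared for the original and log-transformed models
c(original = summary(model)$adj.r.squared,
  logged   = summary(log_model)$adj.r.squared)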
The RMSE from 10-fold cross-validation is about 5.49, essentially the same as the original model's residual standard error of 5.49. Because the out-of-sample error is no worse than the in-sample error, the model does not appear to be overfitting. Cross-validation repeatedly trains the model on a portion of the data and evaluates it on the held-out remainder, which guards against overfitting and gives a more honest estimate of the model's predictive performance on new data.
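Both numbers in that comparison can be pulled from the fitted objects; a sketch assuming the names model and cv_model used earlier:

sigma(model)            # in-sample residual standard error of the original fit (about 5.49)
cv_model$results$RMSE   # 10-fold cross-validated RMSE (about 5.49)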
Add dis (distance to employment centers) to the model. Does it improve fit? (1 point)
##
## Call:
## lm(formula = medv ~ rm + lstat + crim + dis, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.006 -3.099 -1.047 1.885 26.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.23065 3.32214 0.671 0.502
## rm 4.97649 0.43885 11.340 < 2e-16 ***
## lstat -0.66174 0.05101 -12.974 < 2e-16 ***
## crim -0.12810 0.03209 -3.992 7.53e-05 ***
## dis -0.56321 0.13542 -4.159 3.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.403 on 501 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.6549
## F-statistic: 240.6 on 4 and 501 DF, p-value: < 2.2e-16
## rm lstat crim dis
## 1.645020 2.295404 1.318232 1.406845
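Only the output is shown for this step; a sketch of code that would produce the summary and VIF values above, assuming the car package and an assumed object name model_dis:

model_dis <- lm(medv ~ rm + lstat + crim + dis, data = boston_data)   # add dis to the baseline predictors
summary(model_dis)
car::vif(model_dis)   # confirm that adding dis does not introduce serious multicollinearity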
Yes, adding the dis variable to the model improves the fit. The adjusted R² of the new model is 0.6549, which is higher than the original model's 0.6437. The new model explains more of the variance in median home values than the original model, so the fit improves.
##
## Call:
## lm(formula = medv ~ rm + lstat + crim + dis + age, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.024 -3.101 -1.004 1.847 27.474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.58935 3.38683 1.060 0.2897
## rm 5.10405 0.44261 11.532 < 2e-16 ***
## lstat -0.61926 0.05541 -11.176 < 2e-16 ***
## crim -0.13018 0.03202 -4.065 5.57e-05 ***
## dis -0.77733 0.17466 -4.450 1.06e-05 ***
## age -0.02738 0.01416 -1.933 0.0538 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.388 on 500 degrees of freedom
## Multiple R-squared: 0.6602, Adjusted R-squared: 0.6568
## F-statistic: 194.3 on 5 and 500 DF, p-value: < 2.2e-16
## rm lstat crim dis age
## 1.682412 2.723742 1.319717 2.353138 2.765585
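As with the previous step, only the output appears above; a sketch of the code behind it, with model_age as an assumed object name:

model_age <- lm(medv ~ rm + lstat + crim + dis + age, data = boston_data)   # add age on top of the dis model
summary(model_age)
car::vif(model_age)   # all VIFs remain below 3, so multicollinearity stays mild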
I added the age variable to the model because I assumed that the age of the housing stock has a potential relationship with median home value.
The coefficient of age is -0.0274, which means that for each one-unit increase in age, the median home value decreases by about 0.027 units on average, holding the other variables constant.
The p-value for age is 0.0538, slightly above the 0.05 threshold, so age is not statistically significant in predicting median home values at the conventional 5% level.
However, the adjusted R² of the new model is 0.6568, which is slightly higher than the 0.6549 of the previous model (with dis but without age), so the new model explains marginally more of the variance in median home values.
Overall, because the age variable is not statistically significant and adds only 0.0019 to the adjusted R², it may not be a good predictor of median home value, and I would consider removing it from the model.
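One way to formalize that decision is a partial F-test comparing the nested models with and without age; a sketch assuming the object names model_dis and model_age from the sketches above:

anova(model_dis, model_age)   # partial F-test: does age add explanatory power beyond rm, lstat, crim, dis?
# A p-value above 0.05 here would support dropping age from the model.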