Introduction

This assignment explores linear regression modeling using the Boston housing dataset, focusing on assumption checks and model interpretation. Students will fit a regression model, check its assumptions, and interpret the results in a structured, easy-to-grade format.
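The examples below assume that boston_data is the Boston housing data from the MASS package (the variable names medv, rm, lstat, crim, dis, and age match that dataset) and that ggplot2, dplyr, car, and caret are loaded. A minimal setup chunk under those assumptions might look like:

# Setup (assumed): load the Boston housing data and the packages used below
library(MASS)     # Boston dataset
library(ggplot2)  # scatter plots
library(dplyr)    # mutate() for transformations
library(car)      # vif() for multicollinearity checks
library(caret)    # train() / trainControl() for cross-validation

boston_data <- Boston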


1. Exploratory Data Analysis

1.1 Summary Statistics

summary(boston_data)

1.2 Scatter Plots for Relationship Exploration

ggplot(boston_data, aes(x = rm, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Relationship Between Number of Rooms and Median Home Value")


2. Fit a Linear Regression Model

The response variable is medv (median home value), and predictors include rm (average rooms per house), lstat (percentage of lower-income population), and crim (crime rate per capita).

model <- lm(medv ~ rm + lstat + crim, data = boston_data)
summary(model)
##
## Call:
## lm(formula = medv ~ rm + lstat + crim, data = boston_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.925  -3.567  -1.157   1.906  29.024
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.56225    3.16602  -0.809  0.41873
## rm           5.21695    0.44203  11.802  < 2e-16 ***
## lstat       -0.57849    0.04767 -12.135  < 2e-16 ***
## crim        -0.10294    0.03202  -3.215  0.00139 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.49 on 502 degrees of freedom
## Multiple R-squared:  0.6459, Adjusted R-squared:  0.6437
## F-statistic: 305.2 on 3 and 502 DF,  p-value: < 2.2e-16

3. Checking OLS Assumptions

3.1 Residual Diagnostics

par(mfrow = c(2, 2))
plot(model)

- Residuals vs Fitted Plot: If a pattern exists, non-linearity may be present.
- Q-Q Plot: Checks if residuals are normally distributed.
- Scale-Location Plot: Detects heteroscedasticity.
- Residuals vs Leverage: Identifies influential observations.
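As a complement to the visual diagnostics above, formal tests can be applied. The sketch below assumes the lmtest package is installed:

library(lmtest)
bptest(model)                   # Breusch-Pagan test for heteroscedasticity
shapiro.test(residuals(model))  # Shapiro-Wilk test for normality of residuals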

3.2 Multicollinearity Check

vif(model)
##       rm    lstat     crim
## 1.616468 1.941883 1.271372
  • VIF values above 5 indicate high multicollinearity.

4. Model Refinement: Log Transformations

If assumptions are violated, log-transforming variables can improve the model.

boston_data <- boston_data %>% mutate(
  log_medv = log(medv),
  log_lstat = log(lstat + 1)
)

log_model <- lm(log_medv ~ rm + log_lstat + crim, data = boston_data)
summary(log_model)
##
## Call:
## lm(formula = log_medv ~ rm + log_lstat + crim, data = boston_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.70866 -0.11815 -0.01845  0.11886  0.89289
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.587817   0.155975  23.003  < 2e-16 ***
## rm           0.101366   0.017589   5.763 1.44e-08 ***
## log_lstat   -0.464013   0.024455 -18.974  < 2e-16 ***
## crim        -0.011523   0.001178  -9.780  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2078 on 502 degrees of freedom
## Multiple R-squared:  0.743,  Adjusted R-squared:  0.7415
## F-statistic: 483.8 on 3 and 502 DF,  p-value: < 2.2e-16
  • Compare the R² of both models. Does the transformation improve fit?
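Because the two models use different response scales (medv vs. log(medv)), their R² values are not directly comparable. One option is to back-transform the log-model predictions and compare RMSE on the original scale (a sketch, ignoring the retransformation bias correction):

rmse_original <- sqrt(mean(residuals(model)^2))
rmse_back <- sqrt(mean((boston_data$medv - exp(predict(log_model)))^2))
c(original = rmse_original, log_transformed = rmse_back)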

5. Model Performance & Cross-Validation

set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(medv ~ rm + lstat + crim, data = boston_data, method = "lm", trControl = train_control)
cv_model
## Linear Regression
##
## 506 samples
##   3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 455, 456, 456, 456, 456, 456, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   5.487973  0.6455425  3.921115
##
## Tuning parameter 'intercept' was held constant at a value of TRUE

6. Homework Assignment (10 Points)

Answer the following questions and submit your answers in a well-organized R Markdown file.

Part 1: Model Interpretation (4 Points)

  1. What does the coefficient for rm mean in the original model? (1 point)
  • The coefficient for rm is 5.217, meaning that each additional room is associated with an increase of about 5.22 units in median home value (about $5,220, since medv is measured in $1,000s), holding the other predictors constant.
  2. How does lstat impact median home value, and why? (1 point)
  • The coefficient for lstat is negative (-0.578), meaning that the percentage of lower-income residents is negatively associated with median home value: each one-point increase in lstat is associated with a decrease of about 0.58 units in medv, holding the other predictors constant.
  • This negative relationship is expected: lower-income residents are less able to afford expensive houses, so they tend to live in neighborhoods with lower home values.
  3. Is crime rate (crim) statistically significant? Justify using the p-value. (1 point)
  • Yes, the crime rate crim is statistically significant.
  • The p-value for the crim coefficient is 0.00139, which is below the 0.05 significance level. This indicates that crime rate has a statistically significant association with median home value.
  4. How well does the original model explain home values (interpret R² and Adjusted R²)? (1 point)
  • The original model explains 64.6% of the variance in median home values, as indicated by the R² value (0.6459). The adjusted R², which penalizes for the number of predictors, is 0.6437, indicating that the model explains about 64.4% of the variance after this adjustment. Overall, the model fits reasonably well, but roughly one-third of the variation in home values remains unexplained.

Part 2: Assumption Checks & Model Improvement (4 Points)

  1. Based on the residual diagnostics, are there any violations of OLS assumptions? (1 point)
  • The Residuals vs Fitted plot shows a slight pattern, indicating that the relationship between the predictors and the response may not be fully linear. Ideally, the plot should show no pattern, with the red line lying flat along zero.

  • The Q-Q plot shows that the residuals are not perfectly normally distributed, especially in the upper tail. Aside from these outlying points, however, the residuals are approximately normal.

  • The Scale-Location plot shows that the residuals are generally homoscedastic, with a few outliers around fitted values of 20; otherwise the residuals are spread fairly evenly across the range of fitted values.

In summary, the model roughly meets the homoscedasticity and normality assumptions, but there is a slight violation of the linearity assumption. Multicollinearity is checked separately via the VIF test below.

  2. What does the VIF test indicate about multicollinearity? (1 point)
  • Based on the VIF test, there are no obvious multicollinearity issues in the model. All VIF values are below 5, indicating that the predictors are not highly correlated with one another.
  • This suggests that the model meets the OLS assumption of no multicollinearity among predictors.
  3. After log-transforming lstat, does model performance improve? Explain. (1 point)

Yes, after log-transforming lstat, model performance improves. The adjusted R² of the log-transformed model is 0.7415, higher than the original model's 0.6437: the log-transformed model explains 74.15% of the variance in logged median home values, while the original model explains only 64.37% of the variance in median home values. Note that the two R² values are computed on different response scales (medv vs. log(medv)), so the comparison is suggestive rather than exact, but the log-transformed model appears to fit the data better.

  4. Compare RMSE from cross-validation to the model’s residual standard error. Which suggests better predictive performance? (1 point)

The RMSE from 10-fold cross-validation is 5.488, while the model’s residual standard error is 5.49. The two values are nearly identical, which suggests the model is not overfitting. The cross-validated RMSE is nonetheless the more trustworthy measure of predictive performance, because it is computed on held-out data rather than on the same observations used to fit the model.
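The two quantities compared above can be pulled out programmatically (assuming cv_model was fit with caret as in Section 5):

sigma(model)           # in-sample residual standard error of the fitted model
cv_model$results$RMSE  # 10-fold cross-validated RMSE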

Part 3: Expanding the Model (2 Points)

  1. Add dis (distance to employment centers) to the model. Does it improve fit? (1 point)
model_dis <- lm(medv ~ rm + lstat + crim + dis, data = boston_data)
summary(model_dis)
##
## Call:
## lm(formula = medv ~ rm + lstat + crim + dis, data = boston_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.006  -3.099  -1.047   1.885  26.571
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.23065    3.32214   0.671    0.502
## rm           4.97649    0.43885  11.340  < 2e-16 ***
## lstat       -0.66174    0.05101 -12.974  < 2e-16 ***
## crim        -0.12810    0.03209  -3.992 7.53e-05 ***
## dis         -0.56321    0.13542  -4.159 3.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.403 on 501 degrees of freedom
## Multiple R-squared:  0.6577, Adjusted R-squared:  0.6549
## F-statistic: 240.6 on 4 and 501 DF,  p-value: < 2.2e-16
vif(model_dis)
##       rm    lstat     crim      dis
## 1.645020 2.295404 1.318232 1.406845

Yes, adding the dis variable improves the fit. The adjusted R² of the new model is 0.6549, higher than the original model's 0.6437, so the new model explains more of the variance in median home values. The coefficient on dis is also statistically significant (p = 3.76e-05), and all VIF values remain well below 5.

  2. Try another predictor from the dataset that you think might be relevant. Justify why you selected it and interpret its impact on the model. (1 point)
model_new <- lm(medv ~ rm + lstat + crim + dis + age, data = boston_data)
summary(model_new)
##
## Call:
## lm(formula = medv ~ rm + lstat + crim + dis + age, data = boston_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.024  -3.101  -1.004   1.847  27.474
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.58935    3.38683   1.060   0.2897
## rm           5.10405    0.44261  11.532  < 2e-16 ***
## lstat       -0.61926    0.05541 -11.176  < 2e-16 ***
## crim        -0.13018    0.03202  -4.065 5.57e-05 ***
## dis         -0.77733    0.17466  -4.450 1.06e-05 ***
## age         -0.02738    0.01416  -1.933   0.0538 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.388 on 500 degrees of freedom
## Multiple R-squared:  0.6602, Adjusted R-squared:  0.6568
## F-statistic: 194.3 on 5 and 500 DF,  p-value: < 2.2e-16
vif(model_new)
##       rm    lstat     crim      dis      age
## 1.682412 2.723742 1.319717 2.353138 2.765585
  • I added the age variable (the proportion of owner-occupied units built before 1940) to the model, because I assume that housing age has a potential relationship with median home value.

  • The coefficient of the age variable is -0.0274, which means that for each additional unit of age, the median home value decreases by about 0.027 units on average, holding the other variables constant. The p-value for age is 0.0538, slightly above the 0.05 threshold, so age is not statistically significant at the 5% level (though it is at the 10% level).

  • However, the adjusted R² of the new model is 0.6568, slightly higher than the previous model's 0.6549, so the new model explains marginally more of the variance in median home values.

Overall, since the age variable is not statistically significant and adds only 0.0019 to the adjusted R², it may not be a useful predictor of median home value, and I would consider removing it from the model.