Comparing Linear, Ridge, and Lasso Regression Models
Author
Zhanchao Yang
Introduction
This document demonstrates the application of Population Dynamics Foundation Model (PDFM) embeddings to predict Zillow Home Value Index (ZHVI) data. PDFM embeddings are 330-dimensional vector representations that capture complex spatial and demographic patterns.
Objectives
Load and visualize Zillow Home Value Index data with geospatial mapping
Join PDFM embeddings with home value data
Build regression models to predict home values:
Linear Regression (baseline)
Ridge Regression (L2 regularization)
Lasso Regression (L1 regularization with feature selection)
Evaluate and visualize model performance
Why Ridge and Lasso Regression?
Ridge and Lasso regression are particularly useful with high-dimensional embeddings (330 features), where an unregularized linear model can easily overfit:
Ridge Regression (L2):
Penalizes large coefficients to prevent overfitting
Keeps all features but shrinks their impact
Performs well when many features contribute to the outcome
Lasso Regression (L1):
Performs automatic feature selection by shrinking some coefficients to zero
Identifies the most important embedding dimensions
Produces sparse models that are easier to interpret
Both:
Have a regularization parameter (lambda) that is tuned with cross-validation
Are typically cheaper to fit than stepwise regression over 330 candidate features
Tend to generalize better to unseen data than an unregularized fit when features are numerous or correlated
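The contrast between the two penalties can be sketched on synthetic data. This is a minimal illustration rather than the model fitted in this document: the matrix x, the response y, and the choice of 50 features with 5 true signals are all assumptions made up for the example. In glmnet, alpha = 0 selects the ridge penalty and alpha = 1 selects the lasso penalty, and cv.glmnet() chooses lambda by cross-validation.

```r
library(glmnet)

set.seed(42)

# Synthetic stand-in for an embedding matrix (assumption for illustration):
# 200 rows, 50 features, only the first 5 of which drive the response.
n <- 200
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
true_beta <- c(rep(2, 5), rep(0, p - 5))
y <- as.numeric(x %*% true_beta + rnorm(n))

# alpha = 0 is the ridge (L2) penalty; alpha = 1 is the lasso (L1) penalty.
# cv.glmnet() cross-validates over a grid of lambda values (10 folds by default).
ridge_cv <- cv.glmnet(x, y, alpha = 0)
lasso_cv <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the lambda with the lowest cross-validated error,
# with the intercept (first row) dropped.
ridge_coefs <- as.matrix(coef(ridge_cv, s = "lambda.min"))[-1, 1]
lasso_coefs <- as.matrix(coef(lasso_cv, s = "lambda.min"))[-1, 1]

# Ridge shrinks coefficients but keeps every feature;
# lasso sets some coefficients exactly to zero.
cat("Non-zero ridge coefficients:", sum(ridge_coefs != 0), "of", p, "\n")
cat("Non-zero lasso coefficients:", sum(lasso_coefs != 0), "of", p, "\n")
```

Passing s = "lambda.1se" instead of "lambda.min" selects a larger penalty within one standard error of the best cross-validated error, which usually gives an even sparser lasso model.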
Setup and Data Loading
Load Required Libraries
# Data manipulation and analysis
library(tidyverse)
library(readr)

# Geospatial data handling
library(sf)
library(leaflet)

# Machine learning and modeling
library(caret)
library(MASS)    # For stepwise regression
library(glmnet)  # For Ridge and Lasso regression

# Model evaluation
library(Metrics)
library(ggplot2)

# For table formatting
library(knitr)
library(kableExtra)

# Set random seed for reproducibility
set.seed(42)
# Construct state and municipal FIPS codes with leading zeros
zhvi_df <- zhvi %>%
  mutate(
    StateCodeFIP = str_pad(as.character(StateCodeFIPS), width = 2, side = "left", pad = "0"),
    MunicipalCodeFIP = str_pad(as.character(MunicipalCodeFIPS), width = 3, side = "left", pad = "0")
  )

# Create place identifier
zhvi_df <- zhvi_df %>%
  mutate(
    place = paste0("geoId/", StateCodeFIP, MunicipalCodeFIP)
  )
Note: The place column is a unique identifier for each county, built by combining the zero-padded state and municipal FIPS codes. It is the key used to join the home value data with the geospatial and embedding data.
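A small worked example may help show why the zero padding matters. The three rows below are hypothetical values, not taken from the ZHVI file; they only illustrate how numeric codes that lost their leading zeros on import are rebuilt into fixed-width identifiers.

```r
library(stringr)

# Hypothetical raw codes as they might look after a CSV import,
# where numeric parsing has dropped the leading zeros.
toy <- data.frame(
  StateCodeFIPS = c(6, 42, 1),
  MunicipalCodeFIPS = c(1, 101, 73)
)

# Pad to the fixed FIPS widths: 2 digits for the state, 3 for the county.
state_fips <- str_pad(as.character(toy$StateCodeFIPS), width = 2, side = "left", pad = "0")
county_fips <- str_pad(as.character(toy$MunicipalCodeFIPS), width = 3, side = "left", pad = "0")

# Build the identifier in the same "geoId/<state><county>" form used above.
place <- paste0("geoId/", state_fips, county_fips)
print(place)
```

Without the padding, state 6 and county 1 would produce "geoId/61", which would not match a key written as "geoId/06001" and would silently drop that row from an inner join.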
# Select specific date column for visualization
target_date <- "X2024.10.31"
viz_gdf <- zhvi_county_gdf %>%
  dplyr::select(RegionName, State, all_of(target_date), geometry)
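The form of the column name "X2024.10.31" is consistent with a date header such as 2024-10-31 having been read by base R's read.csv(), which by default rewrites headers into syntactically valid names. The loading step is not shown here, so this is an assumption; the sketch below uses a hypothetical two-column CSV held in memory to show the renaming.

```r
# make.names() is the rule base R applies to headers when check.names = TRUE.
raw_header <- "2024-10-31"
mangled <- make.names(raw_header)
print(mangled)

# Hypothetical in-memory CSV with a date-valued column header.
csv_text <- "RegionName,2024-10-31\nExample County,250000"

# Default behaviour: the header is rewritten into a valid R name.
df_base <- read.csv(text = csv_text)
print(names(df_base))

# check.names = FALSE keeps the header exactly as it appears in the file.
df_raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(df_raw))
```

If the data were instead loaded with readr::read_csv(), the header would stay as "2024-10-31" and target_date would need to match that form.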