Zillow Home Value Index Analysis with PDFM Embeddings

Comparing Linear, Ridge, and Lasso Regression Models

Author

Zhanchao Yang

Introduction

This document demonstrates the application of Population Dynamics Foundation Model (PDFM) embeddings to predict Zillow Home Value Index (ZHVI) data. PDFM embeddings are 330-dimensional vector representations that capture complex spatial and demographic patterns.

Objectives

  1. Load and visualize Zillow Home Value Index data with geospatial mapping
  2. Join PDFM embeddings with home value data
  3. Build regression models to predict home values:
    • Linear Regression (baseline)
    • Ridge Regression (L2 regularization)
    • Lasso Regression (L1 regularization with feature selection)
  4. Evaluate and visualize model performance

Why Ridge and Lasso Regression?

Ridge and Lasso regression are particularly useful when working with high-dimensional embeddings (330 features) because they:

  • Ridge Regression (L2):
    • Penalizes large coefficients to prevent overfitting
    • Keeps all features but shrinks their impact
    • Performs well when many features contribute to the outcome
  • Lasso Regression (L1):
    • Performs automatic feature selection by shrinking some coefficients to zero
    • Identifies the most important embedding dimensions
    • Produces sparse models that are easier to interpret
  • Both:
    • Use cross-validation to optimize the regularization parameter (lambda)
    • Are more computationally efficient than stepwise regression
    • Provide better generalization on unseen data
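The contrast can be seen on a toy problem (synthetic data; assumes the glmnet package is installed): when only a few features truly matter, Lasso zeroes most coefficients out, while Ridge retains every feature with shrunken weights.

```r
library(glmnet)

set.seed(1)
# Synthetic data: 200 observations, 50 features, only the first 5 predictive
x <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
y <- as.vector(x[, 1:5] %*% rep(2, 5) + rnorm(200))

ridge <- cv.glmnet(x, y, alpha = 0)  # L2 penalty
lasso <- cv.glmnet(x, y, alpha = 1)  # L1 penalty

# Non-zero coefficients at the cross-validated lambda (intercept excluded)
sum(coef(ridge, s = "lambda.min") != 0) - 1  # all 50 features retained
sum(coef(lasso, s = "lambda.min") != 0) - 1  # a much smaller subset
```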

Setup and Data Loading

Load Required Libraries

Code
# Data manipulation and analysis
library(tidyverse)
library(readr)

# Geospatial data handling
library(sf)
library(leaflet)

# Machine learning and modeling
library(caret)
library(MASS)  # For stepwise regression (note: MASS masks dplyr::select, hence the dplyr:: prefixes below)
library(glmnet)  # For Ridge and Lasso regression

# Model evaluation
library(Metrics)
library(ggplot2)

# For table formatting
library(knitr)
library(kableExtra)

# Set random seed for reproducibility
set.seed(42)

Download Zillow Home Value Index Data

Code
# Download ZHVI data
zhvi <- read.csv("https://github.com/opengeos/datasets/releases/download/us/zillow_home_value_index_by_county.csv")

Load and Prepare ZHVI Data

Code
# Construct State FIPS and Municipal FIPS codes with leading zeros
zhvi_df <- zhvi %>%
  mutate(
    StateCodeFIP = str_pad(as.character(StateCodeFIPS), width = 2, side = "left", pad = "0"),
    MunicipalCodeFIP = str_pad(as.character(MunicipalCodeFIPS), width = 3, side = "left", pad = "0")
  )
# Create place identifier
zhvi_df <- zhvi_df %>%
  mutate(
    place = paste0("geoId/", StateCodeFIP, MunicipalCodeFIP)
  )

Note: The place column creates a unique identifier for each county by combining state and municipal FIPS codes, which will be used to join with geospatial and embedding data.
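As a worked example (Autauga County, AL has state FIPS 1 and county FIPS 1), the padding and concatenation behave like this:

```r
library(stringr)

# Raw FIPS codes arrive as integers, so leading zeros are lost
state_fips     <- 1   # Alabama
municipal_fips <- 1   # Autauga County

state_str     <- str_pad(as.character(state_fips), width = 2, side = "left", pad = "0")
municipal_str <- str_pad(as.character(municipal_fips), width = 3, side = "left", pad = "0")
paste0("geoId/", state_str, municipal_str)  # "geoId/01001"
```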

Geospatial Data Integration

Load County Geometries

Code
county_gdf <- st_read("https://github.com/zyang91/Google-Embedding-tutorial/releases/download/v2.0.0/county.geojson", quiet = TRUE)

county_gdf <- county_gdf %>%
  dplyr::select(place)

Join ZHVI with County Geometries

Code
zhvi_county_gdf <- county_gdf %>%
  inner_join(
    zhvi_df,
    by = c("place" = "place")
  )

Visualizing Home Values

Prepare Data for Visualization

Code
# Select specific date column for visualization
target_date <- "X2024.10.31"
viz_gdf <- zhvi_county_gdf %>%
  dplyr::select(RegionName, State, all_of(target_date), geometry)

Create 2D Choropleth Map

Code
# Create interactive map with Leaflet
pal <- colorNumeric(
  palette = "Blues",
  domain = viz_gdf[[target_date]],
  na.color = "transparent"
)

leaflet(viz_gdf) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal(get(target_date)),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>", RegionName, ", ", State, "</strong><br>",
      "Home Value: $", format(get(target_date), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~get(target_date),
    title = "Zillow Home Median Value",
    opacity = 1
  )

Note: This creates an interactive map where users can hover over counties to see home values. The blue color gradient represents the magnitude of home values.

PDFM Embeddings Integration

Load PDFM Embeddings

Code
# Load pre-computed PDFM embeddings
embeddings <- read_csv("https://github.com/zyang91/Google-Embedding-tutorial/releases/download/v2.0.0/county_embeddings.csv")

About PDFM Embeddings: These 330-dimensional vectors encode complex spatial patterns including:

  • Population mobility patterns
  • Search behavior trends
  • Local economic activity indicators
  • Environmental conditions
  • Demographic characteristics

Visualize Single Embedding Feature

Code
# Join embeddings with county geometries
df_embed <- county_gdf %>%
  inner_join(embeddings,
    by = "place"
  )

# Select one embedding feature to visualize
feature_col <- "feature329"
viz_embed <- df_embed %>%
  dplyr::select(state, all_of(feature_col), geometry)
Code
# Create map
pal_embed <- colorNumeric(
  palette = "Blues",
  domain = viz_embed[[feature_col]],
  na.color = "transparent"
)

leaflet(viz_embed) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_embed(get(feature_col)),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>", state, "</strong><br>",
      feature_col, ": ", round(get(feature_col), 4)
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_embed,
    values = ~get(feature_col),
    title = feature_col,
    opacity = 1
  )

Note: Each of the 330 embedding features captures a different spatial pattern; feature329 is visualized here as an example.

Regression Modeling

Prepare Training Data

Code
# Join ZHVI with embeddings
data <- zhvi_df %>%
  inner_join(
    embeddings,
    by = "place"
  )

# Define embedding features and target variable
embedding_features <- paste0("feature", 0:329)
target_label <- "X2024.10.31"

# Remove rows with missing target values
data <- data %>%
  filter(!is.na(get(target_label)))

# Select only features and target for modeling
modeling_data <- data %>%
  dplyr::select(all_of(c(embedding_features, target_label)))
modeling_data <- modeling_data %>%
  mutate(index = row_number())
# Split into training and testing sets (80/20 split)
train_indices <- createDataPartition(modeling_data$index, p = 0.8, list = FALSE)
train_data <- modeling_data[train_indices, ]
test_data <- modeling_data[-train_indices, ]

Training set: 2431 counties; test set: 604 counties.
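The split itself is plain random sampling of rows; a base-R sketch of the same 80/20 idea (caret's createDataPartition additionally stratifies on the supplied outcome):

```r
set.seed(42)
n   <- 3035                               # total counties after the join (2431 + 604)
idx <- sample(seq_len(n), size = round(0.8 * n))

length(idx)      # training rows (~80%)
n - length(idx)  # held-out test rows (~20%)
```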

Model 1: Linear Regression (Baseline)

Code
# Drop the helper index column and rename the target for modeling
train_data <- train_data %>%
  dplyr::select(-index)

train_data <- train_data %>%
  rename(target = X2024.10.31)

# Fit linear regression model using all features
lr_model <- lm(target ~ ., data = train_data)

summary(lr_model)

Call:
lm(formula = target ~ ., data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max
-376020  -29025    -988   27176  758839

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.923e+05  3.417e+04   8.554  < 2e-16 ***
feature0    -6.484e+04  4.664e+04  -1.390 0.164591
feature1    -1.890e+04  5.206e+03  -3.629 0.000291 ***
feature2    -9.262e+03  6.099e+03  -1.519 0.129003
feature3    -1.944e+02  5.940e+03  -0.033 0.973901
feature4    -1.000e+03  5.509e+03  -0.182 0.855949
feature5    -3.751e+03  4.974e+03  -0.754 0.450865
feature6    -5.508e+03  5.543e+03  -0.994 0.320564
feature7    -1.079e+04  5.613e+03  -1.922 0.054740 .
feature8    -2.761e+03  5.373e+03  -0.514 0.607335
feature9    -1.680e+04  5.227e+03  -3.215 0.001326 **
feature10    1.849e+03  5.412e+03   0.342 0.732617
feature11    2.269e+03  5.422e+03   0.418 0.675678
feature12    2.725e+04  5.481e+03   4.971 7.19e-07 ***
feature13    2.693e+04  5.759e+03   4.677 3.09e-06 ***
feature14    2.102e+03  5.674e+03   0.370 0.711086
feature15    2.586e+04  5.688e+03   4.545 5.80e-06 ***
feature16    1.292e+04  6.260e+03   2.064 0.039172 *
feature17   -1.871e+04  4.968e+03  -3.765 0.000171 ***
feature18   -1.067e+04  5.602e+03  -1.905 0.056909 .
feature19   -2.534e+04  5.272e+03  -4.806 1.65e-06 ***
feature20   -6.973e+03  4.941e+03  -1.411 0.158381
feature21   -1.663e+04  5.008e+03  -3.321 0.000912 ***
feature22    9.916e+03  5.768e+03   1.719 0.085748 .
feature23    3.453e+04  5.701e+03   6.056 1.65e-09 ***
feature24   -1.408e+04  4.706e+04  -0.299 0.764823
feature25    2.313e+03  5.249e+03   0.441 0.659473
feature26   -7.118e+02  5.816e+03  -0.122 0.902602
feature27    1.906e+03  5.959e+03   0.320 0.749089
feature28    1.033e+04  6.071e+03   1.701 0.089122 .
feature29    1.481e+04  5.775e+03   2.564 0.010431 *
feature30   -6.455e+03  5.632e+03  -1.146 0.251873
feature31   -3.879e+02  5.539e+03  -0.070 0.944175
feature32    6.302e+03  5.052e+03   1.247 0.212370
feature33    3.139e+04  5.408e+03   5.803 7.49e-09 ***
feature34   -3.321e+03  5.220e+03  -0.636 0.524703
feature35    2.918e+04  6.084e+03   4.797 1.73e-06 ***
feature36    1.710e+03  5.249e+03   0.326 0.744653
feature37   -5.855e+03  5.742e+03  -1.020 0.308033
feature38   -1.128e+04  5.782e+03  -1.951 0.051186 .
feature39   -2.087e+03  4.609e+03  -0.453 0.650734
feature40   -1.478e+04  4.830e+03  -3.060 0.002242 **
feature41   -2.229e+04  5.793e+03  -3.848 0.000123 ***
feature42   -9.822e+03  5.209e+03  -1.886 0.059497 .
feature43   -3.839e+03  5.325e+03  -0.721 0.471002
feature44    2.298e+04  5.672e+03   4.052 5.26e-05 ***
feature45   -1.500e+04  5.647e+03  -2.656 0.007962 **
feature46   -9.565e+02  5.391e+03  -0.177 0.859182
feature47    9.350e+03  5.738e+03   1.630 0.103354
feature48   -2.983e+03  5.205e+03  -0.573 0.566709
feature49   -4.090e+03  5.177e+03  -0.790 0.429584
feature50    1.425e+04  5.577e+03   2.555 0.010682 *
feature51    5.759e+03  5.474e+03   1.052 0.292947
feature52   -1.329e+04  5.300e+03  -2.508 0.012231 *
feature53    1.104e+04  5.498e+03   2.008 0.044747 *
feature54   -4.584e+03  5.339e+03  -0.859 0.390698
feature55   -1.421e+03  5.824e+03  -0.244 0.807308
feature56   -9.396e+03  5.481e+03  -1.714 0.086634 .
feature57   -3.887e+04  5.274e+03  -7.371 2.42e-13 ***
feature58   -6.546e+03  5.264e+03  -1.244 0.213803
feature59   -1.552e+04  5.285e+03  -2.937 0.003347 **
feature60    1.546e+04  5.352e+03   2.888 0.003912 **
feature61    1.253e+04  5.852e+03   2.140 0.032442 *
feature62    7.411e+02  5.660e+03   0.131 0.895836
feature63    3.691e+03  5.993e+03   0.616 0.538064
feature64    2.213e+03  5.341e+03   0.414 0.678596
feature65   -2.288e+04  5.608e+03  -4.080 4.67e-05 ***
feature66   -1.797e+04  5.094e+03  -3.528 0.000428 ***
feature67    1.293e+04  5.624e+03   2.299 0.021606 *
feature68   -1.094e+04  5.291e+03  -2.068 0.038740 *
feature69    5.633e+03  5.360e+03   1.051 0.293468
feature70    3.634e+03  5.024e+03   0.723 0.469615
feature71   -2.033e+04  5.520e+03  -3.683 0.000236 ***
feature72    1.429e+03  5.157e+03   0.277 0.781714
feature73   -1.893e+04  5.822e+03  -3.252 0.001166 **
feature74   -1.495e+04  5.001e+03  -2.989 0.002830 **
feature75   -3.786e+03  5.072e+03  -0.747 0.455404
feature76    1.450e+05  4.749e+04   3.053 0.002293 **
feature77    4.332e+04  5.854e+03   7.399 1.97e-13 ***
feature78    1.631e+04  5.466e+03   2.984 0.002877 **
feature79    2.559e+04  5.962e+03   4.292 1.85e-05 ***
feature80   -1.601e+03  5.732e+03  -0.279 0.780022
feature81    3.645e+04  5.842e+03   6.240 5.28e-10 ***
feature82    3.046e+04  5.481e+03   5.558 3.08e-08 ***
feature83   -4.520e+03  5.843e+03  -0.774 0.439251
feature84   -1.994e+04  5.488e+03  -3.634 0.000286 ***
feature85    2.423e+04  5.692e+03   4.257 2.17e-05 ***
feature86    1.020e+04  6.270e+03   1.626 0.104003
feature87   -3.117e+04  5.528e+03  -5.638 1.95e-08 ***
feature88   -8.138e+03  5.295e+03  -1.537 0.124499
feature89   -1.740e+04  4.782e+03  -3.638 0.000281 ***
feature90   -3.430e+03  6.131e+03  -0.559 0.575896
feature91    3.688e+04  5.434e+03   6.788 1.47e-11 ***
feature92   -1.167e+04  5.197e+03  -2.246 0.024787 *
feature93   -3.999e+03  5.586e+03  -0.716 0.474151
feature94    3.826e+04  5.701e+03   6.712 2.46e-11 ***
feature95   -3.883e+04  6.359e+03  -6.107 1.21e-09 ***
feature96   -1.535e+04  5.856e+03  -2.621 0.008823 **
feature97    2.035e+03  5.889e+03   0.346 0.729740
feature98   -1.053e+04  5.896e+03  -1.787 0.074162 .
feature99    7.540e+03  5.979e+03   1.261 0.207426
feature100  -1.756e+04  5.719e+03  -3.070 0.002167 **
feature101   1.548e+04  5.441e+03   2.846 0.004473 **
feature102   1.352e+04  5.656e+03   2.390 0.016946 *
feature103   2.439e+04  5.725e+03   4.260 2.14e-05 ***
feature104  -2.295e+04  5.329e+03  -4.306 1.74e-05 ***
feature105  -1.771e+03  5.465e+03  -0.324 0.745878
feature106  -7.883e+03  5.229e+03  -1.508 0.131805
feature107  -1.385e+04  5.449e+03  -2.542 0.011097 *
feature108   4.519e+04  6.781e+03   6.664 3.39e-11 ***
feature109  -1.057e+04  5.632e+03  -1.877 0.060636 .
feature110   4.495e+03  5.690e+03   0.790 0.429661
feature111   2.867e+04  5.947e+03   4.821 1.53e-06 ***
feature112   5.693e+03  5.667e+03   1.005 0.315212
feature113  -4.713e+03  5.597e+03  -0.842 0.399782
feature114   4.099e+03  5.926e+03   0.692 0.489179
feature115  -4.339e+03  5.502e+03  -0.789 0.430451
feature116   6.155e+03  5.765e+03   1.068 0.285803
feature117  -5.910e+03  4.294e+03  -1.376 0.168814
feature118   1.899e+03  5.149e+03   0.369 0.712390
feature119  -3.195e+03  5.244e+03  -0.609 0.542345
feature120  -1.044e+03  5.787e+03  -0.180 0.856806
feature121  -6.598e+03  5.689e+03  -1.160 0.246281
feature122  -3.457e+04  6.160e+03  -5.613 2.25e-08 ***
feature123  -4.675e+03  5.486e+03  -0.852 0.394148
feature124  -2.770e+04  5.094e+03  -5.437 6.06e-08 ***
feature125   3.368e+04  6.147e+03   5.478 4.81e-08 ***
feature126   1.053e+04  5.869e+03   1.793 0.073063 .
feature127   2.075e+04  5.508e+03   3.767 0.000170 ***
feature128  -2.126e+01  4.490e+03  -0.005 0.996223
feature129   1.579e+04  1.184e+04   1.333 0.182739
feature130  -5.363e+03  4.978e+03  -1.077 0.281466
feature131  -3.811e+03  3.341e+03  -1.141 0.254195
feature132  -1.194e+04  8.623e+03  -1.385 0.166323
feature133   5.748e+03  5.745e+03   1.000 0.317201
feature134  -2.098e+03  7.242e+04  -0.029 0.976895
feature135   7.097e+03  1.288e+04   0.551 0.581683
feature136  -2.899e+03  1.341e+04  -0.216 0.828872
feature137  -8.275e+04  7.399e+04  -1.118 0.263526
feature138   1.743e+05  6.397e+04   2.725 0.006479 **
feature139   5.676e+04  6.966e+04   0.815 0.415254
feature140   4.231e+04  1.256e+04   3.369 0.000768 ***
feature141  -9.457e+03  1.035e+04  -0.913 0.361210
feature142   5.172e+03  6.141e+03   0.842 0.399777
feature143  -9.761e+03  9.786e+03  -0.997 0.318652
feature144   5.040e+04  1.237e+04   4.073 4.81e-05 ***
feature145  -1.329e+05  5.913e+04  -2.247 0.024725 *
feature146   1.159e+04  1.070e+04   1.082 0.279169
feature147  -9.822e+02  4.805e+03  -0.204 0.838074
feature148   1.240e+04  7.181e+03   1.727 0.084338 .
feature149   4.676e+02  5.855e+03   0.080 0.936350
feature150   6.264e+04  5.951e+04   1.053 0.292622
feature151   4.345e+01  3.146e+03   0.014 0.988980
feature152  -8.972e+04  6.556e+04  -1.368 0.171318
feature153  -4.021e+04  1.546e+04  -2.601 0.009357 **
feature154  -2.750e+02  1.146e+04  -0.024 0.980862
feature155   9.481e+03  6.966e+03   1.361 0.173658
feature156   2.138e+03  7.523e+03   0.284 0.776314
feature157   1.342e+04  1.204e+04   1.114 0.265277
feature158  -9.103e+08  3.537e+09  -0.257 0.796893
feature159  -2.136e+04  1.403e+04  -1.522 0.128133
feature160  -3.596e+03  3.839e+03  -0.937 0.348997
feature161  -7.898e+01  6.365e+03  -0.012 0.990102
feature162  -1.023e+04  1.032e+04  -0.991 0.321875
feature163   5.567e+04  5.363e+04   1.038 0.299349
feature164  -7.599e+03  3.745e+03  -2.029 0.042607 *
feature165  -2.290e+04  7.699e+03  -2.974 0.002972 **
feature166   9.422e+02  5.764e+04   0.016 0.986959
feature167   1.304e+04  5.708e+04   0.228 0.819362
feature168   1.001e+05  6.673e+04   1.500 0.133863
feature169   2.110e+05  6.573e+04   3.210 0.001348 **
feature170   1.004e+05  5.518e+04   1.820 0.068971 .
feature171  -1.818e+04  1.071e+04  -1.698 0.089747 .
feature172  -1.067e+05  6.661e+04  -1.601 0.109450
feature173  -1.254e+04  5.471e+03  -2.292 0.021981 *
feature174  -6.760e+03  6.325e+03  -1.069 0.285276
feature175   1.317e+05  7.704e+04   1.710 0.087441 .
feature176  -2.983e+03  4.591e+03  -0.650 0.516003
feature177   1.624e+03  2.954e+03   0.550 0.582632
feature178   1.114e+04  1.027e+04   1.085 0.278185
feature179   1.347e+03  7.355e+03   0.183 0.854706
feature180  -3.947e+03  9.036e+03  -0.437 0.662332
feature181  -2.386e+04  6.382e+03  -3.739 0.000190 ***
feature182  -4.717e+03  8.251e+03  -0.572 0.567602
feature183   1.562e+03  3.317e+03   0.471 0.637671
feature184   1.546e+05  6.194e+04   2.496 0.012626 *
feature185  -7.612e+02  5.275e+03  -0.144 0.885278
feature186   5.186e+03  5.944e+03   0.873 0.382994
feature187   1.887e+04  9.275e+03   2.035 0.041977 *
feature188  -2.257e+03  5.644e+03  -0.400 0.689314
feature189   2.148e+03  5.713e+03   0.376 0.707012
feature190  -6.345e+03  1.668e+04  -0.380 0.703668
feature191  -9.327e+04  5.730e+04  -1.628 0.103732
feature192   9.891e+04  8.122e+04   1.218 0.223458
feature193   7.396e+03  6.077e+03   1.217 0.223749
feature194  -1.573e+05  5.810e+04  -2.708 0.006824 **
feature195   6.023e+03  5.212e+03   1.156 0.247937
feature196   2.299e+05  7.168e+04   3.208 0.001358 **
feature197   4.481e+04  5.742e+04   0.780 0.435231
feature198  -1.499e+02  9.767e+03  -0.015 0.987756
feature199  -7.827e+04  5.342e+04  -1.465 0.143052
feature200   3.919e+03  6.045e+03   0.648 0.516842
feature201  -1.371e+04  1.810e+04  -0.757 0.448931
feature202   1.300e+04  1.412e+04   0.921 0.357398
feature203  -1.620e+04  5.272e+04  -0.307 0.758642
feature204  -1.700e+03  7.334e+03  -0.232 0.816778
feature205  -2.272e+04  5.860e+04  -0.388 0.698301
feature206   9.101e+04  6.805e+04   1.337 0.181262
feature207   5.414e+04  8.522e+03   6.352 2.59e-10 ***
feature208   1.615e+04  6.543e+04   0.247 0.805118
feature209  -4.616e+04  1.335e+04  -3.458 0.000555 ***
feature210  -6.910e+03  9.694e+03  -0.713 0.476057
feature211   2.386e+03  1.503e+04   0.159 0.873891
feature212  -3.547e+04  1.527e+04  -2.323 0.020285 *
feature213  -1.713e+05  5.824e+04  -2.941 0.003312 **
feature214  -3.778e+03  5.138e+03  -0.735 0.462259
feature215  -1.127e+03  5.238e+03  -0.215 0.829712
feature216  -3.234e+04  6.384e+04  -0.507 0.612515
feature217  -9.149e+03  4.123e+03  -2.219 0.026602 *
feature218  -2.571e+04  1.028e+04  -2.501 0.012444 *
feature219   8.461e+03  1.403e+04   0.603 0.546586
feature220   3.763e+03  1.013e+04   0.371 0.710341
feature221  -1.667e+04  2.381e+04  -0.700 0.483959
feature222   2.652e+04  1.241e+04   2.136 0.032775 *
feature223   7.303e+04  6.008e+04   1.215 0.224340
feature224  -7.310e+04  8.271e+04  -0.884 0.376864
feature225  -3.368e+04  1.483e+04  -2.272 0.023210 *
feature226  -1.314e+05  5.742e+04  -2.289 0.022204 *
feature227  -8.502e+03  1.225e+04  -0.694 0.487563
feature228  -9.595e+03  7.501e+03  -1.279 0.200999
feature229   8.698e+03  8.375e+04   0.104 0.917285
feature230   8.051e+04  1.557e+04   5.172 2.53e-07 ***
feature231   1.781e+04  1.614e+04   1.103 0.270096
feature232   1.110e+02  6.524e+03   0.017 0.986421
feature233  -2.238e+04  2.118e+04  -1.056 0.290870
feature234   2.052e+04  1.825e+04   1.124 0.261094
feature235  -1.374e+04  4.907e+03  -2.799 0.005166 **
feature236  -2.629e+03  9.922e+03  -0.265 0.791029
feature237  -4.339e+03  4.707e+03  -0.922 0.356798
feature238   3.852e+03  6.369e+04   0.060 0.951786
feature239  -1.894e+04  1.028e+04  -1.843 0.065541 .
feature240  -8.501e+03  5.459e+03  -1.557 0.119549
feature241   5.802e+03  6.435e+03   0.902 0.367386
feature242  -6.039e+03  7.393e+03  -0.817 0.414139
feature243   3.022e+04  6.481e+04   0.466 0.641033
feature244   2.110e+04  1.107e+04   1.906 0.056781 .
feature245   3.703e+04  5.322e+04   0.696 0.486665
feature246  -6.135e+03  8.861e+03  -0.692 0.488770
feature247  -1.546e+05  6.139e+04  -2.518 0.011892 *
feature248   4.183e+03  5.065e+03   0.826 0.408944
feature249   1.015e+04  1.532e+04   0.663 0.507507
feature250   3.404e+04  1.347e+04   2.527 0.011570 *
feature251  -4.872e+03  9.242e+03  -0.527 0.598160
feature252  -3.918e+04  1.377e+04  -2.845 0.004491 **
feature253   1.164e+05  6.168e+04   1.888 0.059203 .
feature254  -1.729e+05  7.181e+04  -2.407 0.016154 *
feature255  -9.907e+03  6.626e+04  -0.150 0.881151
feature256   1.004e+03  2.544e+03   0.394 0.693312
feature257   3.717e+03  3.254e+03   1.142 0.253439
feature258  -2.176e+03  3.304e+03  -0.659 0.510255
feature259   1.809e+03  3.847e+03   0.470 0.638314
feature260   7.726e+02  3.490e+03   0.221 0.824835
feature261  -1.868e+03  2.728e+03  -0.685 0.493494
feature262  -8.123e+03  3.373e+03  -2.408 0.016105 *
feature263   8.999e+03  3.941e+03   2.283 0.022506 *
feature264  -6.545e+03  3.539e+03  -1.850 0.064503 .
feature265   3.401e+03  3.532e+03   0.963 0.335796
feature266  -8.592e+03  3.202e+03  -2.684 0.007340 **
feature267   5.477e+04  5.483e+04   0.999 0.317957
feature268   9.102e+02  3.822e+03   0.238 0.811766
feature269  -9.589e+03  3.967e+03  -2.417 0.015731 *
feature270   4.323e+03  3.908e+03   1.106 0.268770
feature271   1.410e+03  2.833e+03   0.498 0.618650
feature272   3.846e+03  3.838e+03   1.002 0.316448
feature273   1.772e+03  3.687e+03   0.481 0.630913
feature274   2.348e+03  3.554e+03   0.661 0.508825
feature275  -3.518e+03  3.437e+03  -1.024 0.306149
feature276   9.059e+02  3.231e+03   0.280 0.779242
feature277   5.254e+03  2.862e+03   1.836 0.066550 .
feature278   3.607e+03  3.234e+03   1.115 0.264834
feature279  -6.503e+03  3.908e+03  -1.664 0.096272 .
feature280   2.292e+03  2.957e+03   0.775 0.438402
feature281  -4.644e+03  3.702e+03  -1.255 0.209744
feature282  -2.065e+03  3.490e+03  -0.592 0.554038
feature283   2.242e+03  3.732e+03   0.601 0.548041
feature284   4.647e+03  3.343e+03   1.390 0.164673
feature285   3.772e+02  3.212e+03   0.117 0.906533
feature286   3.963e+03  3.169e+03   1.251 0.211194
feature287  -2.063e+03  1.536e+03  -1.343 0.179493
feature288  -6.668e+03  2.911e+03  -2.290 0.022104 *
feature289  -1.099e+04  2.774e+03  -3.963 7.66e-05 ***
feature290  -5.420e+03  2.889e+03  -1.876 0.060838 .
feature291   8.285e+03  3.339e+03   2.481 0.013182 *
feature292   4.481e+03  4.070e+03   1.101 0.270970
feature293  -5.847e+03  3.282e+03  -1.782 0.074932 .
feature294  -4.980e+03  3.438e+03  -1.449 0.147595
feature295   4.927e+03  3.097e+03   1.591 0.111744
feature296  -8.711e+03  2.512e+03  -3.467 0.000537 ***
feature297  -5.638e+03  3.892e+03  -1.449 0.147603
feature298   1.434e+03  3.150e+03   0.455 0.648981
feature299  -5.802e+03  3.719e+03  -1.560 0.118883
feature300  -5.712e+03  3.124e+03  -1.829 0.067613 .
feature301  -2.137e+03  2.539e+03  -0.842 0.400011
feature302   5.517e+03  2.826e+03   1.952 0.051058 .
feature303  -1.182e+04  4.186e+03  -2.823 0.004805 **
feature304   1.170e+04  3.105e+03   3.769 0.000169 ***
feature305   5.272e+03  3.143e+03   1.678 0.093569 .
feature306   4.727e+03  4.769e+03   0.991 0.321739
feature307  -7.177e+03  2.558e+03  -2.806 0.005070 **
feature308   9.804e+03  3.746e+03   2.617 0.008936 **
feature309   1.822e+04  3.847e+03   4.736 2.32e-06 ***
feature310   2.236e+02  2.331e+03   0.096 0.923611
feature311   1.456e+02  3.988e+03   0.037 0.970887
feature312  -5.665e+03  3.490e+03  -1.623 0.104743
feature313  -2.280e+03  3.457e+03  -0.660 0.509605
feature314  -1.620e+03  2.798e+03  -0.579 0.562686
feature315  -7.906e+01  2.455e+03  -0.032 0.974313
feature316  -6.722e+03  3.591e+03  -1.872 0.061374 .
feature317  -9.344e+03  3.307e+03  -2.826 0.004760 **
feature318  -9.038e+02  3.091e+03  -0.292 0.770022
feature319  -1.135e+03  3.687e+03  -0.308 0.758185
feature320  -1.835e+02  3.665e+03  -0.050 0.960071
feature321  -8.836e+00  3.030e+03  -0.003 0.997673
feature322   5.219e+03  2.947e+03   1.771 0.076683 .
feature323  -6.288e+02  3.378e+03  -0.186 0.852329
feature324   1.704e+03  3.356e+03   0.508 0.611776
feature325  -5.609e+03  3.899e+03  -1.438 0.150457
feature326   1.778e+03  1.073e+03   1.658 0.097559 .
feature327  -2.830e+03  2.988e+03  -0.947 0.343671
feature328  -5.539e+04  5.781e+04  -0.958 0.338057
feature329  -3.916e+03  2.409e+03  -1.626 0.104145
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62540 on 2100 degrees of freedom
Multiple R-squared:  0.8829,    Adjusted R-squared:  0.8645
F-statistic:    48 on 330 and 2100 DF,  p-value: < 2.2e-16

On the training data, the adjusted R-squared is 0.8645, indicating that the model explains a high proportion of the variance in home values.
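The adjusted R-squared reported above can be reproduced from the multiple R-squared, the sample size, and the number of predictors:

```r
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
r2 <- 0.8829   # multiple R-squared from the summary
n  <- 2431     # training observations
p  <- 330      # embedding features

round(1 - (1 - r2) * (n - 1) / (n - p - 1), 4)  # 0.8645, matching the summary
```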

Code
# Make predictions on test set
y_pred_lr <- predict(lr_model, newdata = test_data %>% dplyr::select(-index, -X2024.10.31))

y_test <- test_data$X2024.10.31

# Calculate evaluation metrics
lr_mae <- mae(y_test, y_pred_lr)
lr_rmse <- rmse(y_test, y_pred_lr)
lr_r2 <- cor(y_test, y_pred_lr)^2
# Display results
lr_results <- data.frame(
  Metric = c("MAE", "RMSE", "R²"),
  Value = c(
    round(lr_mae, 2),
    round(lr_rmse, 2),
    round(lr_r2, 4)
  )
)
kable(lr_results,
      caption = "Linear Regression Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Linear Regression Performance Metrics
Metric Value
MAE 42972.1200
RMSE 59355.7100
R² 0.8565

Linear Regression Interpretation:

  • Uses all 330 embedding features
  • No feature selection - may include irrelevant features
  • Serves as baseline for comparison
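The two error metrics are simple to compute by hand; this sketch is equivalent to Metrics::mae and Metrics::rmse on a tiny made-up example:

```r
actual    <- c(200000, 350000, 500000)
predicted <- c(210000, 340000, 520000)

mae_val  <- mean(abs(actual - predicted))       # average absolute error
rmse_val <- sqrt(mean((actual - predicted)^2))  # penalizes large errors more heavily

mae_val   # ~13333.33
rmse_val  # ~14142.14
```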

Model 2: Stepwise Regression

Code
# Perform stepwise regression using AIC criterion
# Direction "both" allows both forward and backward selection
# step_model <- stepAIC(
#   lr_model,
#   direction = "both",
#   trace = TRUE  # Set to TRUE to see step-by-step selection
# )

Stepwise Regression Interpretation:

  • Automatically selects most predictive features using AIC (Akaike Information Criterion)
  • Balances model fit with complexity
  • Typically results in a more parsimonious model
  • Shows which embedding dimensions are most important for prediction
  • Not run here: with all 330 features, stepwise selection would evaluate over 54,000 candidate models and take roughly 6-7 days to complete
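For contrast, stepwise selection is perfectly tractable on small problems; a sketch on the built-in mtcars data (10 predictors instead of 330):

```r
library(MASS)

# Full model with all 10 predictors of mpg
full_fit <- lm(mpg ~ ., data = mtcars)

# AIC-guided stepwise search in both directions
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# The selected model keeps only a few of the candidate predictors
names(coef(step_fit))
```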

Model 3: Ridge Regression

Code
# Prepare matrix format for glmnet (Ridge regression requires matrix input)
x_train <- as.matrix(train_data %>% dplyr::select(-target))
y_train <- train_data$target
x_test <- as.matrix(test_data %>% dplyr::select(-index, -X2024.10.31))

# Perform cross-validation to find optimal lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)

# Fit Ridge regression model with optimal lambda
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = cv_ridge$lambda.min)

# Make predictions on test set
y_pred_ridge <- predict(ridge_model, newx = x_test, s = cv_ridge$lambda.min)

# Calculate evaluation metrics
ridge_mae <- mae(y_test, y_pred_ridge)
ridge_rmse <- rmse(y_test, y_pred_ridge)
ridge_r2 <- cor(y_test, y_pred_ridge)^2

# Display results
ridge_results <- data.frame(
  Metric = c("MAE", "RMSE", "R²", "Optimal Lambda"),
  Value = c(
    round(ridge_mae, 2),
    round(ridge_rmse, 2),
    round(ridge_r2, 4),
    round(cv_ridge$lambda.min, 4)
  )
)

kable(ridge_results,
      caption = "Ridge Regression Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Ridge Regression Performance Metrics
Metric Value
MAE 40895.1000
RMSE 57471.6900
R² 0.8606
Optimal Lambda 14210.2992

Ridge Regression Interpretation:

  • Uses L2 regularization to penalize large coefficients
  • Shrinks coefficients but keeps all 330 features (does not perform feature selection)
  • Lambda parameter controls the amount of regularization
  • Cross-validation used to find optimal lambda value
  • Helps prevent overfitting compared to standard linear regression
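The role of lambda can be made concrete: refitting the same ridge model with a heavier penalty shrinks the whole coefficient vector toward zero (a synthetic sketch; assumes glmnet is installed):

```r
library(glmnet)

set.seed(42)
x <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)
y <- as.vector(x %*% rnorm(10) + rnorm(200))

light <- glmnet(x, y, alpha = 0, lambda = 0.01)  # mild penalty
heavy <- glmnet(x, y, alpha = 0, lambda = 100)   # heavy penalty

b_light <- as.vector(coef(light))[-1]  # drop intercept
b_heavy <- as.vector(coef(heavy))[-1]

# The squared norm of the coefficient vector drops as lambda grows
sum(b_light^2)  # larger
sum(b_heavy^2)  # smaller
```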

Model 4: Lasso Regression

Code
# Perform cross-validation to find optimal lambda for Lasso
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

# Fit Lasso regression model with optimal lambda
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = cv_lasso$lambda.min)

# Make predictions on test set
y_pred_lasso <- predict(lasso_model, newx = x_test, s = cv_lasso$lambda.min)

# Calculate evaluation metrics
lasso_mae <- mae(y_test, y_pred_lasso)
lasso_rmse <- rmse(y_test, y_pred_lasso)
lasso_r2 <- cor(y_test, y_pred_lasso)^2

# Count number of non-zero coefficients (features selected)
lasso_coefs <- coef(lasso_model, s = cv_lasso$lambda.min)
n_features_selected <- sum(lasso_coefs != 0) - 1  # Subtract 1 for intercept

# Display results
lasso_results <- data.frame(
  Metric = c("MAE", "RMSE", "R²", "Optimal Lambda", "Features Selected"),
  Value = c(
    round(lasso_mae, 2),
    round(lasso_rmse, 2),
    round(lasso_r2, 4),
    round(cv_lasso$lambda.min, 4),
    n_features_selected
  )
)

kable(lasso_results,
      caption = "Lasso Regression Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Lasso Regression Performance Metrics
Metric Value
MAE 41944.9500
RMSE 58234.7900
R² 0.8581
Optimal Lambda 587.1713
Features Selected 229.0000

Lasso Regression Interpretation:

  • Uses L1 regularization to penalize large coefficients
  • Performs automatic feature selection by shrinking some coefficients to exactly zero
  • Lambda parameter controls the amount of regularization
  • Cross-validation used to find optimal lambda value
  • Results in a sparse model with fewer features than Ridge regression
  • Helps identify the most important embedding dimensions for prediction
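Listing the retained dimensions amounts to inspecting the non-zero coefficients; a self-contained sketch with a known sparse signal (only feature0 and feature5 drive the outcome):

```r
library(glmnet)

set.seed(42)
x <- matrix(rnorm(300 * 20), nrow = 300, ncol = 20)
colnames(x) <- paste0("feature", 0:19)
y <- as.vector(3 * x[, "feature0"] - 2 * x[, "feature5"] + rnorm(300))

cv_fit <- cv.glmnet(x, y, alpha = 1)
cf     <- coef(cv_fit, s = "lambda.min")

# Retained features (intercept excluded); the true signals dominate the list
setdiff(names(which(cf[, 1] != 0)), "(Intercept)")
```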

Model Comparison

Code
# Create comparison dataframe with all models
model_comparison <- data.frame(
  Model = c("Linear Regression", "Ridge Regression", "Lasso Regression"),
  MAE = c(
    round(lr_mae, 2),
    round(ridge_mae, 2),
    round(lasso_mae, 2)
  ),
  RMSE = c(
    round(lr_rmse, 2),
    round(ridge_rmse, 2),
    round(lasso_rmse, 2)
  ),
  R_squared = c(
    round(lr_r2, 4),
    round(ridge_r2, 4),
    round(lasso_r2, 4)
  ),
  Features_Used = c(
    330,
    330,
    n_features_selected
  )
)

kable(model_comparison,
      caption = "Comparison of Regression Models",
      col.names = c("Model", "MAE", "RMSE", "R²", "Features Used")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(which.min(model_comparison$RMSE), bold = TRUE, color = "white", background = "#4CAF50")
Comparison of Regression Models
Model MAE RMSE R² Features Used
Linear Regression 42972.12 59355.71 0.8565 330
Ridge Regression 40895.10 57471.69 0.8606 330
Lasso Regression 41944.95 58234.79 0.8581 229

Key Insights:

  • Ridge regression performs best on unseen data here (lowest MAE and RMSE, highest R²)
  • Ridge regression uses all features but with regularization to prevent overfitting
  • Lasso regression performs feature selection, using fewer features while maintaining accuracy
  • Feature reduction can lead to better interpretability and faster predictions
  • The best performing model (lowest RMSE) is highlighted in green
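The regularized models' gain over the baseline can be quantified directly from the RMSE values in the comparison table above:

```r
# RMSE values copied from the comparison table
lr_rmse    <- 59355.71
ridge_rmse <- 57471.69
lasso_rmse <- 58234.79

round(100 * (lr_rmse - ridge_rmse) / lr_rmse, 1)  # Ridge: ~3.2% lower RMSE
round(100 * (lr_rmse - lasso_rmse) / lr_rmse, 1)  # Lasso: ~1.9% lower RMSE
```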

Visualization of Model Performance

Actual vs. Predicted Plot - Linear Regression

Code
# Create evaluation dataframe for linear regression
eval_df_lr <- data.frame(
  actual = y_test,
  predicted = y_pred_lr
)

# Plot
ggplot(eval_df_lr, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  coord_fixed(xlim = c(0, 1000000), ylim = c(0, 1000000)) +
  labs(
    title = "Linear Regression: Actual vs Predicted",
    subtitle = paste0("R² = ", round(lr_r2, 4)),
    x = "Actual Home Value ($)",
    y = "Predicted Home Value ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

Actual vs. Predicted Plot - Ridge Regression

Code
# Create evaluation dataframe for Ridge regression
eval_df_ridge <- data.frame(
  actual = y_test,
  predicted = as.vector(y_pred_ridge)
)

# Plot
ggplot(eval_df_ridge, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  coord_fixed(xlim = c(0, 1000000), ylim = c(0, 1000000)) +
  labs(
    title = "Ridge Regression: Actual vs Predicted",
    subtitle = paste0("R² = ", round(ridge_r2, 4)),
    x = "Actual Home Value ($)",
    y = "Predicted Home Value ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

Interpretation of Scatter Plots:

  • Points along the red diagonal line indicate perfect predictions
  • Points above the line = model overestimates home value
  • Points below the line = model underestimates home value
  • Tighter clustering around the diagonal = better model performance
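The over/under-estimation split described above can be quantified directly from the evaluation dataframe, e.g. for the linear model (a quick check reusing `eval_df_lr` from the block above):

```r
# Fraction of test counties the linear model over-predicts
mean(eval_df_lr$predicted > eval_df_lr$actual)

# Median absolute error as a robust companion to MAE/RMSE
median(abs(eval_df_lr$predicted - eval_df_lr$actual))
```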

Actual vs. Predicted Plot - Lasso Regression

Code
# Create evaluation dataframe for Lasso regression
eval_df_lasso <- data.frame(
  actual = y_test,
  predicted = as.vector(y_pred_lasso)
)

# Plot
ggplot(eval_df_lasso, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.5, color = "darkorange") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  coord_fixed(xlim = c(0, 1000000), ylim = c(0, 1000000)) +
  labs(
    title = "Lasso Regression: Actual vs Predicted",
    subtitle = paste0("R² = ", round(lasso_r2, 4), " | Features: ", n_features_selected, "/330"),
    x = "Actual Home Value ($)",
    y = "Predicted Home Value ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

Spatial Visualization of Prediction Errors

Calculate Prediction Differences

Code
# Recover the original place identifiers for the test split
# (the test rows are those not sampled into train_indices)
test_places <- data[-train_indices, "place"]

# Create comparison dataframe
prediction_comparison <- data.frame(
  place = test_places,
  actual = y_test,
  pred_lr = y_pred_lr,
  pred_ridge = as.vector(y_pred_ridge),
  pred_lasso = as.vector(y_pred_lasso)
) %>%
  mutate(
    diff_lr = pred_lr - actual,
    diff_ridge = pred_ridge - actual,
    diff_lasso = pred_lasso - actual
  )

Map Prediction Errors

Linear Regression Errors:

Code
# Join prediction differences with county geometries
error_map_data_lr <- county_gdf %>%
  inner_join(prediction_comparison, by = "place")
# Create color palette for differences (blue = underestimate, red = overestimate)
pal_diff_lr <- colorNumeric(
  palette = c("blue", "white", "red"),
  domain = c(-200000, 200000),
  na.color = "gray"
)
# Create interactive map showing Linear regression errors
leaflet(error_map_data_lr) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_diff_lr(diff_lr),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>County</strong><br>",
      "Actual: $", format(round(actual), big.mark = ","), "<br>",
      "Predicted (Linear): $", format(round(pred_lr), big.mark = ","), "<br>",
      "Difference: $", format(round(diff_lr), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_diff_lr,
    values = ~diff_lr,
    title = "Prediction Error<br>(Linear Model)",
    opacity = 1,
    labFormat = labelFormat(prefix = "$")
  )

Ridge Regression Errors:

Code
# Join prediction differences with county geometries
error_map_data_ridge <- county_gdf %>%
  inner_join(prediction_comparison, by = "place")
# Create color palette for differences (blue = underestimate, red = overestimate)
pal_diff_ridge <- colorNumeric(
  palette = c("blue", "white", "red"),
  domain = c(-200000, 200000),
  na.color = "gray"
)
# Create interactive map showing Ridge regression errors
leaflet(error_map_data_ridge) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_diff_ridge(diff_ridge),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>County</strong><br>",
      "Actual: $", format(round(actual), big.mark = ","), "<br>",
      "Predicted (Ridge): $", format(round(pred_ridge), big.mark = ","), "<br>",
      "Difference: $", format(round(diff_ridge), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_diff_ridge,
    values = ~diff_ridge,
    title = "Prediction Error<br>(Ridge Model)",
    opacity = 1,
    labFormat = labelFormat(prefix = "$")
  )

Lasso Regression Errors:

Code
# Join prediction differences with county geometries
error_map_data <- county_gdf %>%
  inner_join(prediction_comparison, by = "place")

# Create color palette for differences (blue = underestimate, red = overestimate)
pal_diff <- colorNumeric(
  palette = c("blue", "white", "red"),
  domain = c(-200000, 200000),
  na.color = "gray"
)

# Create interactive map showing Lasso regression errors
leaflet(error_map_data) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_diff(diff_lasso),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>County</strong><br>",
      "Actual: $", format(round(actual), big.mark = ","), "<br>",
      "Predicted (Lasso): $", format(round(pred_lasso), big.mark = ","), "<br>",
      "Difference: $", format(round(diff_lasso), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_diff,
    values = ~diff_lasso,
    title = "Prediction Error<br>(Lasso Model)",
    opacity = 1,
    labFormat = labelFormat(prefix = "$")
  )

Spatial Error Analysis:

  • Red areas: Model overestimates home values (predicted > actual)
  • Blue areas: Model underestimates home values (predicted < actual)
  • White areas: Predictions close to actual values
  • This spatial visualization can reveal geographic patterns in model performance
  • Systematic errors in specific regions may indicate missing spatial features or local market conditions not captured by the embeddings
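One way to test whether the errors actually cluster geographically rather than just appearing to on the map is Moran's I on the residuals. A sketch using the `spdep` package (not loaded above), applied to the joined map data from the Lasso error map:

```r
library(spdep)

# Build contiguity-based neighbors from the county polygons
nb <- poly2nb(error_map_data)
lw <- nb2listw(nb, zero.policy = TRUE)

# Moran's I on the Lasso residuals; a significantly positive statistic
# means errors of similar sign cluster geographically
moran.test(error_map_data$diff_lasso, lw, zero.policy = TRUE)
```

A significant result would support adding region-level features or a spatially aware model.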

Summary and Conclusions

Key Findings

  1. PDFM Embeddings as Features: The 330-dimensional PDFM embeddings capture spatial patterns relevant to home value prediction, explaining roughly 86% of the variance in held-out home values (R² ≈ 0.86 across all three models).

  2. Model Performance:

    • Linear regression provides a straightforward baseline using all features
    • Ridge regression uses L2 regularization to reduce overfitting while keeping all features
    • Lasso regression performs automatic feature selection through L1 regularization
  3. Regularization Benefits:

    • Reduced overfitting compared to standard linear regression
    • Ridge: Shrinks coefficients but maintains all 330 features
    • Lasso: Identifies most important embedding dimensions through feature selection
    • Cross-validation ensures optimal regularization strength (lambda)
    • Improved generalization to unseen data
  4. Spatial Patterns: Error maps reveal geographic variations in prediction accuracy, suggesting opportunities for:

    • Regional model calibration
    • Incorporation of additional local features
    • Investigation of market-specific factors

Methodological Considerations

Advantages of Using Embeddings:

  • Captures complex, non-linear relationships
  • Incorporates diverse data sources (mobility, search trends, environment)
  • Transfer learning from large-scale models
  • Rich spatial representation

Limitations:

  • Black-box nature makes interpretation challenging
  • Embeddings may encode biases from training data

Future Directions

  1. Advanced Models: Try ensemble methods (Random Forest, XGBoost) or neural networks
  2. Feature Engineering: Combine embeddings with traditional features (square footage, bedrooms, etc.)
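As a starting point for the first direction, a hedged sketch of a random forest baseline using the `ranger` package (not loaded above), assuming the same `X_train`/`X_test` matrices and `y_train`/`y_test` vectors used for glmnet:

```r
library(ranger)
library(Metrics)

# Fit a random forest on the 330 embedding dimensions
rf_fit <- ranger(
  x = as.data.frame(X_train),
  y = y_train,
  num.trees = 500,
  importance = "impurity"  # per-dimension importance scores, a partial
)                          # remedy for the embeddings' black-box nature

# Evaluate on the held-out counties
rf_pred <- predict(rf_fit, data = as.data.frame(X_test))$predictions
rmse(y_test, rf_pred)
```

Comparing this RMSE against the regularized linear models would show whether non-linear interactions among embedding dimensions carry additional signal.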

References