Keywords: Spatial Lag Regression, Spatial Error Regression, Geographically Weighted Regression, OLS
GitHub Repository: MUSA5000-Spatial-Correlation | Website
Philadelphia has experienced significant demographic and economic transformations over recent decades, leading to notable implications for its urban housing market. These shifts have resulted in variations in median house values, which serve not only as a reflection of the city’s economic health but also as a proxy for broader social and spatial dynamics. Increasing median house values may indicate an influx of higher-income residents or early stages of gentrification, whereas declining values can be symptomatic of disinvestment and economic decline. Given these dynamics, accurately forecasting median house values is vital for urban planners and policymakers who are tasked with promoting sustainable and equitable urban development.
In our previous study, Ordinary Least Squares (OLS) regression was used to explore the relationships between median house values, the dependent variable, and several key socio-economic predictors in Philadelphia. These predictors included educational attainment, vacancy rates, the proportion of detached single-family homes, and the poverty rate. Each of these factors influenced home prices in different ways.
Although OLS regression provides a foundational understanding of the relationships between the predictors and the dependent variable, it has limitations when applied to spatial data. One of the key assumptions of OLS regression is that observations are independent of one another, that is, free of spatial autocorrelation. However, in spatial data, observations that are geographically close often exhibit similarity, leading to spatial autocorrelation and violating this assumption. Violating the assumptions of OLS may lead to biased and inefficient estimates of the regression coefficients, and incorrect inferences about the relationships between the predictors and the dependent variable.
To address these limitations, this report employs advanced spatial regression techniques to predict median house values in Philadelphia. We use Spatial Lag Regression, Spatial Error Regression, and Geographically Weighted Regression (GWR) to account for spatial autocorrelation and spatial heterogeneity in the data. We examine whether the spatial regression models can accurately predict the home prices when compared to the Ordinary Least Squares (OLS) regression models. By utilizing these spatial techniques, this study aims to improve the accuracy of the initial OLS findings and provide a more comprehensive understanding of the socio-economic and spatial factors influencing housing values. These insights will support more effective policy interventions and urban development strategies aimed at achieving equitable and sustainable growth in Philadelphia.
Spatial autocorrelation describes the degree to which a variable is correlated with itself across space. It shows the relationship of values within a single variable at nearby locations, helping in understanding patterns of spatial distribution and identifying clusters or dispersions in spatial data. The concept of spatial autocorrelation is rooted in The First Law of Geography, which states:
“Everything is related to everything else, but near things are more related than distant things.”
This principle suggests that geographically proximate areas tend to exhibit similar characteristics due to shared environmental, economic, or social factors.
Spatial autocorrelation measures how much a variable in one location is influenced by values in nearby locations. If observations that are closer to each other in space have related values, spatial autocorrelation is positive. Conversely, if observations that are closer to each other have markedly different values, spatial autocorrelation is negative.
Moran’s I is a widely used method for measuring spatial autocorrelation. The formula for Moran’s I is:
\[ I = \frac{N}{\sum_{i} \sum_{j} w_{ij}} \times \frac{\sum_{i} \sum_{j} w_{ij} (X_i - \bar{X}) (X_j - \bar{X})}{\sum_{i} (X_i - \bar{X})^2} \]
where \(N\) is the number of observations, \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\), \(X_i\) is the value of the variable at location \(i\), and \(\bar{X}\) is the mean of the variable across all locations.
A Moran’s I value close to +1 indicates strong positive spatial autocorrelation (clusters of similar values). A value near -1 suggests strong negative spatial autocorrelation (dispersion). A value near 0 implies no spatial autocorrelation.
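To make the formula concrete, Moran's I can be computed by hand in base R. The values and binary weight matrix below are hypothetical, chosen only so the arithmetic is easy to follow:

```r
# Toy example (hypothetical data): four regions in a row; adjacent regions are neighbors
x <- c(1, 2, 3, 4)
W <- matrix(0, 4, 4)
W[cbind(c(1, 2, 2, 3, 3, 4), c(2, 1, 3, 2, 4, 3))] <- 1  # binary contiguity weights

morans_i <- function(x, W) {
  n <- length(x)
  d <- x - mean(x)                 # deviations from the mean
  num <- sum(W * outer(d, d))      # sum_i sum_j w_ij (X_i - Xbar)(X_j - Xbar)
  (n / sum(W)) * num / sum(d^2)
}
morans_i(x, W)   # 1/3: values increasing smoothly along the line cluster positively
```

Because neighboring values rise together along the line, the cross-products for neighbor pairs are mostly positive and Moran's I comes out positive.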
When dealing with spatial data, we use spatial weight matrices to define relationships between observations. Given \(n\) observations, we construct an \(n \times n\) matrix that summarizes all pairwise spatial relationships in the dataset. These matrices are essential for estimating spatial regression models and calculating spatial autocorrelation indices.
There are several ways to define spatial relationships within a weight matrix. A Queen contiguity matrix assigns a weight of 1 if two regions share a border or a vertex, and 0 otherwise. A Rook contiguity matrix assigns a weight of 1 only if two regions share a border (an edge), and 0 otherwise. A distance-based matrix assigns weights based on the inverse distance between observations.
In this report, we use the Queen contiguity weight matrix, which considers all neighboring regions that share either a boundary or a vertex.
Although we only use the queen contiguity weight matrix in this report, analysts often use multiple spatial weight matrices to check the robustness of their results. Since different spatial weights capture spatial dependencies at various levels of granularity, this ensures the results are not merely an artifact of the particular matrix chosen.
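The 'W' (row-standardized) style used throughout this report gives each neighbor of region \(i\) a weight of \(1/(\text{number of neighbors of } i)\), so a spatial lag becomes the average of neighboring values. A minimal base-R sketch, using a hypothetical binary queen contiguity matrix for a 2x2 grid of regions (every region touches every other at least at a corner):

```r
# Hypothetical binary queen contiguity for a 2x2 grid of regions
B <- rbind(c(0, 1, 1, 1),
           c(1, 0, 1, 1),
           c(1, 1, 0, 1),
           c(1, 1, 1, 0))
# Row standardization ('W' style): divide each row by its neighbor count
W <- B / rowSums(B)
rowSums(W)   # every row of the standardized matrix sums to 1
```

This is what `nb2listw(..., style = 'W')` does internally when we build the queen weights list later in the report.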
To determine whether spatial autocorrelation is statistically significant, we conduct a hypothesis test:
Null Hypothesis (\(H_0\)): No spatial autocorrelation, which means that the spatial distribution of values follows a random pattern with no systematic clustering or dispersion. Each location’s value is independent of the values at its neighboring locations.
Alternative Hypothesis 1 (\(H_{a1}\)): Positive spatial autocorrelation, which means similar values tend to cluster together. High values are surrounded by other high values, and low values are surrounded by other low values, forming distinct spatial patterns.
Alternative Hypothesis 2 (\(H_{a2}\)): Negative spatial autocorrelation, which means similar values tend to disperse rather than cluster. High values are surrounded by low values and vice versa, leading to a checkerboard-like spatial distribution.
To test significance, we use random shuffling (a permutation test). First, we randomly shuffle the variable values across spatial locations many times (999 permutations are used in this report). Then, we compute Moran’s I for each permuted dataset to generate a reference distribution. Finally, we compare the observed Moran’s I to this distribution to determine whether it is extreme, concluding whether the observed clustering pattern is statistically meaningful rather than occurring by chance.
If the observed Moran’s I falls in the extreme tail of the simulated distribution, we reject the null hypothesis (H₀) in favor of the appropriate alternative hypothesis. A p-value less than 0.05 typically indicates significant spatial autocorrelation.
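The permutation procedure above can be sketched in base R. The data here are simulated (a smooth trend over a hypothetical 5x5 grid), not the report's block groups, and in the analysis itself this test is run with spdep's `moran.mc`:

```r
set.seed(42)
# Hypothetical data: 25 regions on a 5x5 grid with a smooth spatial trend
n <- 25
coords <- expand.grid(row = 1:5, col = 1:5)
x <- coords$row + coords$col + rnorm(n, sd = 0.3)

# Binary rook contiguity on the grid (neighbors share an edge)
W <- outer(1:n, 1:n, function(i, j) {
  abs(coords$row[i] - coords$row[j]) + abs(coords$col[i] - coords$col[j]) == 1
}) * 1

morans_i <- function(x, W) {
  d <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(d, d)) / sum(d^2)
}

obs   <- morans_i(x, W)
perms <- replicate(999, morans_i(sample(x), W))   # reshuffle values across locations
p     <- (sum(perms >= obs) + 1) / (999 + 1)      # one-sided pseudo p-value
c(observed = obs, p_value = p)
```

Because the simulated surface has a strong trend, the observed Moran's I sits far above the permutation distribution and the pseudo p-value is very small.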
While global Moran’s I provides a single statistic for the entire study area, Local Indicators of Spatial Association (LISA) provides insights into the presence of spatial autocorrelation at individual locations.
To determine whether local spatial autocorrelation is statistically significant, we conduct a hypothesis test at each location, with the same null and alternative hypotheses as in the global case.
Significance tests for local Moran’s I are conducted using random shuffling to ensure that detected clusters are not merely due to random chance. This process follows the same approach as global Moran’s I but involves randomly reshuffling the values of the variable across the study area while keeping the value at location \(i\) constant. By comparing the observed local Moran’s I to the distribution of values from these random permutations, statistical significance can be assessed.
If the observed \(I_i\) is extremely high or low compared to the reshuffled values, it is considered significant. The pseudosignificance value is estimated by noting the rank of the actual \(I_i\) among the permutations. For instance, if the original \(I_i\) ranks as the 97th highest among 999 permutations, the estimated pseudosignificance is \(p \approx 0.097\).
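The pseudo-significance arithmetic can be written as a one-line helper (a hypothetical function name, shown only to make the rank-to-p-value conversion explicit):

```r
# Pseudo p-value from a permutation rank: the observed statistic's rank among
# the (nperm + 1) values that include the observation itself
pseudo_p <- function(rank, nperm) rank / (nperm + 1)
pseudo_p(97, 999)   # 0.097, matching the example in the text
```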
To analyze the relationship between socioeconomic factors and median house values in Philadelphia, we often use OLS (Ordinary Least Squares) Regression. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives. The key assumptions of OLS regression include:
Linearity assumes that the relationship between the dependent variable and the predictors is linear.
Independence of Observations assumes that the observations are independent of each other. There should be no spatial or temporal or other forms of dependence in the data.
Homoscedasticity assumes that the variance of the residuals \(\epsilon\) is constant regardless of the values of each level of the predictors.
Normality of Residuals assumes that the residuals are normally distributed.
No Multicollinearity assumes that the predictors are not highly correlated with each other.
No Fewer than 10 Observations per Predictor assumes that there are at least 10 observations for each predictor in the model.
In our first assignment, we used OLS regression to assess how vacancy rates, single-family housing percentage, educational attainment, and poverty rates influence median house values in Philadelphia. All predictors were statistically significant. The model’s R-squared was 0.66, indicating that the model explains 66% of the variance in house values.
However, some predictors exhibited non-linear patterns, and spatial autocorrelation suggested dependence among observations. One of the vital assumptions of OLS regression is that observations are independent of each other. In spatial data, observations that are geographically close often exhibit similarity, leading to spatial autocorrelation and violating the independence assumption. When spatial autocorrelation is present, values of a variable in nearby areas are related rather than randomly distributed. We therefore need to further test for spatial autocorrelation and the key assumptions of OLS regression in order to improve the model’s accuracy and reliability.
Furthermore, when data has a spatial component, the assumption of normality of residuals often fails to hold. In some cases, spatial autocorrelation does not significantly impact regression analysis. If the dependent variable exhibits strong spatial autocorrelation while the error term does not, the regression coefficients and significance levels remain valid. Additionally, if both the dependent and independent variables share an identical spatial pattern, and the spatial dependencies in the dependent variable are fully explained by those in the independent variable, the residuals may be spatially independent. However, this is not always the case, and it is essential to test for spatial autocorrelation in residuals to ensure the validity of the model.
To test this assumption, spatial autocorrelation of the residuals can be examined using Moran’s I, which measures whether residuals are clustered, dispersed, or randomly distributed in space. As mentioned before, we first extract the residuals and define a spatial weights matrix (e.g., Queen or Rook contiguity). Then, Moran’s I is computed to measure the degree of clustering in residuals, with values close to +1 indicating positive spatial autocorrelation, -1 indicating negative autocorrelation, and 0 suggesting randomness.
Another method to test for spatial autocorrelation in OLS residuals is to regress them on the residuals from nearby observations. In this report, nearby residuals refer to residuals from neighboring block groups, as defined by the Queen matrix. The regression line between the residuals, OLS_RESIDU, and WT_RESIDU (weighted residuals from neighboring groups) helps identify any spatial autocorrelation. The slope (b) of this regression represents the strength of spatial dependence. It is calculated by estimating the relationship between the residuals of one observation and those of its neighbors.
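This residual-on-lagged-residual check can be sketched in base R. The variable names OLS_RESIDU and WT_RESIDU follow the text; the weights and residuals below are simulated (units on a hypothetical ring, each with two neighbors), not the report's block groups:

```r
set.seed(1)
n <- 100
# Hypothetical row-standardized weights: units on a ring, two neighbors each
W <- matrix(0, n, n)
for (i in 1:n) {
  W[i, if (i == 1) n else i - 1] <- 0.5
  W[i, if (i == n) 1 else i + 1] <- 0.5
}

# Simulate residuals with built-in spatial structure: e = (I - rho W)^(-1) u
rho <- 0.6
OLS_RESIDU <- as.vector(solve(diag(n) - rho * W, rnorm(n)))
WT_RESIDU  <- as.vector(W %*% OLS_RESIDU)   # mean residual of each unit's neighbors

b <- coef(lm(OLS_RESIDU ~ WT_RESIDU))["WT_RESIDU"]
b   # a clearly positive slope signals positive spatial autocorrelation
```

With spatially independent residuals the slope would hover near zero; here the simulated spatial structure produces a positive slope, which is the pattern we look for in the real residuals.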
In R, there are methods to test the other key assumptions as well. We will continue using R for the analysis.
Another key assumption is homoscedasticity, which assumes that the variance of the errors (residuals) remains constant across all levels of the independent variables. In R, we used the Breusch-Pagan test, the Koenker-Bassett test (also known as the studentized Breusch-Pagan test), and the White test to detect heteroscedasticity. If the p-value is less than 0.05, we reject the null hypothesis of homoscedasticity in favor of the alternative hypothesis of heteroscedasticity.
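The studentized (Koenker-Bassett) statistic is simple to compute by hand: regress the squared residuals on the predictors and take LM = n times the R-squared of that auxiliary regression, referred to a chi-squared distribution. A self-contained base-R sketch on simulated data (not the report's dataset):

```r
set.seed(7)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)   # error variance grows with x: heteroscedastic
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictors;
# LM = n * R-squared is chi-squared (df = number of predictors) under the null
aux <- lm(residuals(fit)^2 ~ x)
LM  <- n * summary(aux)$r.squared
p   <- pchisq(LM, df = 1, lower.tail = FALSE)
c(statistic = LM, p_value = p)   # a small p-value flags heteroscedasticity
```

In practice we rely on `lmtest::bptest` rather than this manual version, but the arithmetic is the same.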
Another assumption is normality of errors, which assumes that residuals follow a normal distribution, a crucial requirement for valid hypothesis testing and confidence intervals. In R, we used the Jarque-Bera test. The p-value determines whether the residuals follow a normal distribution: if it is less than 0.05, we reject the null hypothesis (H₀) of normality in favor of the alternative hypothesis of non-normality.
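The Jarque-Bera statistic combines the sample skewness \(S\) and kurtosis \(K\) of the residuals, \(JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)\), compared against a chi-squared distribution with 2 degrees of freedom. A base-R sketch on simulated draws (not the report's residuals):

```r
set.seed(3)
jarque_bera <- function(e) {
  n <- length(e)
  m <- e - mean(e)
  S <- mean(m^3) / mean(m^2)^1.5   # sample skewness
  K <- mean(m^4) / mean(m^2)^2     # sample kurtosis
  JB <- n / 6 * (S^2 + (K - 3)^2 / 4)
  c(JB = JB, p = pchisq(JB, df = 2, lower.tail = FALSE))
}
jarque_bera(rnorm(1000))       # normal draws: JB small, should usually fail to reject
jarque_bera(rexp(1000) - 1)    # strongly skewed draws: JB large, reject normality
```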
In this report, we also use R to run spatial lag and spatial error regressions. Spatial lag regression assumes that the value of the dependent variable at one location is associated with the values of that variable at nearby locations, defined by a spatial weights matrix \(W\) (based on rook or queen neighbors, or on observations within a certain distance of one another). In our context, the spatial lag model is defined as follows:
\[ \text{LNMEDHVAL} = \rho W \times \text{LNMEDHVAL} + \beta_0 + \beta_1 \times \text{PCTVACANT} + \beta_2 \times \text{PCTSINGLES} + \beta_3 \times \text{PCTBACHMOR} + \beta_4 \times \text{LNNBELPOV100} + \epsilon_i \]
where \(\rho\) is the spatial autoregressive coefficient measuring the strength of spatial dependence, and \(W \times \text{LNMEDHVAL}\) is the spatially lagged dependent variable, i.e., the weighted average of \(\text{LNMEDHVAL}\) at neighboring locations. The other terms are the same as in the OLS regression model.
The spatial error model, on the other hand, assumes that the residuals of the model are spatially autocorrelated: the residual at one location is associated with residuals at nearby locations defined by the spatial weights matrix \(W\), in this case the queen contiguity matrix. The spatial error model is defined as follows:
\[ \text{LNMEDHVAL} = \beta_0 + \beta_1 \times \text{PCTVACANT} + \beta_2 \times \text{PCTSINGLES} + \beta_3 \times \text{PCTBACHMOR} + \beta_4 \times \text{LNNBELPOV100} + \lambda W \times \epsilon + u \]
where \(\lambda\) is the spatial autoregressive coefficient for the error term, \(W \times \epsilon\) is the spatially lagged error, and \(u\) is an independent, identically distributed error term. The other terms are the same as in the OLS regression model.
Both spatial error regression and spatial lag regression require the standard assumptions of OLS regression, including linearity, homoscedasticity, and normality of residuals, except for the assumption of spatial independence among observations. This adjustment allows the model to account for spatial autocorrelation and spatial heterogeneity in the data through either the dependent variable (spatial lag model) or the error term (spatial error model). These two models minimize spatial patterns in residuals that could lead to biased and inefficient estimates.
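In R, these two models are commonly fit with the spatialreg package. The calls below are a sketch rather than executed output: they assume the data object and the queen weights list queenlist built elsewhere in this report, with the formula mirroring our OLS specification:

```r
library(spatialreg)

# Spatial lag model: adds the rho * W * LNMEDHVAL term to the OLS specification
lag_mod <- lagsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
                    data = data, listw = queenlist)

# Spatial error model: models the lambda * W * epsilon structure in the errors
err_mod <- errorsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
                      data = data, listw = queenlist)

summary(lag_mod)   # reports rho, coefficients, log likelihood, and AIC
summary(err_mod)   # reports lambda, coefficients, log likelihood, and AIC
```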
We compare the results of spatial lag and spatial error regression with the OLS regression to decide whether the two spatial models perform better than OLS regression based on several criteria: Akaike Information Criterion (AIC), Schwarz Criterion (SC, also known as Bayesian Information Criterion, BIC), Log likelihood, and likelihood ratio test.
The Akaike Information Criterion (AIC) and Schwarz Criterion (SC or BIC) are used to compare the models’ goodness of fit. They work by estimating how much information is lost when a model is used to represent reality. Essentially, they balance how accurate the model is against how complicated it is. A lower AIC or SC score means the model does a better job at this balance.
The log likelihood is a measure used in maximum likelihood estimation for fitting a statistical model to the data and estimating model parameters. Maximum likelihood picks the values of the parameters that make the observed data as likely as possible. The higher the log likelihood, the better the model explains the data.
The Likelihood Ratio Test is used to test whether adding spatial dependence to a model (as in the spatial lag or spatial error model) significantly improves the model’s fit compared to the OLS model. To reject the null hypothesis in favor of the alternative hypothesis that the spatial model provides a significantly better fit than OLS, the likelihood ratio test’s p-value should be less than the significance level, typically 0.05. If so, we conclude that the spatial model is better than the OLS model; if not, the OLS model is adequate.
Note: the likelihood ratio test is not used to compare the spatial lag and spatial error models with each other, but to compare each spatial model with the OLS model. The likelihood ratio test is only valid for nested models, meaning that one model is a simplified version of the other: the more complicated model contains all the same terms as the simpler model, plus extra ones. The spatial lag and spatial error models are not nested in this way.
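The likelihood-ratio arithmetic itself is simple: \(LR = 2(\log L_{\text{spatial}} - \log L_{\text{OLS}})\), compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters (one here, \(\rho\) or \(\lambda\)). A sketch with hypothetical log-likelihood values, not the report's actual results:

```r
# Hypothetical log likelihoods, for illustration only
ll_ols     <- -711.5
ll_spatial <- -650.2

LR <- 2 * (ll_spatial - ll_ols)               # likelihood ratio statistic
p  <- pchisq(LR, df = 1, lower.tail = FALSE)  # one added parameter (rho or lambda)
c(LR = LR, p_value = p)   # a tiny p-value: the spatial model fits significantly better
```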
Alternatively, we can also compare the spatial models to OLS using the Moran’s I statistic, which measures the spatial autocorrelation of the residuals. Moran’s I ranges from -1 to 1, where -1 indicates perfect dispersion, 0 indicates no spatial autocorrelation, and 1 indicates perfect correlation. Our goal in using spatial models is to minimize the spatial autocorrelation of the residuals. If the Moran’s I of a spatial model’s residuals is closer to 0 than that of the OLS model’s residuals, then the spatial model is better at minimizing spatial autocorrelation, and we can conclude that it better captures the spatial dependencies in the data than the OLS model.
We also conduct Geographically Weighted Regression (GWR) analysis in R. Geographically Weighted Regression is a form of local regression that helps address spatial heterogeneity in data, which is essential when analyzing spatial data prone to Simpson’s Paradox – a phenomenon where trends identified in aggregated data may differ from those found within smaller subsets of the data. GWR allows us to examine the relationships at a local level rather than assuming they are uniform across the study area. The general GWR model is defined as follows:
\[ y_i = \beta_{i0} + \sum_{k=1}^{m} \beta_{ik}x_{ik} + \epsilon_i \] where \(y_i\) is the dependent variable at location \(i\), \(\beta_{i0}\) is the local intercept at location \(i\), \(\beta_{ik}\) is the local coefficient of the \(k\)-th predictor at location \(i\), \(x_{ik}\) is the value of the \(k\)-th predictor at location \(i\), and \(\epsilon_i\) is the error term.
In Geographically Weighted Regression (GWR), local regression is performed by fitting a regression model at each observation point, using a subset of neighboring points. These neighbors are weighted according to their distance from the focal point. The bandwidth controls the number of neighbors used in the regression, which influences the degree of locality in the model. A smaller bandwidth results in a more localized model, while a larger bandwidth results in a more global model.
There are two types of bandwidths: adaptive and fixed. A fixed bandwidth uses a constant distance for all points, while an adaptive bandwidth adjusts dynamically, ensuring a consistent number of neighbors for each regression point regardless of variations in data density. In this case, we use an adaptive bandwidth, which is more appropriate as it accounts for varying spatial densities in the data. This adaptive method offers greater flexibility, allowing the model to better capture local relationships in areas with differing population distributions.
Although the GWR model allows for spatial variation in relationships, the standard OLS assumptions, including linearity, independence of observations, homoscedasticity, and normality of residuals, still apply. Multicollinearity is assessed using the condition number; high multicollinearity can lead to unstable estimates and clustering in parameter estimates. It is also important to note that GWR does not provide p-values for coefficients, as the model focuses on exploring spatial patterns rather than testing global hypotheses.
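In R, GWR with an adaptive bandwidth is typically run with the spgwr package. The sketch below illustrates the workflow; shp is a placeholder name for a spatial (sp) version of our dataset, and these calls are not executed here:

```r
library(spgwr)

# Choose an adaptive bandwidth by cross-validation: the result is the share of
# observations used as neighbors at each regression point
bw <- gwr.sel(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
              data = shp, adapt = TRUE)

# Fit GWR with that adaptive bandwidth; coefficients are estimated locally
gwr_mod <- gwr(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
               data = shp, adapt = bw, se.fit = TRUE, hatmatrix = TRUE)
gwr_mod$SDF   # spatial data frame holding the local coefficient estimates
```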
The Global Moran’s I analysis for our dependent variable, LNMEDHVAL (the natural log of median house value), reveals a significant level of spatial autocorrelation. With a Moran’s I value of 0.794, the data exhibits strong positive spatial autocorrelation, indicating that areas with similar median house values tend to cluster together. This suggests that high-value areas are surrounded by other high-value areas, while low-value areas are surrounded by other low-value areas.
# Queen contiguity neighbors and row-standardized ('W' style) spatial weights
queen <- poly2nb(data, row.names = data$POLY_ID)
queenlist <- nb2listw(queen, style = 'W')
# Global Moran's I for the log of median house value
moran(data$LNMEDHVAL, queenlist, n = length(queenlist$neighbours), S0 = Szero(queenlist))$`I`
## [1] 0.794
To validate the significance of this observed spatial autocorrelation, a Monte Carlo permutation test was conducted using 999 simulations. This approach involves randomly permuting the values of LNMEDHVAL across spatial locations to generate a distribution of Moran’s I values under the null hypothesis of no spatial autocorrelation. The results, shown in the histogram below, indicate that the observed Moran’s I value of 0.794 is significantly higher than the values generated from random permutations, emphasizing the presence of spatial autocorrelation in the data. The test yields a p-value less than 0.05. This exceptionally low p-value leads us to reject the null hypothesis of no spatial autocorrelation, confirming that the spatial autocorrelation in LNMEDHVAL is statistically significant.
##
## Monte-Carlo simulation of Moran I
##
## data: data$LNMEDHVAL
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.8, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided
ggplot(data.frame(res = globalmoranMC$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept = globalmoranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Global Moran's I",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
We also create a scatterplot to visualize the relationship between the dependent variable (log-transformed median house value) and its spatial lag. The x-axis represents the logged median house values LNMEDHVAL, while the y-axis displays the spatially lagged values of LNMEDHVAL, calculated based on a queen contiguity spatial weights matrix that considers neighboring spatial units. If no spatial autocorrelation were present, the plot would display no clear pattern. However, the plot shows a positive linear relationship between the logged median house values and their spatial lags, indicating the presence of positive spatial autocorrelation, where areas with higher median house values are surrounded by other high-value areas. These results also support the findings of the Global Moran’s I analysis, which identified strong positive spatial autocorrelation in the dependent variable.
ggplot(data = data.frame(
LNMEDHVAL = data$LNMEDHVAL,
spatial_lag = lag.listw(queenlist, data$LNMEDHVAL)
), aes(x = LNMEDHVAL, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.7, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Global Moran's I Scatter Plot",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
Then, we analyze the Local Moran’s I values for the dependent variable LNMEDHVAL. Unlike Global Moran’s I, which looks at the overall pattern, Local Moran’s I helps us find specific areas where values are similar to or different from nearby areas. This allows us to identify clusters of high or low values and unusual areas that stand out. We calculate Local Moran’s I for each census block group to see if there are significant patterns in transformed median housing values.
# d. Local Moran's I (LISA analysis) for LNMEHVAL
lmoran<-localmoran(data$LNMEDHVAL, queenlist)
localmoran <-cbind(data, as.data.frame(lmoran))
To visualize the results, we created a significance map showing the p-values of Local Moran’s I. The p-value tells us whether the spatial autocorrelation is statistically significant or not. Lower p-values refer to areas where the spatial autocorrelation is statistically significant, either hotspots where high values concentrate together or coldspots where low values concentrate together. In the map, dark red (p ≤ 0.001), red (p ≤ 0.01), and light pink (p ≤ 0.05) indicate areas where similar median housing values are grouped together, showing a significant spatial pattern. In contrast, grey (p > 0.05) represents areas without a strong pattern, where median housing values are more randomly distributed. In Philadelphia, the map reveals notable clustering in areas including Center City, City Line Avenue, Roxborough, Germantown, and Far Northeast Philadelphia, indicating statistically significant spatial correlations in these regions.
moranSig.plot <- function(df, listw, title) {
local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
df$Pr.z <- local[, "Pr(z != E(Ii))"]
df$pval_category <- cut(df$Pr.z,
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("0.000 - 0.001", "0.001 - 0.010", "0.010 - 0.050", "0.050 - 1.000"),
include.lowest = TRUE)
if (!inherits(df, "sf")) {
df <- st_as_sf(df)
}
ggplot(data = df) +
geom_sf(aes(fill = pval_category), color = NA, alpha = 0.9) +
scale_fill_manual(values = c("0.000 - 0.001" = "#800f2f",
"0.001 - 0.010" = "#ff4d6d",
"0.010 - 0.050" = "#ffccd5",
"0.050 - 1.000" = "grey"),
name = "P-Value") +
labs(title = title) +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 20, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
}
moranSig.plot(localmoran, queenlist, 'Significance Map of Local Moran I')
Philadelphia County Map
Source: United Inspection Agency
To visualize this more clearly, we generated a cluster map to identify different types of clustering patterns. The cluster map derived from the Local Moran’s I analysis classifies census block groups in Philadelphia as high-high, high-low, low-high, low-low, or not significant. This map helps us analyze how housing values are distributed and whether they form significant spatial clusters.
High-High (HH) Clusters: Center City, Roxborough, Germantown, and Far Northeast Philadelphia showed High-High (HH) clusters, meaning areas with high median housing values are surrounded by other high-value areas. These regions typically indicate concentrated economic activity and affluent neighborhoods. These areas maintain high house values due to high demand and, possibly, the presence of amenities and other attractive urban features.
Low-Low (LL) Clusters: City Line Avenue, the northern part of North Philadelphia, the northeastern part of University City, and the northwestern part of South Philadelphia exhibited Low-Low (LL) clusters, where areas with low median housing values are surrounded by other low-value areas. These clusters may indicate economically underdeveloped or lower-income neighborhoods. These regions may require targeted policy interventions to address underlying issues that contribute to lower housing values.
High-Low (HL) Clusters: Three census block groups in South Philadelphia showed High-Low (HL) clusters, where high-value areas are surrounded by low-value areas. These locations represent spatial outliers, potentially indicating newly developed commercial or residential projects in historically low-income or gentrifying neighborhoods.
Low-High (LH) Clusters: Three scattered census blocks in Philadelphia exhibited Low-High (LH) clusters, meaning areas with low median housing values are surrounded by high-value areas. These clusters may correspond to historically preserved districts with strict development regulations or industrial or undeveloped land within high-value urban centers.
Non-Significant Clusters: The remaining areas in Philadelphia, shown in white on the map, were classified as non-significant, meaning the local Moran’s I statistic was not significant there. These regions are scattered throughout the city where house values do not exhibit significant spatial clustering patterns. These areas may have more diverse housing markets or mixed economic characteristics compared to the clustered regions.
hl.plot <- function(df, listw) {
local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
quadrant <- vector(mode = 'numeric', length = nrow(df))
m.prop <- df$LNMEDHVAL - mean(df$LNMEDHVAL)
m.local <- local[, 1] - mean(local[, 1])
signif <- 0.05
quadrant[m.prop > 0 & m.local > 0] <- 1 # high-high
quadrant[m.prop < 0 & m.local < 0] <- 2 # low-low
quadrant[m.prop < 0 & m.local > 0] <- 4 # low-high
quadrant[m.prop > 0 & m.local < 0] <- 3 # high-low
quadrant[local[, 5] > signif] <- 5 # insignificant
df$quadrant <- factor(quadrant, levels = c(1, 3, 5, 2, 4),
labels = c("High-High", "High-Low", "Non-Significant", "Low-Low", "Low-High"))
if (!inherits(df, "sf")) {
df <- st_as_sf(df)
}
ggplot(data = df) +
geom_sf(aes(fill = quadrant), color = "#848884", lwd = 0.07) +
scale_fill_manual(values = c("High-High" = "#800f2f",
"High-Low" = "#ff4d6d",
"Non-Significant" = "white",
"Low-Low" = "#8d99ae",
"Low-High" = "#2b2d42"),
name = "Cluster Type") +
labs(title = "Local Moran's I Cluster Map") +
theme(legend.position="right",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 20, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
}
hl.plot(data, queenlist)
Our previous OLS regression analysis in Assignment 1 found that all predictors in our OLS regression, which include PCTSINGLES, PCTVACANT, LNNBELPOV, and PCTBACHMOR, are statistically significant in explaining LNMEDHVAL. The model accounts for approximately 66.2% of the variance, as indicated by the R-squared value of 0.662. (Detailed interpretation can be found in our previous report.)
##
## Call:
## lm(formula = LNMEDHVAL ~ LNNBELPOV + PCTBACHMOR + PCTSINGLES +
## PCTVACANT, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2582 -0.2039 0.0382 0.2174 2.2434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.113778 0.046532 238.84 < 0.0000000000000002 ***
## LNNBELPOV -0.078903 0.008457 -9.33 < 0.0000000000000002 ***
## PCTBACHMOR 0.020910 0.000543 38.49 < 0.0000000000000002 ***
## PCTSINGLES 0.002977 0.000703 4.23 0.000024 ***
## PCTVACANT -0.019156 0.000978 -19.59 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.366 on 1715 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.662
## F-statistic: 841 on 4 and 1715 DF, p-value: <0.0000000000000002
In addition to the assumptions we examined previously, we further test two assumptions with formal statistical tests: homoscedasticity and normality of errors.
To test for homoscedasticity, which assumes that the variance of the errors (residuals) remains constant across all levels of the independent variables, we conducted the Breusch-Pagan, Koenker-Bassett, and White tests. If the p-values from these tests are below the significance level (0.05), this suggests the presence of heteroscedasticity, meaning the variance of errors is not constant. If the p-values are high, we fail to reject the null hypothesis of homoscedasticity.
The p-value of the Breusch-Pagan test is less than 0.05, indicating strong evidence of heteroscedasticity.
# Breusch-Pagan test
reg<-lm(LNMEDHVAL ~ LNNBELPOV+PCTBACHMOR+PCTSINGLES+PCTVACANT, data=data)
bptest(reg, studentize=FALSE)
##
## Breusch-Pagan test
##
## data: reg
## BP = 113, df = 4, p-value <0.0000000000000002
The p-value of the Koenker-Bassett test is also less than 0.05, further supporting the presence of heteroscedasticity.
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 43, df = 4, p-value = 0.00000001
The reported p-value of the White test is essentially 0, far below 0.05, again supporting the presence of heteroscedasticity.
## White's test results
##
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 43.94
## P-value: 0
Since all three tests indicate statistically significant results, we conclude that heteroscedasticity is present in the OLS Regression. This suggests that the variance of residuals is not constant, which may impact the efficiency of OLS estimates. To address this issue, spatial analysis techniques such as spatial lag models may be necessary to account for spatial dependence and improve model robustness.
However, the scatter plot of standardized residuals against fitted values, like the one from our previous assignment, shows a relatively consistent spread of residuals across fitted values, which on its own would suggest homoscedasticity. The Breusch-Pagan, Koenker-Bassett, and White tests are simply more sensitive than a visual inspection: with a sample of this size, even small deviations from homoscedasticity are detected as statistically significant.
fitted_values <- fitted(reg)
residuals_values <- residuals(reg)
standardized_residuals <- rstandard(reg)
resnb<-sapply(queen, function(x) mean(standardized_residuals[x]))
data <- data %>%
mutate(
Fitted = fitted_values,
Residuals = residuals_values,
Standardized_Residuals = standardized_residuals,
Residuals_NB = resnb)
ggplot(data, aes(x = Fitted, y = Standardized_Residuals)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "#c44536", size = 1) + #
labs(
title = "Scatter Plot of Standardized Residuals vs Fitted Values",
x = "Fitted Values",
y = "Standardized Residuals"
) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
Normality of errors assumes that residuals follow a normal distribution, a crucial requirement for valid hypothesis testing and confidence intervals. We use the Jarque-Bera test to assess this assumption. If the p-value from the test is below the significance level (0.05), the residuals deviate significantly from normality, indicating a violation of the assumption. If the p-value is high, we fail to reject the null hypothesis that the residuals are normally distributed.
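The Jarque-Bera statistic combines sample skewness S and kurtosis K of the residuals: JB = n/6 (S² + (K − 3)²/4), which is chi-squared with 2 degrees of freedom under normality. A minimal sketch of the computation (illustrative only; the report uses R's `jarque.bera.test()`):

```python
def jarque_bera(resid):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), S = skewness, K = kurtosis."""
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((r - m) ** 2 for r in resid) / n  # central moments
    m3 = sum((r - m) ** 3 for r in resid) / n
    m4 = sum((r - m) ** 4 for r in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# A perfectly symmetric toy sample: the skewness term vanishes,
# leaving only the kurtosis deviation from 3
jb = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
```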
In this case, the p-value is less than 0.05, providing strong evidence against normality. This suggests that the residuals do not follow a normal distribution, which may impact the validity of standard inferential procedures.
##
## Jarque Bera Test
##
## data: reg$residuals
## X-squared = 779, df = 2, p-value <0.0000000000000002
However, the histogram of residuals from the previous OLS regression suggests a roughly bell-shaped distribution centered around zero, as noted in Assignment 1. This visual check indicates that the residuals are approximately normal even though the formal test rejects normality; the Jarque-Bera test is very sensitive to minor deviations from the normal distribution, particularly in large samples. To address this potential issue, alternative methods, such as incorporating spatial weights, may be necessary to account for spatial dependencies and improve the model's accuracy.
#histogram of residuals
residual_data <- data.frame(residuals = residuals(reg))
ggplot(residual_data, aes(x = residuals)) +
geom_histogram(bins = 100, fill = "black") +
labs(title = "Histogram of OLS Residuals",
x = "Residuals",
y = "Frequency") +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
To further examine the spatial distribution of residuals, we first generate standardized residuals, which are OLS Model residuals divided by an estimate of their standard deviation. We map these residuals to visually assess whether there are any spatial patterns.
The map reveals clusters of residual values, such as a high-value cluster in Far Northeast Philadelphia, suggesting spatial concentration rather than a random distribution. This observation calls for formally testing for spatial autocorrelation in the residuals.
# g. OLS residuals plotted
data$OLS_RESIDU<-rstandard(reg)
data$WT_RESIDU<-sapply(queen, function(x) mean(data$OLS_RESIDU[x]))
quant_breaks <- classIntervals(data$OLS_RESIDU, n = 5, style = "quantile")$brks
custom_labels <- c()
for(i in seq_len(length(quant_breaks) - 1)) {
custom_labels[i] <- paste0(
round(quant_breaks[i], 2),
" to ",
round(quant_breaks[i + 1], 2)
)
}
data$OLS_RESIDU_quant <- cut(
data$OLS_RESIDU,
breaks = quant_breaks,
include.lowest = TRUE,
labels = custom_labels
)
ggplot(data) +
geom_sf(aes(fill = OLS_RESIDU_quant), color=NA) +
scale_fill_brewer(palette = "Reds", name = "Standardized OLS Residuals") +
labs(title = "Standardised OLS Residuals") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 20, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8))
We further test the presence of spatial autocorrelation in two ways: 1) by regressing residuals on their queen neighbors, and 2) by looking at the Moran’s I of the residuals.
Regressing Residuals on Spatial Lag
To assess spatial autocorrelation, we use the Queen contiguity matrix as the spatial weights matrix to define spatial relationships between residuals. We begin by creating a scatter plot comparing the OLS residuals (OLS_RESIDU) to their spatial lag (WT_RESIDU).
The scatter plot shows a positive linear relationship between the residuals and their spatial lag, suggesting spatial dependence in the residuals. The slope (b) of the regression line is the beta coefficient of the lagged residuals, found here to be 0.732; it measures the strength of this spatial relationship but is not itself a formal significance test.
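The spatial lag of each residual here is simply the mean of its queen neighbors' residuals, and b is an ordinary regression slope of residuals on that lag. A toy sketch, assuming a hypothetical chain of four tracts where each tract neighbors the next:

```python
def neighbor_mean_lag(nb, values):
    """Spatial lag of each value = mean of its neighbors' values."""
    return [sum(values[j] for j in neigh) / len(neigh) for neigh in nb]

def slope(x, y):
    """OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
           sum((xi - mx) ** 2 for xi in x)

# Hypothetical 4 tracts in a chain: 0-1-2-3
nb = [[1], [0, 2], [1, 3], [2]]
resid = [1.0, 1.0, -1.0, -1.0]      # spatially clustered residuals
lag = neighbor_mean_lag(nb, resid)  # [1.0, 0.0, 0.0, -1.0]
b = slope(lag, resid)               # positive slope: spatial dependence
```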
# scatterplot of OLS_RESIDU by WT_RESIDU
ggplot(data, aes(x = WT_RESIDU, y = OLS_RESIDU)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "#c44536", size = 1) +
labs(title = "Residuals vs. Nearest Neighbor Residuals",
x = "Nearest Neighbor Residuals",
y = "Standardized Residuals") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
Moran’s I
We also calculate the Global Moran's I of the OLS residuals and perform 999 permutations to test its statistical significance. The Global Moran's I is 0.312, far above the values generated by the random permutations shown in the histogram, indicating spatial autocorrelation in the residuals. The test's p-value is less than 0.05, so we reject the null hypothesis of no spatial autocorrelation and conclude that the spatial autocorrelation in the residuals is statistically significant.
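Under binary contiguity weights, Moran's I is (n/S0) Σᵢⱼ wᵢⱼ zᵢ zⱼ / Σᵢ zᵢ², where z are deviations from the mean and S0 is the total number of links; the permutation test reshuffles the values over the map and asks how often a reshuffle produces an I at least as large as observed. A sketch of both steps on the same hypothetical 4-tract chain (the report uses `moran.mc()` from spdep; note that a map this tiny has essentially no statistical power):

```python
import random

def morans_i(nb, values):
    """Global Moran's I with binary contiguity weights."""
    n = len(values)
    m = sum(values) / n
    z = [v - m for v in values]
    num = sum(z[i] * z[j] for i, neigh in enumerate(nb) for j in neigh)
    s0 = sum(len(neigh) for neigh in nb)  # total number of directed links
    return (n / s0) * num / sum(zi * zi for zi in z)

def moran_mc(nb, values, nsim=999, seed=42):
    """Pseudo p-value from random permutations of the values over locations."""
    rng = random.Random(seed)
    observed = morans_i(nb, values)
    greater = sum(1 for _ in range(nsim)
                  if morans_i(nb, rng.sample(values, len(values))) >= observed)
    return observed, (greater + 1) / (nsim + 1)

nb = [[1], [0, 2], [1, 3], [2]]  # hypothetical 4-tract chain
obs, p = moran_mc(nb, [1.0, 1.0, -1.0, -1.0])
```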
# h. Moran’s I of the OLS regression residuals
#Regressing residuals on their nearest neighbors.
moran(data$OLS_RESIDU, queenlist, n=length(queenlist$neighbours), S0=Szero(queenlist))$`I`
## [1] 0.312
moranMC<-moran.mc(data$OLS_RESIDU, queenlist, nsim=999, alternative="two.sided") #We use 999 permutations
moranMC
##
## Monte-Carlo simulation of Moran I
##
## data: data$OLS_RESIDU
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.3, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided
ggplot(data.frame(res = moranMC$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept = moranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Global Moran's I",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
The scatter plot also shows a positive linear relationship between the residuals (OLS_RESIDU) and their spatial lag, indicating positive spatial autocorrelation: areas with high residuals are surrounded by other areas with high residuals. These results support the findings of the Global Moran's I analysis, which identified strong positive spatial autocorrelation in the OLS residuals.
ggplot(data = data.frame(
OLS_RESIDU = data$OLS_RESIDU,
spatial_lag = lag.listw(queenlist, data$OLS_RESIDU)
), aes(x = OLS_RESIDU, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.7, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Global Moran's I Scatter Plot",
x = "OLS_RESIDU",
y = "Spatial Lag of OLS_RESIDU") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
Both the Moran's I statistic and the positive beta coefficient from the regression of standardized residuals on their neighbors tell a similar story: there is significant spatial autocorrelation in the residuals, meaning the errors are not randomly distributed across space. This violates the OLS assumption of independent errors and can lead to biased standard errors, inefficient estimates, and misleading statistical inferences. Spatial models, such as spatial lag, spatial error, or Geographically Weighted Regression (GWR), are more appropriate because they account for spatial autocorrelation and spatial heterogeneity in the data.
The spatial lag model adds the spatial lag of the dependent variable (log-transformed median house value) as an additional predictor, \(\rho W \times \text{LNMEDHVAL}\), to account for spatial autocorrelation. Here \(\rho\) is the coefficient of the spatial lag of the dependent variable (LNMEDHVAL) and \(W\) is the spatial weights matrix.
As shown in the spatial lag regression results below, the p-value of the spatial lag coefficient is less than 0.001, indicating that the spatial lag of the dependent variable is statistically significant. The coefficient of the spatial lag of LNMEDHVAL is 0.651, meaning that a 1% increase in the median home value of neighboring areas (as defined by Queen contiguity) is associated with an approximate 0.65% increase in the focal area's median home value, holding other predictors constant. Because \(\rho > 0\), this also points to positive spatial autocorrelation in the dependent variable, with high-value areas surrounded by other high-value areas.
All other predictors in the spatial lag model, including PCTVACANT, PCTSINGLES, PCTBACHMOR, and LNNBELPOV, are also statistically significant.
PCTVACANT has a coefficient of -0.0085 and is highly significant (p < 0.001), indicating that a higher vacancy rate is associated with lower median house values. PCTSINGLES has a coefficient of 0.00203 (p < 0.001), suggesting that a higher percentage of detached single-family houses is associated with higher median house values. PCTBACHMOR has a coefficient of 0.00851 (p < 0.001), indicating that a higher percentage of residents with a bachelor's degree or higher is associated with higher median house values. LNNBELPOV has a coefficient of -0.0341 (p < 0.001), suggesting that more residents living below the poverty line is associated with lower median house values.
All predictors are also significant in the OLS model, but with larger coefficients in absolute value, meaning the OLS model attributes a larger effect to each predictor on the dependent variable (LNMEDHVAL). For example, the coefficient of PCTVACANT is -0.0192 in the OLS model but -0.0085 in the spatial lag model. This suggests that the OLS model overestimates the influence of these predictors because of omitted spatial autocorrelation.
sl <- lagsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=data, queenlist)
summary(sl)
##
## Call:lagsarlm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = data, listw = queenlist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.655421 -0.117248 0.018654 0.133126 1.726436
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.89845489 0.20111357 19.3843 < 0.00000000000000022
## PCTVACANT -0.00852940 0.00074367 -11.4694 < 0.00000000000000022
## PCTSINGLES 0.00203342 0.00051577 3.9425 0.00008063503
## PCTBACHMOR 0.00851381 0.00052193 16.3120 < 0.00000000000000022
## LNNBELPOV -0.03405466 0.00629287 -5.4116 0.00000006246
##
## Rho: 0.651, LR test value: 912, p-value: < 0.000000000000000222
## Asymptotic standard error: 0.018
## z-value: 36.1, p-value: < 0.000000000000000222
## Wald statistic: 1301, p-value: < 0.000000000000000222
##
## Log likelihood: -256 for lag model
## ML residual variance (sigma squared): 0.0719, (sigma: 0.268)
## Number of observations: 1720
## Number of parameters estimated: 7
## AIC: 525, (AIC for lm: 1435)
## LM test for residual autocorrelation
## test value: 67.7, p-value: 0.00000000000000022204
We ran the Breusch-Pagan test to assess whether the residuals of the spatial lag regression model exhibit heteroscedasticity. The test statistic for the spatial lag regression is BP = 211 with 4 degrees of freedom and a p-value < 0.0001. Given the extremely low p-value, we reject the null hypothesis of homoscedasticity, indicating that the residuals of the spatial lag model are still heteroscedastic.
##
## Breusch-Pagan test
##
## data:
## BP = 211, df = 4, p-value <0.0000000000000002
We also compared the AIC, Schwarz Criterion, and Log Likelihood of the OLS regression and the spatial lag regression models. The results show that the spatial lag model provides a better fit compared to the OLS model.
aic_ols <- AIC(reg)
aic_sl <- AIC(sl)
# Schwarz Criterion
bic_ols <- BIC(reg)
bic_sl <- BIC(sl)
# The Log Likelihood
loglik_ols <- logLik(reg)
loglik_sl <- logLik(sl)
results <- data.frame(
Model = c("OLS Regression", "Spatial Lag Regression"),
AIC = c(aic_ols, aic_sl),
Schwarz_Criterion = c(bic_ols, bic_sl),
Log_Likelihood = c(loglik_ols, loglik_sl)
)
results %>%
kable(row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Model | AIC | Schwarz_Criterion | Log_Likelihood |
---|---|---|---|
OLS Regression | 1435 | 1468 | -711 |
Spatial Lag Regression | 525 | 564 | -256 |
We also conducted a Likelihood Ratio Test to assess the improvement in model fit between the OLS and spatial lag models. The test yields LR = 912 with a p-value of less than 0.0001, indicating that the spatial lag model fits significantly better than the OLS model.
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 912, df = 1, p-value <0.0000000000000002
## sample estimates:
## Log likelihood of sl Log likelihood of reg
## -256 -711
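The LR statistic is just twice the gap in log-likelihoods, referred to a chi-squared distribution with one degree of freedom (one extra parameter, \(\rho\)). A sketch using the rounded log-likelihoods reported above (the unrounded values give the reported LR = 912):

```python
import math

def likelihood_ratio(loglik_full, loglik_restricted):
    """LR = 2 * (logL_full - logL_restricted)."""
    return 2.0 * (loglik_full - loglik_restricted)

def chi2_sf_df1(x):
    """Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Rounded log-likelihoods: -256 (spatial lag) vs. -711 (OLS)
lr = likelihood_ratio(-256.0, -711.0)  # 910 with these rounded inputs
p = chi2_sf_df1(lr)                    # vanishingly small
```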
Finally, we examined Moran's I for the spatial lag model's residuals. The observed Moran's I of -0.08 suggests slight negative spatial autocorrelation in the residuals. The p-value of the Moran's I test is 0.002, so we reject the null hypothesis of no spatial autocorrelation; some statistically significant spatial structure remains. However, the Moran's I value is close to zero, especially compared with the OLS residuals, which suggests that the spatial lag model has effectively addressed most of the spatial dependence in the data.
##
## Monte-Carlo simulation of Moran I
##
## data: sl$residuals
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = -0.08, observed rank = 1, p-value = 0.002
## alternative hypothesis: two.sided
The Moran's I permutation histogram and the residual scatter plot also support the conclusion that the spatial lag model has addressed much of the spatial autocorrelation present in the OLS residuals. In the histogram, the observed Moran's I falls close to the permuted distribution centered near zero, and the scatter plot shows only a weak relationship between the residuals and their spatial lag. Together, these results indicate that the spatial lag model fits the data better than the OLS model.
ggplot(data.frame(res = sl_moranMc$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept =sl_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of Spatial Lag Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
ggplot(data = data.frame(
residuals =sl$residuals,
spatial_lag = lag.listw(queenlist, sl$residuals)
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for Spatial Lag Residuals",
x = "Spatial Lag Model Residuals",
y = "Spatial Lag of Residuals") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
The spatial error model accounts for spatial autocorrelation in the error term itself: \(\epsilon = \lambda W \epsilon + u\), where \(\lambda\) is the spatial autoregressive coefficient of the errors, \(W\) is the spatial weights matrix, and \(u\) is uncorrelated random noise.
In this case, \(\lambda\) is estimated at 0.815 and is statistically significant with a p-value of less than 0.001. This indicates that the spatial structure in the errors is a significant component of the model, and that there is strong spatial autocorrelation in the error terms.
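The key contrast with the lag model is that the spatial multiplier applies only to the disturbance: \(y = X\beta + (I - \lambda W)^{-1}u\), leaving the systematic part untouched. A sketch in the same style, with a hypothetical 3-region row-standardized weights matrix (the report estimates the model with `errorsarlm()`):

```python
def solve_spatial_error(W, xb, u, lam, iters=200):
    """Spatial error process: y = xb + err, where err = lam*W*err + u.
    The autocorrelated error err is found by fixed-point iteration
    (requires |lam| < 1 with row-standardized W); xb is untouched."""
    n = len(xb)
    err = list(u)
    for _ in range(n and iters):
        err = [lam * sum(W[i][j] * err[j] for j in range(n)) + u[i]
               for i in range(n)]
    return [xb[i] + err[i] for i in range(n)], err

W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
xb = [1.0, 2.0, 3.0]     # systematic part X*beta
u = [0.1, -0.2, 0.1]     # independent random noise
lam = 0.815              # the lambda estimated in the report
y, err = solve_spatial_error(W, xb, u, lam)
```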
Besides, all other predictors in the spatial error model, including PCTVACANT, PCTSINGLES, PCTBACHMOR, and LNNBELPOV, are also statistically significant.
PCTVACANT has a coefficient of -0.0057 and is highly significant (p < 0.001), indicating that a higher vacancy rate is associated with lower median house values. PCTSINGLES has a coefficient of 0.0027 (p < 0.001), suggesting that a higher percentage of detached single-family houses is positively associated with median house values. PCTBACHMOR has a coefficient of 0.0098 (p < 0.001), indicating that a higher percentage of residents with a bachelor's degree or higher is associated with higher median house values. LNNBELPOV has a coefficient of -0.0345 (p < 0.001), suggesting that more residents living below the poverty line is negatively associated with median house values.
As in the spatial lag regression, the coefficients of these predictors are smaller in absolute value than in the OLS model. For example, the coefficient of PCTVACANT is -0.0192 in the OLS model, while it is -0.0057 in the spatial error model. This suggests that the OLS model overestimates the influence of these predictors due to omitted spatial autocorrelation.
se <- errorsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=data, queenlist)
summary(se)
##
## Call:errorsarlm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = data, listw = queenlist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.926477 -0.115408 0.014889 0.133852 1.948664
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.90643427 0.05346777 203.9815 < 0.00000000000000022
## PCTVACANT -0.00578308 0.00088670 -6.5220 0.00000000006937
## PCTSINGLES 0.00267792 0.00062083 4.3134 0.00001607389089
## PCTBACHMOR 0.00981293 0.00072896 13.4615 < 0.00000000000000022
## LNNBELPOV -0.03453409 0.00708933 -4.8713 0.00000110881162
##
## Lambda: 0.815, LR test value: 678, p-value: < 0.000000000000000222
## Asymptotic standard error: 0.0164
## z-value: 49.8, p-value: < 0.000000000000000222
## Wald statistic: 2477, p-value: < 0.000000000000000222
##
## Log likelihood: -373 for error model
## ML residual variance (sigma squared): 0.0766, (sigma: 0.277)
## Number of observations: 1720
## Number of parameters estimated: 7
## AIC: 759, (AIC for lm: 1435)
Based on the Breusch-Pagan test, the residuals of the spatial error model are also heteroscedastic. The test statistic is BP = 23 with 4 degrees of freedom and a p-value of about 0.0001. Since the p-value is well below 0.05, we reject the null hypothesis of homoscedasticity, indicating that the residuals of the spatial error model are heteroscedastic. However, as noted above, the Breusch-Pagan test is very sensitive to minor deviations from homoscedasticity.
##
## Breusch-Pagan test
##
## data:
## BP = 23, df = 4, p-value = 0.0001
We also compared the AIC, Schwarz Criterion, and Log Likelihood of the OLS regression, spatial lag regression, and spatial error regression models. The results show that the spatial error model provides a better fit compared to the OLS model, but not as good as the spatial lag model, as indicated by the larger AIC and Schwarz Criterion values than the spatial lag model but smaller than the OLS model. The log likelihood of the spatial error model is also higher than the OLS model, suggesting that the spatial error model explains more variance in the dependent variable.
aic_se <- AIC(se)
# Schwarz Criterion
bic_se <- BIC(se)
# The Log Likelihood
loglik_se <- logLik(se)
results <- data.frame(
Model = c("OLS Regression", "Spatial Lag Regression", "Spatial Error Regression"),
AIC = c(aic_ols, aic_sl, aic_se),
Schwarz_Criterion = c(bic_ols, bic_sl, bic_se),
Log_Likelihood = c(loglik_ols, loglik_sl, loglik_se)
)
results %>%
kable(row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Model | AIC | Schwarz_Criterion | Log_Likelihood |
---|---|---|---|
OLS Regression | 1435 | 1468 | -711 |
Spatial Lag Regression | 525 | 564 | -256 |
Spatial Error Regression | 759 | 798 | -373 |
The Likelihood Ratio Test also indicates that the spatial error model is a significantly better fit than the OLS model, with LR = 678 and a p-value less than 0.0001. This improvement is nonetheless smaller than the corresponding improvement for the spatial lag model (LR = 912).
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 678, df = 1, p-value <0.0000000000000002
## sample estimates:
## Log likelihood of se Log likelihood of reg
## -373 -711
Finally, we examined Moran's I for the spatial error model's residuals. The observed Moran's I of -0.09 suggests slight negative spatial autocorrelation in the residuals. The p-value of the Moran's I test is 0.002, so we reject the null hypothesis of no spatial autocorrelation. However, the Moran's I value is close to zero, especially compared with the OLS residuals, which suggests that the spatial error model has effectively addressed most of the spatial dependence in the data.
##
## Monte-Carlo simulation of Moran I
##
## data: residuals(se)
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = -0.09, observed rank = 1, p-value = 0.002
## alternative hypothesis: two.sided
The Moran's I permutation histogram and the residual scatter plot likewise support the conclusion that the spatial error model has addressed much of the spatial autocorrelation present in the OLS residuals. The observed Moran's I falls close to the permuted distribution centered near zero, and the scatter plot shows only a weak relationship between the residuals and their spatial lag. These results indicate that the spatial error model fits the data better than the OLS model.
ggplot(data.frame(res = se_moranMc$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept =se_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of SE Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
ggplot(data = data.frame(
residuals = residuals(se),
spatial_lag = lag.listw(queenlist, residuals(se))
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for SE Residuals",
x = "Spatial Error Model Residuals",
y = "Spatial Lag of Residuals") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
Geographically Weighted Regression (GWR) is a spatial statistical method that allows regression coefficients to vary across space. This flexibility enables the model to better account for spatial heterogeneity, offering localized insights into the relationships between the predictors and the dependent variable LNMEDHVAL.
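At each focal location, GWR fits a weighted regression in which nearby observations receive large kernel weights (here Gaussian, as in `gwr.Gauss`) and distant ones receive small weights, so the fitted coefficient varies over space. A one-dimensional toy sketch of this local weighting, not the spgwr implementation, using a curve whose slope genuinely varies along the axis:

```python
import math

def local_slope(x, y, x0, bandwidth):
    """Weighted least-squares slope at focal point x0 with a Gaussian kernel."""
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return num / den

# A relationship whose strength varies over 'space': y = x^2,
# so the local slope near x0 is roughly 2*x0
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]
b_at_2 = local_slope(x, y, 2.0, 1.0)  # local coefficient near x0 = 2
b_at_4 = local_slope(x, y, 4.0, 1.0)  # larger local coefficient near x0 = 4
```

The same mechanism, applied in two dimensions with an adaptive bandwidth, produces the spatially varying coefficient surfaces mapped later in the report.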
# Run adaptive bandwidth selection
bw_gwr <- gwr.sel(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
data = as(data, "Spatial"),
method = "aic",
adapt = TRUE)
#Run GWR
model_gwr <- gwr(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
data = as(data, "Spatial"),
gweight=gwr.Gauss,
adapt = bw_gwr,
hatmatrix = TRUE, se.fit = TRUE)
# Extract results
results_gwr <- model_gwr$SDF
## Call:
## gwr(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = as(data, "Spatial"), gweight = gwr.Gauss,
## adapt = bw_gwr, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.00813 (about 13 of 1720 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. 9.67276 10.71432 10.95424 11.17420 12.08314 11.11
## PCTVACANT -0.03174 -0.01424 -0.00896 -0.00358 0.01679 -0.02
## PCTSINGLES -0.02497 -0.00755 -0.00166 0.00423 0.01433 0.00
## PCTBACHMOR 0.00110 0.01014 0.01493 0.02022 0.03473 0.02
## LNNBELPOV -0.23652 -0.07336 -0.04012 -0.01267 0.09488 -0.08
## Number of data points: 1720
## Effective number of parameters (residual: 2traceS - traceS'S): 361
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1359
## Sigma (residual: 2traceS - traceS'S): 0.276
## Effective number of parameters (model: traceS): 258
## Effective degrees of freedom (model: traceS): 1462
## Sigma (model: traceS): 0.266
## Sigma (ML): 0.246
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 661
## AIC (GWR p. 96, eq. 4.22): 309
## Residual sum of squares: 104
## Quasi-global R2: 0.848
We begin by comparing the R-squared of the GWR model with that of the OLS model. The R-squared for the OLS regression is 0.662, while the quasi-global R-squared for the GWR model is 0.848. The higher value indicates that the GWR model explains more of the variance in the dependent variable (LNMEDHVAL) than the OLS model. This suggests that the spatial variability captured by the GWR model provides a more accurate representation of the relationships between the predictors and the dependent variable, since GWR accounts for spatial heterogeneity.
# Compare R-squared
rss_gwr <- model_gwr$results$rss
mean_y <- mean(data$LNMEDHVAL)
tss <- sum((data$LNMEDHVAL - mean_y)^2)
r2_gwr <- 1 - (rss_gwr / tss)
r2_ols <- summary(reg)$r.squared
r2_compare <- data.frame(
Model = c("OLS", "GWR"),
R_squared = c(r2_ols, r2_gwr))
r2_compare %>%
kable(row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Model | R_squared |
---|---|
OLS | 0.662 |
GWR | 0.848 |
Next, we compare model fit using Akaike Information Criteria (AIC and AICc). AIC (Akaike Information Criterion) measures the relative quality of a statistical model, balancing goodness of fit and complexity.
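AIC and the Schwarz Criterion (BIC) are both penalized log-likelihoods: AIC = 2k − 2 logL and BIC = k ln(n) − 2 logL, where k is the number of estimated parameters and n the sample size. A sketch using the rounded OLS values reported above (4 slopes, an intercept, and the error variance give k = 6; the report's 1435 comes from the unrounded log-likelihood):

```python
import math

def aic(k, loglik):
    """Akaike Information Criterion: 2k - 2*logL."""
    return 2 * k - 2 * loglik

def bic(k, n, loglik):
    """Schwarz Criterion: k*ln(n) - 2*logL."""
    return k * math.log(n) - 2 * loglik

# OLS model: k = 6 parameters, logLik ~ -711 (rounded), n = 1720
aic_ols = aic(6, -711.0)        # ~1434 with the rounded log-likelihood
bic_ols = bic(6, 1720, -711.0)  # the penalty grows with ln(n)
```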
Based on the Akaike Information Criterion values, the GWR model has an AIC of 309, substantially lower than the AIC of the OLS model (1435), the spatial lag model (525), and the spatial error model (759). Since a lower AIC indicates a better trade-off between fit and complexity, the GWR model provides the best fit to the data among the four models, capturing the spatial variability and heterogeneity in the data most effectively.
# AIC reported in the GWR summary above
aic_gwr <- 309
model_names <- c("OLS", "Spatial Lag", "Spatial Error", "GWR")
aic_vals <- c(aic_ols, aic_sl, aic_se, aic_gwr)
aic_table <- data.frame(
Model = model_names,
AIC = c(
format(aic_ols, digits = 3),
format(aic_sl, digits = 3),
format(aic_se, digits = 3),
format(aic_gwr, digits = 3)
)
)
aic_table %>%
kable(row.names = FALSE, caption = "Comparison of AIC Values Across Models") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Model | AIC |
---|---|
OLS | 1435 |
Spatial Lag | 525 |
Spatial Error | 759 |
GWR | 309 |
Next, we assess the spatial autocorrelation of the GWR residuals. Specifically, we compare Moran’s I values across all models to determine whether GWR effectively reduces spatial autocorrelation.
The Moran’s I for GWR residuals is approximately 0.03 with a p-value less than 0.05, indicating a significant reduction in spatial autocorrelation compared to the OLS residuals (Moran’s I ≈ 0.31). This value is also lower than the Moran’s I values for the Spatial Lag and Spatial Error residuals, suggesting GWR is more effective at accounting for spatial dependence.
##
## Monte-Carlo simulation of Moran I
##
## data: results_gwr$gwr.e
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.03, observed rank = 990, p-value = 0.02
## alternative hypothesis: two.sided
The histogram and scatterplot demonstrate that the GWR model substantially reduces spatial autocorrelation in the residuals compared to the other models, as indicated by the nearly horizontal trend line in the scatterplot. This is consistent with the histogram, where the observed Moran’s I is close to zero, indicating that the model effectively captures local spatial variation in log-transformed median house values.
residuals_gwr <- results_gwr$gwr.e
# Histogram of the permuted Moran's I values from the Monte-Carlo test,
# with the observed statistic marked in red
ggplot(data.frame(moran = GWR_moranMc$res), aes(x = moran)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept = GWR_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of GWR Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
ggplot(data = data.frame(
residuals = residuals_gwr,
spatial_lag = lag.listw(queenlist, residuals_gwr)
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for GWR Residuals",
x = "GWR Residuals",
y = "Spatial Lag of Residuals") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
While the GWR model improves overall fit relative to the global models, it is equally important to assess how predictor effects and model performance vary across space. This section presents maps of t-values (coefficient divided by standard error) for each predictor and a choropleth of local R² values. These visualizations help identify where predictors are locally significant and how well the model fits in different neighborhoods.
We begin by calculating the t-values for each coefficient. These t-value maps reveal substantial geographic variation in predictor significance.
PCTVACANT generally has a negative impact on housing values, with significant negative effects observed in Center City and parts of North and West Philadelphia; however, there is a localized area in southern Center City where the vacancy rate has a positive impact on housing values. PCTSINGLES positively impacts median housing values, particularly in the northeast and northwest areas of Philadelphia, although in some parts of the city, such as eastern Philadelphia and parts of the west and south, the percentage of single-family housing has a negative impact. PCTBACHMOR has a positive impact on housing values across the entire city, with no areas showing a negative effect. LNNBELPOV generally has a negative effect on median housing values, and this effect is more significant in the southern, eastern, and northeastern parts of Philadelphia. Overall, these spatial variations in predictor significance highlight the importance of considering local effects when modeling housing values in Philadelphia.
# Local t-values: each local coefficient divided by its local standard error
data$t_PCTVACANT <- results_gwr$PCTVACANT / results_gwr$PCTVACANT_se
data$t_PCTSINGLES <- results_gwr$PCTSINGLES / results_gwr$PCTSINGLES_se
data$t_PCTBACHMOR <- results_gwr$PCTBACHMOR / results_gwr$PCTBACHMOR_se
data$t_LNNBELPOV <- results_gwr$LNNBELPOV / results_gwr$LNNBELPOV_se
data$t_PCTVACANT_cat <- cut(data$t_PCTVACANT,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
data$t_PCTSINGLES_cat <- cut(data$t_PCTSINGLES,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
data$t_PCTBACHMOR_cat <- cut(data$t_PCTBACHMOR,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
data$t_LNNBELPOV_cat <- cut(data$t_LNNBELPOV,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
p1 <- ggplot(data) +
geom_sf(aes(fill = t_PCTVACANT_cat), color = NA) +
scale_fill_manual(
values = c("#2b2d42", "#8d99ae", "#ff4d6d", "#800f2f"),
name = "Ratio Category"
) +
labs(title = "PCTVACANT Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p2 <- ggplot(data) +
geom_sf(aes(fill = t_PCTSINGLES_cat), color = NA) +
scale_fill_manual(
values = c("#2b2d42", "#8d99ae", "#ff4d6d", "#800f2f"),
name = "Ratio Category"
) +
labs(title = "PCTSINGLES Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p3 <- ggplot(data) +
geom_sf(aes(fill = t_PCTBACHMOR_cat), color = NA) +
scale_fill_manual(
values = c("#ff4d6d", "#800f2f"),
name = "Ratio Category"
) +
labs(title = "PCTBACHMOR Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p4 <- ggplot(data) +
geom_sf(aes(fill = t_LNNBELPOV_cat), color = NA) +
scale_fill_manual(
values = c("#2b2d42", "#8d99ae", "#ff4d6d"),
name = "Ratio Category"
) +
labs(title = "LNNBELPOV Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p1 + p2 + p3 + p4 +
plot_layout(ncol = 2)
Next, we explore model fit at the local level by visualizing the local R² values from the GWR model. The choropleth map of local R² illustrates the spatial variation in how well the GWR model explains median house values across Philadelphia. Most neighborhoods fall within a moderate to high explanatory range, with local R² values commonly between 0.59 and 0.74. Particularly strong model performance is observed in South Philadelphia, Roxborough–Manayunk, and portions of the Far Northeast, where local R² values exceed 0.74. These areas exhibit stronger spatial relationships between the predictors and housing values. In contrast, several central and north-central tracts, especially around Center City and North Philadelphia, show lower local R² values (as low as 0.17), suggesting the model does not capture as much of the local variation in these areas. This could be due to the influence of unobserved neighborhood characteristics, development dynamics, or non-linear relationships not accounted for by the current predictors.
data$localR2 <- results_gwr$localR2
breaks <- classInt::classIntervals(data$localR2, n = 5, style = "quantile")$brks
labels <- sapply(1:(length(breaks) - 1), function(i) {
paste0(round(breaks[i], 2), " to ", round(breaks[i + 1], 2))
})
data$localR2_cat <- cut(data$localR2, breaks = breaks, include.lowest = TRUE, labels = labels)
ggplot(data) +
geom_sf(aes(fill = localR2_cat), color = NA) +
scale_fill_brewer(palette = "Reds", name = "Local R²") +
labs(title = "Local R² Values from GWR") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
Together, these spatial diagnostics confirm that the relationships between predictors and median house values are not constant across space, further supporting the case for using GWR over global regression models.
In this report, we conducted a comparative analysis of four regression models for predicting median house values across Philadelphia: Ordinary Least Squares (OLS), Spatial Lag, Spatial Error, and Geographically Weighted Regression (GWR). While OLS provided a useful baseline, we found significant spatial autocorrelation in its residuals (Moran’s I ≈ 0.31), indicating a violation of model assumptions. Both the Spatial Lag and Spatial Error models improved upon OLS by accounting for spatial dependencies, and the Likelihood Ratio tests confirmed that these models were statistically superior to OLS. However, GWR outperformed all other models, achieving the highest R² (≈ 0.848) and the lowest residual spatial autocorrelation (Moran’s I ≈ 0.03). Moreover, it provided localized insight into how each predictor’s influence varied across neighborhoods, revealing nuanced spatial patterns that global models could not detect. Therefore, GWR was the best model overall for explaining the spatial variation in log-transformed median housing values in Philadelphia.
While GWR offers several advantages, such as capturing spatial heterogeneity and providing localized insight, it also has limitations. Most notably, many of the assumptions of OLS regression still apply in GWR, including linearity, homoscedasticity, and normality of residuals. For example, from our previous assignment, we know that there is multicollinearity among the predictors: higher educational attainment is associated with lower poverty rates, and our current GWR model does not account for that. We also know that some of our predictors, such as PCTSINGLES, are not normally distributed. Lastly, the residuals of the GWR model are not perfectly normally distributed either, as shown in the histogram. These limitations suggest that while GWR is a powerful tool for spatial and exploratory analysis, it should be used in conjunction with other methods and diagnostics to ensure robust and reliable results.
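One quick diagnostic for the multicollinearity noted above is the variance inflation factor, computed on the underlying global specification; a minimal sketch, where `LNMEDHVAL` is a hypothetical column name for the log-transformed median house value (substitute the actual name used in the dataset):

```r
library(car)  # for vif()

# VIFs on the global OLS specification; values well above ~4-5 would
# flag problematic multicollinearity among the predictors
ols_global <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
                 data = data)
vif(ols_global)
```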
It’s important to clarify a common point of confusion: spatially lagged (weighted) residuals and spatial lag model residuals are not the same. Spatially lagged residuals are created by averaging the residuals of a spatial unit’s neighbors, as defined by a spatial weights matrix, and are used in diagnostic plots (e.g., Moran’s I scatterplots) to detect spatial patterns in residuals. In contrast, spatial lag model residuals are the actual model errors (observed minus predicted) from a regression that includes a spatially lagged dependent variable as a predictor; these residuals remain after accounting for the endogenous influence of nearby outcome values. It is therefore critical to use precise terminology: “spatially lagged residuals” should not be used to describe residuals from a spatial lag model; the term refers to the spatial average of residuals from another model (e.g., OLS or GWR).
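The distinction can be made concrete in code; a minimal sketch, assuming an OLS fit `ols_model`, a spatial lag fit `lag_model` (e.g., from `spatialreg::lagsarlm`), and the queen weights list `queenlist`:

```r
# Spatially lagged residuals: neighbor-averaged OLS residuals, used only
# as a diagnostic to look for spatial patterning
lagged_ols_resid <- lag.listw(queenlist, residuals(ols_model))

# Spatial lag model residuals: the model's own errors after the spatially
# lagged dependent variable has been included as a predictor
lag_model_resid <- residuals(lag_model)
```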
Finally, while ArcGIS offers tools for running GWR, it presents several limitations. ArcGIS lacks flexibility in customizing bandwidth selection, kernel type, or weighting functions. More importantly, it does not provide full diagnostic outputs such as local standard errors, residuals, or detailed summaries of model performance. This restricts the user’s ability to evaluate model quality or detect multicollinearity and violates reproducibility standards common in research. In contrast, R’s spgwr package allows transparent control over model parameters and facilitates detailed post-model diagnostics, making it a more robust environment for conducting GWR analysis.
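As an illustration of that flexibility, spgwr exposes the bandwidth search, kernel choice, and diagnostic outputs directly; a minimal sketch, assuming the same predictors as above and `LNMEDHVAL` as a hypothetical name for the outcome column (spgwr expects a Spatial*DataFrame or a `coords` matrix, so an sf object may first need `as(data, "Spatial")`):

```r
library(spgwr)

# Cross-validated adaptive bandwidth with a Gaussian kernel
bw <- gwr.sel(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
              data = data_sp, adapt = TRUE, gweight = gwr.Gauss)

# Fit GWR with full diagnostics: local standard errors, residuals,
# and local R² are all returned in the model's SDF
gwr_fit <- gwr(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
               data = data_sp, adapt = bw, gweight = gwr.Gauss,
               hatmatrix = TRUE, se.fit = TRUE)
```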