Keywords: Nearest Neighbor Index (NNI), Ripley’s K-Function, Ripley’s L-Function, Spatial Clustering
GitHub Repository: MUSA5000-Point-Pattern-Analysis
Access to fresh and healthy food is a critical issue in many American cities. Lack of access to fresh food may lead to significant public health challenges and worsen health disparities. In Philadelphia, the Food Trust has established a network of farmers markets to improve access to fresh food. However, some neighborhoods may still have limited or no access to these markets, raising concerns about potential food deserts and inequitable access. This study analyzes the spatial distribution of farmers markets in Philadelphia to determine whether they are randomly placed, clustered, or dispersed across the city. It also tries to identify areas that lack access to farmers markets and may be food deserts.
To set up the hypotheses for our analysis, we first need to understand the concept of complete spatial randomness (CSR).
A point pattern is considered to be Completely Spatially Random (CSR) if the points are distributed without any discernible pattern, which means their placement is entirely random. CSR serves as a baseline model in spatial clustering analysis, allowing researchers to determine whether a distribution is random or exhibits clustering or dispersion. It is often used as the null hypothesis in point pattern analysis to assess deviations in spatial distributions.
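To establish that a point pattern is CSR, two critical conditions must be met: (1) equal probability, meaning every location in the study area is equally likely to receive a point; and (2) independence, meaning the placement of one point does not influence the placement of any other point.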
Together, these two conditions, equal probability of placement and independence of points, ensure that the point pattern is completely random. If either condition is violated, the point pattern may exhibit clustering or dispersion, indicating that the points are not randomly distributed.
For our point pattern analysis, the null hypothesis \(H_0\) is as follows:
\[ H_0: \text{The point pattern of farmers markets in Philadelphia follows complete spatial randomness (CSR).} \] This implies that points are equally likely to fall anywhere within the study area, with no preference for specific locations. In addition, the placement of one point does not influence the placement of other points. CSR assumes no clustering or systematic spacing between points, which is the default or baseline spatial distribution.
On the other hand, the alternative hypothesis \(H_a\) is: \[ H_a: \text{The point pattern of farmers markets in Philadelphia exhibits clustering or dispersion.} \] Clustering occurs when points are concentrated in specific areas, reflecting an underlying attraction between points, such as hotspots of activity. Dispersion, on the other hand, occurs when points are evenly spaced apart, indicating a repulsion effect or avoidance behavior. In either case, the point pattern does not follow CSR.
The quadrat method is a typical spatial analysis technique used to study the distribution of points within a defined area. It involves dividing the study area into quadrats (small, equally sized square cells). The number of points in each quadrat is then counted and analyzed to determine whether the points are randomly distributed, clustered, or dispersed. The counts are used to compute statistical measures such as the variance-to-mean ratio (VMR), which is the variance of the quadrat counts divided by their mean. If the VMR is close to 1, the points are randomly distributed. If the VMR is significantly greater than 1, the points are clustered, and if it is significantly less than 1, the points are dispersed.
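The VMR calculation can be illustrated with a minimal base R sketch. The coordinates, the unit-square study area, the 5 x 5 grid, and the helper name `quadrat_vmr` are all assumptions made for illustration; none of them come from the farmers market data.

```r
# Hypothetical point coordinates in a unit square (illustrative data only)
set.seed(1)
pts <- data.frame(x = runif(200), y = runif(200))

# Variance-to-mean ratio of quadrat counts on a k-by-k grid over [0, 1]^2
quadrat_vmr <- function(x, y, k = 5) {
  # Assign each point to a column and row index in 1..k
  col <- pmin(floor(x * k) + 1, k)
  row <- pmin(floor(y * k) + 1, k)
  # factor() with fixed levels keeps empty quadrats as zero counts
  counts <- as.vector(table(factor(col, levels = 1:k), factor(row, levels = 1:k)))
  var(counts) / mean(counts)
}

# Random pattern: VMR should be in the neighborhood of 1
vmr_random <- quadrat_vmr(pts$x, pts$y)

# Strongly clustered pattern: every point falls in one corner quadrat
vmr_clustered <- quadrat_vmr(runif(200, 0, 0.2), runif(200, 0, 0.2))

# Perfectly regular pattern: exactly one point at the center of each quadrat
grid <- expand.grid(x = (1:5 - 0.5) / 5, y = (1:5 - 0.5) / 5)
vmr_regular <- quadrat_vmr(grid$x, grid$y)
```

Comparing `vmr_regular`, `vmr_random`, and `vmr_clustered` shows the ordering described above: the regular pattern gives the smallest ratio and the clustered pattern the largest.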
While this method is simple and widely used for basic spatial pattern analysis, it has several limitations that make it less practical for more detailed or complex analyses:
Given these limitations, the quadrat method may not be suitable for more complex spatial patterns or when detailed spatial information is required. More advanced methods, such as Ripley’s K-function or nearest neighbor analysis, are often preferred for analyzing point patterns in spatial data.
The Nearest Neighbor Analysis (NNA) is another spatial statistical technique used to assess whether the distribution of point features in a given area exhibits clustering, randomness, or regular dispersion. It does so by comparing the observed average distance between each point and its nearest neighbor to the expected average distance under a completely spatially random (CSR) pattern.
More formally, the Observed Average Distance \(\bar{D}_O\) is the mean distance from each point to its closest neighboring point in the observed dataset. It is calculated as:
\[ \bar{D}_O = \frac{\sum_{i=1}^{n} D_i}{n} \]
Where \(D_i\) is the distance from point \(i\) to its nearest neighbor and \(n\) is the number of points.
The Expected Average Distance is \(\bar{D}_E\), representing the average nearest neighbor distance that would be expected if the points were randomly distributed across the area and can be calculated by:
\[ \bar{D}_E = \frac{0.5}{\sqrt{n / A}} \]
Where \(n\) is the number of points and \(A\) is the area of the study region.
The Nearest Neighbor Index (NNI) is defined as the ratio of the observed average nearest neighbor distance to the expected average distance under a random distribution:
\[ NNI = \frac{\text{Observed Average Distance}}{\text{Expected Average Distance (when pattern is random)}} = \frac{\bar{D}_O}{\bar{D}_E} \]
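Where \(\bar{D}_O\) and \(\bar{D}_E\) are the observed and expected average nearest neighbor distances defined above.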
When the Nearest Neighbor Index (NNI) is approximately equal to 1, the distribution of points is likely random, because the observed average distance is similar to what would be expected under a completely spatially random pattern. If NNI < 1, the pattern is likely clustered, because points are closer together than would be expected by chance. If NNI > 1, the pattern is likely dispersed, because points are more widely spaced than expected under randomness.
More specifically, when NNI ≈ 0, it indicates that all points are concentrated at the same location (i.e., the observed average distance is zero), reflecting a highly clustered spatial pattern. When NNI = 1, the observed and expected average distances are equal, indicating a random spatial pattern. When NNI = 2.149, the index reaches its theoretical maximum, indicating a perfectly uniform point distribution, such as one following a hexagonal lattice pattern.
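As a check on the maximum value quoted above, the following short derivation sketches where 2.149 comes from. It assumes a triangular arrangement of points with spacing \(s\) (a symbol introduced here for illustration), in which each point sits at the center of a hexagonal cell of area \(\frac{\sqrt{3}}{2}s^2\):
\[ \frac{n}{A} = \frac{2}{\sqrt{3}\,s^2}, \qquad \bar{D}_O = s, \qquad \bar{D}_E = \frac{0.5}{\sqrt{n / A}} = 0.5\,s\,\sqrt{\frac{\sqrt{3}}{2}} \approx 0.4653\,s \]
\[ NNI = \frac{\bar{D}_O}{\bar{D}_E} \approx \frac{s}{0.4653\,s} \approx 2.149 \]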
In this report, the study area is defined as the administrative boundary of the city of Philadelphia, specifically using the 2013 ZIP code-level shapefile of Philadelphia. The spatial analysis of farmers market distribution is conducted within this defined geographic boundary. All distance and spatial statistics are calculated based on the locations of farmers markets within these Philadelphia ZIP code boundaries.
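The significance of the Nearest Neighbor Index (NNI) is determined by a hypothesis test based on the standard normal (z) distribution. The hypotheses are as follows: under \(H_0\), the observed point pattern is random (CSR); under \(H_a\), the observed point pattern is not random and is either clustered or dispersed.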
The z-score from the standard normal distribution is calculated as:
\[ z = \frac{\bar{D}_O - \bar{D}_E}{SE} \]
Where \(SE = \frac{0.26136}{\sqrt{n^2 / A}}\) is the standard error of the mean nearest neighbor distance under CSR, with \(n\) the number of points and \(A\) the area of the study region.
This is a two-tailed test, since the alternative hypothesis considers both clustering and dispersion. A critical value of \(|z| = 1.96\) corresponds to a significance level of \(\alpha = 0.05\). If \(z > 1.96\), we reject \(H_0\) in favor of \(H_a\), indicating significant dispersion; that is, points are more spread out than expected under randomness. If \(z < -1.96\), we also reject \(H_0\) in favor of \(H_a\), but this time indicating significant clustering; that is, points are more tightly packed than expected. If \(-1.96 \leq z \leq 1.96\), we fail to reject \(H_0\), indicating that the observed point pattern is not significantly different from a random distribution.
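The full test can be sketched in a few lines of base R. The helper name `nni_test`, the unit-square study area, and both example patterns are hypothetical and serve only to show how the index, z-score, and two-tailed p-value fit together.

```r
# Nearest neighbor index, z-score, and two-tailed p-value
# for points in a study area of known size
nni_test <- function(x, y, area) {
  n <- length(x)
  d <- as.matrix(dist(cbind(x, y)))   # all pairwise Euclidean distances
  diag(d) <- Inf                      # a point is not its own neighbor
  d_obs <- mean(apply(d, 1, min))     # observed mean nearest neighbor distance
  d_exp <- 0.5 / sqrt(n / area)       # expected mean distance under CSR
  se <- 0.26136 / sqrt(n^2 / area)    # standard error under CSR
  z <- (d_obs - d_exp) / se
  p <- 2 * pnorm(-abs(z))             # two-tailed p-value
  c(NNI = d_obs / d_exp, z = z, p = p)
}

# Hypothetical regular pattern: a 10 x 10 grid in the unit square
grid <- expand.grid(x = (1:10 - 0.5) / 10, y = (1:10 - 0.5) / 10)
res_grid <- nni_test(grid$x, grid$y, area = 1)

# Hypothetical clustered pattern: 100 points packed into one small corner
set.seed(3)
res_clustered <- nni_test(runif(100, 0, 0.1), runif(100, 0, 0.1), area = 1)
```

The grid yields an NNI above 1 with a large positive z-score (dispersion), while the packed pattern yields an NNI below 1 with a large negative z-score (clustering).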
While Nearest Neighbor Analysis (NNA) is a useful method for detecting point patterns, it has several important limitations, especially in complex urban environments or irregularly shaped study areas.
NNA assumes a rectangular study area, regardless of the actual shape of the region. For example, in the case of hospital locations in Philadelphia, the hospitals are clustered in Center City. However, because the tool uses a rectangular bounding box around the points rather than the actual city outline, the calculated study area was much smaller than the city and covered mainly the part of Center City where the hospitals are concentrated. This underestimation of the study area decreased the expected average distance \(\bar{D}_E\), which pushed the NNI toward 1 and led to a false conclusion of randomness, even though the clustering in the city center was visually evident. This example highlights how misrepresenting the true shape of the study area can result in inaccurate or misleading conclusions.
Edge Effects
NNA is also subject to edge effects. Points located near the boundaries may have their true nearest neighbors just outside the study area, but these are not considered in the analysis. This omission can result in overestimated nearest neighbor distances, which in turn distorts the z-score and test conclusions, particularly in dense urban areas.
Assumption of Homogeneity
NNA also assumes that the entire study area is homogeneous, meaning points can occur anywhere with equal probability. However, this is rarely true in real-world contexts. For example, hospitals are more likely to be located near population centers, not distributed evenly across a region. If we ignore these constraints, NNA might wrongly interpret such spatial organization as clustering.
Scale Sensitivity
NNA is sensitive to scale because it only considers the nearest neighbor distance. It fails to capture more complex spatial structures, such as clustering at small scales combined with dispersion at larger scales. For example, bees are clustered within hives (small scale), while hives are dispersed (large scale). NNA would detect only the clustering of bees, missing the broader-scale dispersion.
Given these limitations, other spatial analysis techniques, such as Ripley’s K-function, are needed. They allow for a better understanding of spatial phenomena, especially when dealing with real-world, heterogeneous environments.
Ripley’s K-function is another spatial statistic for analyzing spatial point patterns. The K-function represents the mean number of points observed within a circle of radius \(d\), standardized by the overall point density in the study area. It first draws a circle of radius \(d\) around each point \(s_i\) in the dataset. Then, it counts the number of other points (events) within each circle and computes the average count across all circles, which reflects the number of points around a typical event at a given radius \(d\). The K-function \(K(d)\) is then calculated by dividing this average count by the overall point density in the study region. The formula for this calculation is:
\[ K(d) = \frac{\frac{1}{n} \sum_{i=1}^{n} \#\left[S \in \text{Circle}(s_i, d)\right]}{\frac{n}{a}} = \frac{\text{Mean number of points in all circles of radius } d}{\text{Mean point density in entire study region } a} \]
Where \(n\) is the number of points, \(a\) is the area of the study region, and \(\#\left[S \in \text{Circle}(s_i, d)\right]\) is the number of other points within the circle of radius \(d\) centered on point \(s_i\).
Under a random spatial process (CSR), the expected value of the K-function is \(K(d) = \pi d^2\), which is simply the area of a circle with radius \(d\). If the observed value of \(K(d)\) is greater than \(\pi d^2\), it indicates clustering at scale \(d\). If the observed value is less than \(\pi d^2\), it indicates dispersion at that scale.
For ease of interpretation, the K-function is often transformed into the L-function, which is defined as:
\[ L(d) = \sqrt{\frac{K(d)}{\pi}} - d \]
Under a random spatial process (CSR), the expected value of \(L(d)\) is 0, because when \(K(d) = \pi d^2\), we get:
\[ L(d) = \sqrt{\frac{\pi d^2}{\pi}} - d = d - d = 0 \]
When \(L(d)\) is greater than 0, it indicates clustering at scale \(d\). When \(L(d)\) is less than 0, it indicates dispersion at scale \(d\).
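The definitions of \(K(d)\) and \(L(d)\) above translate directly into a naive base R sketch. It applies no edge correction, and the function names `k_naive` and `l_from_k` and the four-point example are assumptions for illustration rather than the estimator used later in this report.

```r
# Naive K-function estimate (no edge correction) for points in an area of size `area`
k_naive <- function(x, y, d, area) {
  n <- length(x)
  dm <- as.matrix(dist(cbind(x, y)))
  diag(dm) <- Inf                           # exclude each point from its own circle
  sapply(d, function(r) {
    mean_count <- mean(rowSums(dm <= r))    # mean number of other points within r
    mean_count / (n / area)                 # divide by the overall point density
  })
}

# L-function as defined in the text: equal to 0 when K(d) = pi * d^2
l_from_k <- function(k, d) sqrt(k / pi) - d

# Hypothetical example: the four corners of a unit square
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
d <- c(0.5, 1, 1.5)
k_hat <- k_naive(x, y, d, area = 1)
l_hat <- l_from_k(k_hat, d)

# Sanity check of the CSR baseline: L is 0 when K equals the circle area
l_csr <- l_from_k(pi * d^2, d)
```

In this toy pattern no point has a neighbor within 0.5, two neighbors lie within 1, and all three lie within 1.5, so `k_hat` rises in steps as the radius grows.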
In ArcGIS, the L-function is defined slightly differently, without subtracting \(d\):
\[ L(d) = \sqrt{\frac{\text{Mean number of points in all circles of radius } d}{\text{Mean point density in entire study region } a} \cdot \frac{1}{\pi}} = \sqrt{\frac{K(d)}{\pi}} \]
Under this definition, the expected value under CSR is \(L(d) = d\) rather than 0.
In K-function analysis, the first step is to define the beginning distance \(d_0\), which represents the smallest distance for evaluating spatial relationships. It is typically set to 0, representing the starting point of spatial interaction. The next step is to define a sequence of incremental distances, usually set at equal intervals (e.g., 10 meters, 20 meters, 50 meters). These distances represent the various spatial scales at which clustering or dispersion might occur, and the increment determines how frequently the K-function is evaluated. Common choices for the number of distance bands are 10 or 20, depending on the scale and resolution desired. In R, the r argument in the spatstat package allows us to specify a vector of distances at which the K-function should be evaluated. The manual advises against overriding the default unless there is a good reason, as “there is a sensible default.”
The maximum distance at which the K-function should be evaluated can also be specified in R. It should be approximately one-half of the maximum pairwise distance between points in the dataset. Rounding to a convenient value (e.g., 1400 or 1500 instead of 1437) is acceptable for clarity and consistency. Based on the maximum distance, the formula for calculating the incremental distance is:
\[ \text{Increment} = \frac{\frac{1}{2} \cdot \text{Maximum Pairwise Distance}}{\text{Number of Distance Bands}} \]
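As a small worked example of this formula, the base R sketch below derives the distance bands from a set of coordinates. The four coordinates and the choice of 10 bands are hypothetical values picked so the arithmetic is easy to follow.

```r
# Hypothetical coordinates in feet (illustrative only)
coords <- cbind(x = c(0, 3000, 0, 4000), y = c(0, 0, 4000, 3000))

max_pairwise <- max(dist(coords))   # largest distance between any two points
max_dist <- max_pairwise / 2        # evaluate K up to half of that distance
n_bands <- 10                       # assumed number of distance bands
increment <- max_dist / n_bands     # step between successive distances

# Distances at which the K-function would be evaluated
distances <- seq(increment, max_dist, length.out = n_bands)
```

Here the farthest pair is 5000 feet apart, so the maximum evaluation distance is 2500 feet and each of the 10 bands is 250 feet wide.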
In K-function or L-function analysis, the goal is to determine whether the observed point pattern significantly deviates from a random pattern (Complete Spatial Randomness, CSR) at various spatial scales. The hypotheses are formulated as follows: under \(H_0\), the point pattern is random (CSR) at distance \(d\); under \(H_{a1}\), the point pattern is clustered at distance \(d\); under \(H_{a2}\), the point pattern is dispersed at distance \(d\).
The testing procedure compares the observed values with those generated under the null hypothesis \(H_0\). Suppose the observed pattern has \(n\) points in the study area. We generate several random point patterns (e.g., 9, 99, or 999) with \(n\) points each, assuming CSR. For each random pattern, we calculate the L-function \(L(d)\) at different distances \(d\) using the formula provided earlier. For each distance \(d\), we calculate the Lower Envelope (\(L^-(d)\)) and Upper Envelope (\(L^+(d)\)) based on the distribution of \(L(d)\) values from the random patterns. The Lower Envelope (\(L^-(d)\)) is the minimum value of \(L(d)\) observed across the random patterns. The Upper Envelope (\(L^+(d)\)) is the maximum value of \(L(d)\) observed across the random patterns.
Next, we compare the observed pattern with the random patterns. For each distance \(d\), we compare the observed value of \(L(d)\) (denoted \(L_{\text{obs}}(d)\)) with the lower and upper envelopes (\(L^-(d)\) and \(L^+(d)\)). If \(L^-(d) \leq L_{\text{obs}}(d) \leq L^+(d)\), we cannot reject the null hypothesis \(H_0\), meaning that the pattern is not significantly different from CSR at distance \(d\). If \(L_{\text{obs}}(d) > L^+(d)\), we reject \(H_0\) in favor of \(H_{a1}\), indicating significant clustering at scale \(d\). If \(L_{\text{obs}}(d) < L^-(d)\), we reject \(H_0\) in favor of \(H_{a2}\), indicating significant dispersion at scale \(d\).
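The envelope procedure can be sketched in base R as follows. This is a simplified illustration under stated assumptions: a unit-square study area, a naive \(L(d)\) without edge correction (the helper `l_naive` is hypothetical), 99 simulations, and a made-up observed pattern packed into one corner.

```r
# Naive L(d) = sqrt(K(d) / pi) - d without edge correction
l_naive <- function(x, y, d, area) {
  n <- length(x)
  dm <- as.matrix(dist(cbind(x, y)))
  diag(dm) <- Inf                                   # ignore self-distances
  k <- sapply(d, function(r) mean(rowSums(dm <= r)) / (n / area))
  sqrt(k / pi) - d
}

set.seed(42)
d <- seq(0.02, 0.2, by = 0.02)   # distances to evaluate
n <- 50                          # number of points in each pattern
n_sim <- 99                      # number of CSR simulations

# Hypothetical observed pattern: points packed into one corner (clustered)
obs_x <- runif(n, 0, 0.3)
obs_y <- runif(n, 0, 0.3)
l_obs <- l_naive(obs_x, obs_y, d, area = 1)

# Simulate CSR patterns with n points each; one column per simulation
sims <- replicate(n_sim, l_naive(runif(n), runif(n), d, area = 1))

l_lo <- apply(sims, 1, min)   # lower envelope L^-(d)
l_hi <- apply(sims, 1, max)   # upper envelope L^+(d)

# Decision at each distance, following the rules in the text
verdict <- ifelse(l_obs > l_hi, "clustered",
                  ifelse(l_obs < l_lo, "dispersed", "not significant"))
```

Using the minimum and maximum of the simulated values mirrors the envelope definition above; with more simulations the envelopes widen and the test becomes stricter.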
When analyzing point patterns, points located near the boundary of the study area can present challenges. If we draw a circle of radius \(d\) around such a point, part of the circle will fall outside the study area, meaning no other points can be counted in that region. In contrast, for a point located well inside the study area, the circle is fully contained within the study area, and other points could theoretically be anywhere inside it. This difference between points near the boundary and those inside the area can bias the analysis.
To address this issue, Ripley’s Edge Correction is commonly applied. When a point is close to the boundary, Ripley’s Edge Correction compensates for the missing points outside the study area by adjusting the weight of points based on their proximity to the boundary. It checks each point’s distance from the study area boundary and adjusts the weight of its neighbors accordingly. This method works well in rectangular study regions but does not apply to irregular shapes.
Another edge correction method is the Simulate Outer Boundary Values Correction. This method mirrors points across the study area boundary to correct for underestimates near the edges. Points within a distance equal to the maximum distance band of the boundary are mirrored across the edge. These mirrored points are then used to provide more accurate neighbor estimates for points near the boundary.
In our report, Simulate Outer Boundary Values is used because the boundary of our study area, Philadelphia, is not rectangular.
In real-world spatial data, the assumption that points are evenly distributed over the study area may not hold. Factors such as population density, resource availability, or land use regulations can result in a nonhomogeneous distribution of points. In these cases, nonhomogeneous K-functions can be used, which take this variation into account when computing the expected values of the K-function. For example, hospitals often cluster in densely populated areas. If we ignore population distribution and apply a homogeneous K-function, we may incorrectly interpret this natural concentration as spatial clustering. To make a fair comparison, we must account for the background population pattern, which serves as a reference measure guiding where points should be more or less likely to occur.
To incorporate a reference measure into the K-function analysis, we modify the way we generate random point patterns for comparison. We first transform the reference measure data into probabilities, where each area’s probability is its value divided by the total for the whole study area. These probabilities are then used to generate spatially weighted random points, meaning denser areas are more likely to receive more points. Next, we convert the probability shapefile into a raster surface, so that each pixel carries a value representing the probability of a point falling there. Using this raster, we can generate multiple (e.g., 9, 99, or 999) random point patterns based on this weighted surface. Finally, we calculate the \(L(d)\) function for each of these simulated patterns and compare them to the observed point pattern.
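The weighted simulation steps can be sketched in base R. The 4 x 4 grid, the population counts, and the helper `sim_weighted` are all made up for illustration; a real analysis would use the rasterized reference measure for the study area.

```r
# Hypothetical reference surface: a 4 x 4 grid of cells over the unit square,
# each with a made-up population count
set.seed(7)
cells <- expand.grid(col = 1:4, row = 1:4)
cells$pop <- c(50, 10, 5, 0, 80, 20, 5, 0, 40, 10, 5, 0, 10, 5, 0, 0)

# Convert the reference measure into probabilities that sum to 1
cells$prob <- cells$pop / sum(cells$pop)

# Draw a cell for each point in proportion to its probability,
# then place the point uniformly at random inside that cell
sim_weighted <- function(n, cells, cell_size = 0.25) {
  idx <- sample(nrow(cells), size = n, replace = TRUE, prob = cells$prob)
  data.frame(
    x = (cells$col[idx] - 1) * cell_size + runif(n, 0, cell_size),
    y = (cells$row[idx] - 1) * cell_size + runif(n, 0, cell_size),
    cell = idx
  )
}

# Generate several simulated patterns (9 here; 99 or 999 are also common)
patterns <- replicate(9, sim_weighted(60, cells), simplify = FALSE)
```

Cells with zero population never receive a point, and cells with larger populations receive more on average. Each element of `patterns` could then be passed to an \(L(d)\) calculation to build envelopes that reflect the population surface instead of CSR.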
After loading the data, we made a quick visualization of the farmers markets in Philadelphia to get a general sense of their spatial distribution. Based on the map shown below, we can see that the farmers markets are not evenly distributed across the city. They appear to be clustered in certain areas. Northeastern and Southeastern Philadelphia appear to have fewer farmers markets than other regions.
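# Packages used by the code in this report
library(sf)          # spatial data and geom_sf() support
library(ggplot2)     # plotting
library(spatstat)    # ppp, nndist, Kest, Lest, envelope
library(knitr)       # kable
library(kableExtra)  # kable_styling and the %>% pipe

ggplot() +
  geom_sf(data = philly, fill = "grey80") +
  geom_sf(data = zipcode, fill = NA, color = "white") +
  geom_sf(data = market, color = "#c44536", size = 1.5) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 12, face = "bold"),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "grey", fill = NA, size = 0.4)
  ) +
  labs(title = "Farmers Markets in Philadelphia")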
The nearest neighbor analysis is a statistical method used to assess the spatial distribution of points in a given area. It helps determine whether the points are randomly distributed, clustered, or dispersed. The analysis calculates the distance between each point and its nearest neighbor, providing insights into the spatial arrangement of the points.
To conduct the nearest neighbor analysis, we first prepare the data by extracting the point coordinates of the farmers’ markets and converting them into a ppp object.
philly_window <- as.owin(st_transform(philly, crs = st_crs(market)))
# Extract point coordinates and convert to `ppp`
market_coords <- st_coordinates(market)
market_pp <- ppp(x = market_coords[,1], y = market_coords[,2], window = philly_window)
After that, we calculate the nearest neighbor distance for each point and compare the observed mean distance to the expected mean distance under the assumption of complete spatial randomness (CSR). The Nearest Neighbor Index (NNI) is calculated as the ratio of the observed mean distance to the expected mean distance. An NNI less than 1 indicates clustering, while a value greater than 1 indicates dispersion. A value close to 1 suggests randomness. We also calculate the z-score and p-value to assess the statistical significance of the observed distribution.
# Nearest neighbor analysis
nnd <- nndist.ppp(market_pp)
# Average Observed Distance
MeanObsDist <- mean(nnd)
# Average Expected Distance
# The expected mean nearest neighbor distance under Complete Spatial Randomness (CSR).
n <- npoints(market_pp)
area <- area.owin(market_pp$window)
MeanExpDist <- 0.5 / sqrt(n / area)
#Standard Error
SE <- 0.26136 / sqrt(n*n / area)
According to the result shown below, the Nearest Neighbor Index (NNI) is 0.778, which indicates that the farmers markets are clustered in certain areas. The z-score is -3.345, and the p-value is below 0.001 (displayed as 0.000), suggesting that the observed clustering is statistically significant. Since the p-value is less than 0.05, we can reject the null hypothesis of Complete Spatial Randomness (CSR) at the 0.05 significance level and conclude that the farmers markets in Philadelphia are not randomly distributed. Instead, they are clustered in certain areas.
NNI <- MeanObsDist / MeanExpDist            # Nearest Neighbor Index
zscore <- (MeanObsDist - MeanExpDist) / SE  # z-score
# One-tailed p-value in the direction of the observed z-score;
# for a two-tailed test, use 2 * pnorm(-abs(zscore)) instead
pval <- ifelse(zscore > 0, 1 - pnorm(zscore), pnorm(zscore))
results <- data.frame(
Metric = c("Nearest Neighbor Index (NNI)", "Z-Score", "P-Value"),
Value = c(round(NNI, 3), round(zscore, 3), round(pval, 7))
)
results %>%
kable("html", col.names = c("Metric", "Value")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Metric | Value |
---|---|
Nearest Neighbor Index (NNI) | 0.778 |
Z-Score | -3.345 |
P-Value | 0.000 |
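
As a quick check on the p-value row, the sketch below recomputes the tail probabilities from the reported z-score alone. The variable names are illustrative; only the value -3.345 comes from the table.

```r
z <- -3.345                          # z-score reported in the results table
p_one_tailed <- pnorm(z)             # lower-tail probability (clustering direction)
p_two_tailed <- 2 * pnorm(-abs(z))   # both tails, for a two-tailed test
```

Both values fall below 0.001, so the conclusion at the 0.05 level is the same whether the test is treated as one-tailed or two-tailed.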
To conduct the K-function analysis, we first calculate the maximum Euclidean distance between points in the dataset; that is, we find the two points farthest apart from one another. Half of this distance is then used as the maximum distance for the K-function analysis.
max.distance <- max(proxy::dist(
data.frame(cbind(x = market_coords[,1], y = market_coords[,2])),
data.frame(cbind(x = market_coords[,1], y = market_coords[,2])),
method = "euclidean"
))
cat("Maximum Distance:", max.distance, "\n")
## Maximum Distance: 56698
The plot below shows Ripley’s K-function, with the observed \(K(d)\) as a solid line and the theoretical \(K(d)\) under the null hypothesis of Complete Spatial Randomness (CSR) as a dashed line. The observed \(K(d)\) is higher than the theoretical \(K(d)\) beginning at about 54.7 feet, which is essentially the smallest distance evaluated. This suggests that the points are more concentrated than expected under CSR, indicating clustering. Since the gap between the observed and theoretical \(K(d)\) increases with distance, we can conclude that clustering is more pronounced at larger distances.
khat <- Kest(market_pp, rmax = 28000, correction = "Ripley")
khat_df <- data.frame(
  r = khat$r,        # Distance values
  iso = khat$iso,    # Observed K(r) (isotropic)
  theo = khat$theo   # Theoretical K(r)
)
ggplot(khat_df, aes(x = r)) +
  geom_line(aes(y = iso, color = "Observed K(r)"), size = 2, color = "#197278") +
  geom_line(aes(y = theo, color = "Theoretical K(r)"), linetype = "dashed", size = 1, color = "#c44536") +
  labs(
    x = "r (Distance)",
    y = "Ripley's K-Function",
    title = "Ripley's Estimated K-Function",
    color = "Legend"
  ) +
  theme_light() +
  theme(
    plot.subtitle = element_text(size = 9, face = "italic"),
    plot.title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 6),
    axis.text.y = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
khat_df$difference <- khat_df$iso - khat_df$theo
threshold_index <- which(khat_df$difference > 0)[1]
if (!is.na(threshold_index)) {
consistent_start <- khat_df$r[threshold_index]
message("The observed K(r) is consistently higher than the theoretical K(r) starting at r = ", consistent_start)
} else {
message("The observed K(r) does not consistently exceed the theoretical K(r) within the given range.")
}
## The observed K(r) is consistently higher than the theoretical K(r) starting at r = 54.6875
We also plot Ripley’s K-function with confidence envelopes to assess the statistical significance of the observed pattern. The envelopes represent the range of values expected under Complete Spatial Randomness (CSR). The observed K-function is shown as a solid line, while the lower and upper envelopes are shown as dashed lines. The observed K-function lies above the upper envelope, indicating that the observed pattern is significantly different from CSR. We can conclude that the pattern is significantly clustered at these distances.
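# Assumed call: the chunk that creates Kenv is not shown in the rendered output
Kenv <- envelope(market_pp, fun = Kest, nsim = 9, rmax = 28000)
## Generating 9 simulations of CSR ...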
## 1, 2, 3, 4, 5, 6, 7, 8,
## 9.
##
## Done.
Kenv_df <- data.frame(
  r = Kenv$r,       # Distance values
  obs = Kenv$obs,   # Observed K-function
  lower = Kenv$lo,  # Lower envelope
  upper = Kenv$hi   # Upper envelope
)
ggplot(Kenv_df, aes(x = r)) +
  geom_line(aes(y = obs, color = "Observed K(r)"), size = 2, color = "#772e25") +
  geom_line(aes(y = lower, color = "Lower Envelope"), linetype = "dashed", size = 1, color = "#197278") +
  geom_line(aes(y = upper, color = "Upper Envelope"), linetype = "dashed", size = 1, color = "#c44536") +
  labs(
    x = "r (Distance)",
    y = "Khat(r)",
    title = "Ripley's Khat with Confidence Envelopes",
    color = "Legend"
  ) +
  theme_light() +
  theme(
    plot.subtitle = element_text(size = 9, face = "italic"),
    plot.title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 6),
    axis.text.y = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
Kenv_df$difference_lower <- Kenv_df$obs - Kenv_df$lower
first_below_index <- which(Kenv_df$difference_lower < 0)[1]
if (!is.na(first_below_index)) {
  below_start <- Kenv_df$r[first_below_index]
  message("The observed K(r) falls below the lower envelope starting at r = ", below_start)
} else {
  message("The observed K(r) does not fall below the lower envelope within the given range.")
}
## The observed K(r) does not fall below the lower envelope within the given range.
Then, we plot Ripley’s L-function, as it linearizes the K-function for easier interpretation. As shown below, the observed \(L(r) - r\) values are always greater than 0. This indicates that the observed number of points within a distance \(r\) is higher than what is expected under the null hypothesis of Complete Spatial Randomness (CSR).
Since the magnitude of \(L(r) - r\) reflects the degree of clustering, with larger values suggesting stronger clustering, the plot implies that the degree of clustering first increases and then decreases as distance increases. This suggests that farmers’ markets in Philadelphia are most strongly clustered at smaller distances and become less clustered at larger distances.
lhat <- Lest(market_pp, rmax = 28000, correction = "Ripley")
lhat_df <- data.frame(
  r = lhat$r,                  # Distance values
  L_obs = lhat$iso - lhat$r,   # Observed L-function minus r
  L_theo = lhat$theo - lhat$r  # Theoretical L-function minus r
)
ggplot(lhat_df, aes(x = r)) +
  geom_line(aes(y = L_obs, color = "Observed L(r)"), size = 2, color = "#197278") +
  geom_line(aes(y = L_theo, color = "Theoretical L(r)"), linetype = "dashed", size = 1, color = "#c44536") +
  labs(
    x = "r (Distance)",
    y = "Ripley's L - r",
    title = "Ripley's Estimated L-Function",
    color = "Legend"
  ) +
  theme_light() +
  theme(
    plot.subtitle = element_text(size = 9, face = "italic"),
    plot.title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 6),
    axis.text.y = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
The final plot shows Ripley’s L-function with confidence envelopes. The observed \(L(r) - r\) values are consistently above the upper confidence envelope (shaded region) across all distances, indicating that the observed points are more clustered than expected under Complete Spatial Randomness (CSR).
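# Assumed call: the chunk that creates Lenv is not shown in the rendered output
Lenv <- envelope(market_pp, fun = Lest, nsim = 9, rmax = 28000)
## Generating 9 simulations of CSR ...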
## 1, 2, 3, 4, 5, 6, 7, 8,
## 9.
##
## Done.
L2 <- Lenv
L2_df <- data.frame(
  r = L2$r,               # Distance values
  obs = L2$obs - L2$r,    # Adjusted observed L-function
  theo = L2$theo - L2$r,  # Adjusted theoretical L-function
  lo = L2$lo - L2$r,      # Lower confidence envelope
  hi = L2$hi - L2$r       # Upper confidence envelope
)
ggplot(L2_df, aes(x = r)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), fill = "grey80", alpha = 0.5) +
  geom_line(aes(y = obs, color = "Observed L(r)"), size = 2, color = "#197278") +
  geom_line(aes(y = theo, color = "Theoretical L(r)"), linetype = "dashed", size = 1, color = "#c44536") +
  labs(
    x = "r (Distance)",
    y = "L(r) - r",
    title = "Ripley's L-Function with Confidence Envelopes",
    color = "Legend"
  ) +
  theme_light() +
  theme(
    plot.subtitle = element_text(size = 9, face = "italic"),
    plot.title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 6),
    axis.text.y = element_text(size = 6),
    axis.title = element_text(size = 8)
  )

ggplot() +
  geom_sf(data = zipcode, aes(fill = Pop2000), color = "white") +
  scale_fill_continuous(low = "#e9edc9", high = "#344e41", name = "Population") +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 12, face = "bold"),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "grey", fill = NA, size = 0.4)
  ) +
  labs(title = "Philadelphia Population by Zip Code")
Without conducting further analysis, we suspect that the absence of farmers markets in Northeastern Philadelphia and South Philadelphia could be due to low population density in those ZIP codes. If the population is sparse, there may be less demand or fewer opportunities for farmers’ markets to attract enough customers. In this case, nonhomogeneous K-function analysis would be a more valuable tool. Unlike the homogeneous K-function, which assumes a uniform distribution of points across the study area, the nonhomogeneous K-function accounts for variations in point density. This allows us to assess clustering or dispersion while considering the underlying population distribution. By incorporating population density as a reference measure, we can better understand how farmers’ markets are distributed relative to the population and identify areas where they may be lacking.
The results from both the Nearest Neighbor Analysis and K-function analysis consistently indicate that the spatial distribution of farmers markets in Philadelphia is significantly clustered. The Nearest Neighbor Index (NNI) is 0.778, with a z-score of -3.345 and a p-value below 0.001. These values provide strong statistical evidence to reject the null hypothesis of complete spatial randomness. The K-function analysis supports this conclusion by showing that the observed K(d) begins to exceed the theoretical K(d) at a distance of about 54.7 feet. This divergence continues to increase with distance, indicating significant clustering across multiple spatial scales.
These findings align with initial expectations based on the visual distribution of farmers markets. The point data showed that markets were concentrated in Center City and parts of West Philadelphia, while large areas such as the Northeast and South appeared underserved. Both methods confirmed these visual observations through statistically significant results. This consistency strengthens the reliability of the findings. At the same time, it is necessary to acknowledge the limitations of the methods used. Nearest Neighbor Analysis evaluates only the distance to the closest point and is highly sensitive to the shape of the study area. In a city with irregular boundaries such as Philadelphia, this can result in inaccurate estimates of expected spacing. K-function and L-function analyses offer a more detailed view by examining clustering across different distances. However, they rely on the assumption that points have an equal probability of occurring anywhere within the study area. This assumption is difficult to justify in a city where population density and land use vary significantly. Despite these limitations, the convergence of results across different methods provides strong evidence that the observed pattern is not random.
ggplot() +
  geom_sf(data = zipcode, aes(fill = MedIncome), color = "white") +
  scale_fill_continuous(low = "#FAF9F6", high = "#197278", name = "Median Income") +
  geom_sf(data = market, aes(), color = "#c44536", size = 2) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 12, face = "bold"),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "grey", fill = NA, size = 0.4)
  ) +
  labs(title = "Farmers Markets and Median Household Income by Zip Code")
To examine whether this pattern reflects broader socioeconomic disparities, we compared the spatial distribution of farmers markets to median household income at the ZIP code level using the MedIncome variable from the 2000 census. The overlay shows that higher-income ZIP codes contain more markets, while many lower-income areas, particularly in North, Southwest, and South Philadelphia, have little or no market presence. Although this conclusion is based on visual analysis and not statistical testing, the association appears strong. The absence of farmers markets in lower-income areas suggests inequities in access to healthy food across the city.
These findings support a clear conclusion. Farmers markets in Philadelphia are spatially clustered. This pattern is not the result of random placement. Markets are concentrated in central, higher-income neighborhoods, while peripheral and lower-income communities experience limited access. The conclusion is supported by statistical evidence, consistent across multiple methods, and reinforced by income overlay analysis.
This evidence points to the need for targeted policy action. The City of Philadelphia and its partners should develop an equity-based plan to expand farmers market access in underserved neighborhoods. Priority areas include North, Southwest, and Northeast Philadelphia. New markets should be located based on data about household income, population density, and service gaps. Where permanent markets are not feasible, the city should invest in mobile markets and rotating pop-up locations to ensure reliable access. All market siting should be coordinated with public transit systems to support access for residents without cars. The city should also offer financial and logistical support to vendors willing to operate in high-need areas. These actions will promote food security, support neighborhood economies, and advance goals related to public health and spatial equity.