Keywords: K-means clustering, scree plot, unsupervised learning
GitHub Repository: MUSA5000-K-Means-Clustering
In this report, we use the K-means clustering algorithm to analyze five socio-economic variables for more than 1,700 block groups in Philadelphia: MEDHVAL, PCTBACHMOR, MEDHHINC, PCTVACANT, and PCTSINGLES. The goal is to identify distinct clusters of block groups based on these variables. The K-means method simplifies the analysis by grouping block groups with similar characteristics into clusters, allowing us to better understand the socio-economic landscape of Philadelphia and identify areas with similar profiles.
K-means clustering offers several benefits: it reduces complexity by segmenting neighborhoods into clear typologies, highlights relationships between income, education, and housing characteristics, and organizes the data into meaningful clusters. This analysis addresses key questions, such as identifying distinct neighborhood groups, understanding the factors that define each cluster, and exploring how income and education relate to vacancy rates and housing types.
Additionally, this approach can help urban planners and policy makers design targeted interventions for disadvantaged clusters or neighborhoods, such as affordable housing programs or economic revitalization efforts. K-means clustering provides actionable insights for data-driven decision-making.
The K-means algorithm is a popular clustering method that partitions data into K distinct clusters based on feature similarity. In essence, the calculation is a six-step iterative process:

1. Choose the number of clusters, K.
2. Initialize K centroids (for example, by selecting K observations at random).
3. Assign each observation to the cluster whose centroid is nearest.
4. Recompute each centroid as the mean of the observations assigned to it.
5. Repeat the assignment and update steps.
6. Stop when the centroids stabilize or a maximum number of iterations is reached.
The assignment and update steps alternate until the centroids stabilize, meaning they no longer change significantly, or until the maximum number of iterations is reached. The final result is a set of K clusters, each represented by its centroid, with every data point assigned to exactly one cluster. Ultimately, the K-means algorithm aims to minimize the within-cluster sum of squared errors (SSE), which is the sum of squared distances between each observation and the centroid of the cluster into which it falls.
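Written out, the objective described above takes the standard form, with $C_k$ denoting the $k$-th cluster and $\mu_k$ its centroid:

$$
SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$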
While K-means is effective and easy to implement, it comes with notable limitations:

- The number of clusters K must be specified in advance.
- Results depend on the initial centroids, so the algorithm can converge to a local rather than a global minimum.
- It assumes roughly spherical, similarly sized clusters and struggles with irregular cluster shapes.
- It is sensitive to outliers and to the scale of the variables, so features generally need to be standardized.
In addition to K-means, there are several other clustering methods that can be used to identify patterns in data. Among these, the most popular are hierarchical clustering and DBSCAN:

- Hierarchical clustering builds a tree of nested clusters (a dendrogram), either by successively merging observations (agglomerative) or splitting them (divisive), and does not require the number of clusters to be fixed in advance.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that lie in dense regions, labels isolated points as noise, and can recover clusters of arbitrary shape.
The choice of clustering algorithm depends on the characteristics of the data and the research goals. For datasets with irregular cluster shapes or significant outliers, DBSCAN is a strong choice. If the researcher seeks a hierarchical representation or does not know the number of clusters beforehand, hierarchical clustering is more suitable. For spatial data like Philadelphia’s block groups, where cluster boundaries might not be spherical, DBSCAN or hierarchical clustering could therefore be more appropriate. However, K-means remains a valuable tool for its simplicity and interpretability, especially when the number of clusters is known and the data are well-scaled.
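For reference, a minimal sketch of how these two alternatives could be run on the same standardized data, assuming the scaled data frame `df` created in the next section and the dbscan package; the eps and minPts values are purely illustrative:

```r
# Hierarchical clustering: Ward's method on Euclidean distances,
# then cut the dendrogram at a chosen number of clusters
d <- dist(df, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
hc_clusters <- cutree(hc, k = 2)

# DBSCAN: density-based clustering; eps and minPts would need tuning
# (e.g., with dbscan::kNNdistplot) before interpreting the result
library(dbscan)
db <- dbscan(df, eps = 0.5, minPts = 10)
table(db$cluster)   # cluster 0 holds points labeled as noise
```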
NbClust Results

To decide on the optimal number of clusters for K-means clustering, we first made a scree plot of the within-cluster sum of squares (WSS) for different numbers of clusters. The scree plot helps identify the “elbow” point, which indicates the optimal number of clusters. We notice a significant inflection point at 2 clusters, suggesting that 2 is an optimal choice for the number of clusters.
# Standardize the variables, dropping the first column so only the
# five socio-economic indicators remain
df <- data.frame(scale(data[-1]))

# WSS for k = 1 is the total variance around the grand mean;
# then compute the within-cluster sum of squares for k = 2 through 20
wss <- (nrow(df) - 1) * sum(apply(df, 2, var))
for (i in 2:20) wss[i] <- sum(kmeans(df, centers = i)$withinss)

plot_data <- data.frame(
  Clusters = 1:20,
  WSS = wss
)
ggplot(plot_data, aes(x = Clusters, y = WSS)) +
geom_line(color = "#7dcfb6", size = 1) +
geom_point(color = "#f5641b", size = 1.5) +
labs(
title = "Scree Plot for Identifying Optimal Clusters",
x = "Number of Clusters",
y = "Within-Group Sum of Squares"
) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
The NbClust package provides a comprehensive set of 30 indices to determine the optimal number of clusters. These indices include the Silhouette method, Gap Statistic, and Dunn index, among others. We use the 26 most popular indices to determine the optimal number of clusters.
The NbClust function aggregates results from multiple clustering indices. Among these, 8 indices recommended 2 clusters, while fewer indices recommended any alternative number. Based on the majority rule, this suggests that 2 clusters is a robust choice for this dataset.
In addition, the D index measures the second differences of within-cluster variance across candidate numbers of clusters and helps identify the elbow beyond which adding more clusters does not significantly improve the fit. In this case, the D index also suggests that 2 clusters is a good choice.
Therefore, based on the majority recommendation from NbClust, we choose 2 as the optimal number of clusters for K-means clustering.
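A call along the following lines produces a summary like the one printed below (a sketch, assuming the scaled data frame `df`, Euclidean distance, and candidate solutions between 2 and 15 clusters):

```r
# Evaluate 26 cluster-validity indices for k-means partitions
library(NbClust)
set.seed(12)
nc <- NbClust(df, distance = "euclidean",
              min.nc = 2, max.nc = 15,
              method = "kmeans", index = "all")
```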
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 6 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
The plot below shows the number of clusters chosen by the criteria described above. The most popular choice is 2 clusters, recommended by 8 criteria; the second most popular is 15 clusters, recommended by 6 criteria; and 3 criteria each recommend 3 or 4 clusters.
best_n_table <- as.data.frame(table(nc$Best.n[1,]))
colnames(best_n_table) <- c("Clusters", "CriteriaCount")
ggplot(best_n_table, aes(x = Clusters, y = CriteriaCount)) +
geom_bar(stat = "identity", fill = "#f9ca0a", color = NA) +
labs(
x = "Number of Clusters",
y = "Number of Criteria",
title = "Number of Clusters Chosen by 26 Criteria"
) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
We ran the K-means clustering algorithm with the optimal number of clusters (2) and 25 random starts. The random starts help the algorithm avoid poor local minima and make it more likely that it converges to the best available solution; the final cluster assignments are based on the best solution found across all random starts. The table below shows the size of each cluster, i.e., the number of observations assigned to it. The first cluster contains 274 observations, while the second cluster contains 1,446 observations. The two clusters are not balanced, with the second cluster containing substantially more observations than the first, indicating that block groups resembling the second cluster are far more prevalent in the dataset.
set.seed(12)
# K-means with k = 2 and 25 random starts; the best solution is kept
fit.km <- kmeans(df, 2, nstart = 25)

cluster_sizes <- data.frame(
  Cluster = 1:length(fit.km$size),
  Size = fit.km$size
)
cluster_sizes %>%
kbl(
caption = "Cluster Sizes from K-Means Clustering",
col.names = c("Cluster", "Number of Points"),
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Cluster | Number of Points |
|---|---|
| 1 | 274 |
| 2 | 1446 |
The table below summarizes our K-means results across the five variables. Based on these standardized variables, all block groups were divided into two clusters, with 274 and 1,446 observations in Clusters 1 and 2, respectively.
# Mean of each (unscaled) variable within each cluster, plus cluster size
cluster_summary <- cbind(
  round(aggregate(data[-1], by = list(Cluster = fit.km$cluster), mean), 1),
  Size = fit.km$size
)

cluster_summary %>%
  kbl(caption = "Summary of K-Means Clustering Results",
      col.names = c("Cluster", "MEDHVAL", "PCTBACHMOR", "MEDHHINC", "PCTVACANT", "PCTSINGLES", "Size")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Cluster | MEDHVAL | PCTBACHMOR | MEDHHINC | PCTVACANT | PCTSINGLES | Size |
|---|---|---|---|---|---|---|
| 1 | 152495 | 46.9 | 51983 | 4.8 | 22.3 | 274 |
| 2 | 49952 | 10.2 | 27668 | 12.5 | 6.8 | 1446 |
As shown in the table, Cluster 1 represents neighborhoods with relatively high socioeconomic status; a suitable label might be Gentrified Neighborhoods. With a median home value over $150,000 and 46.9% of residents holding at least a bachelor’s degree, these areas are clearly more affluent, and the median household income is approximately $52,000. In addition, the low vacancy rate (4.8%) and the relatively high percentage of single-family detached housing units (22.3%) suggest a stable, in-demand housing environment that may attract younger, more educated professionals. Cluster 1 contains 274 observations, a smaller group that aligns with the idea that high-income tracts tend to be fewer in number.
In contrast, Cluster 2 appears to reflect communities of lower socioeconomic status; this group could be labeled Disadvantaged Neighborhoods. It has a significantly lower median household income of $27,668, and only 10.2% of residents hold a bachelor’s degree or higher. The housing vacancy rate is substantially higher (12.5%), more than twice that of Cluster 1, which may indicate housing instability or disinvestment. Additionally, only 6.8% of housing units are single-family detached homes, consistent with the denser rowhouse and apartment stock found in much of the city. With 1,446 observations, this cluster is much larger, suggesting that such communities are more prevalent and geographically widespread.
The solution appears to make sense given the clear distinctions in socioeconomic and demographic variables between the two clusters. Variables such as median home value, percentage of residents with a bachelor’s degree, and median household income align with expected patterns for affluent versus economically disadvantaged areas. In addition, differences in the percentage of vacant homes and the percentage of single-family detached units reinforce these interpretations: wealthier areas typically have lower vacancy rates and a higher proportion of single-family detached housing.
From the distribution map below, there are clear spatial patterns in the distribution of the K-means clusters, indicating spatial autocorrelation: observations falling into the same cluster tend to group geographically rather than being randomly distributed across the city. We can also observe that neighborhoods in Cluster 1 tend to concentrate in Center City and in parts of North and Northwest Philadelphia, areas commonly associated with economic development and gentrification. This spatial pattern further reinforces the validity of the cluster interpretation, as it aligns with known trends in urban socio-spatial inequality.
In summary, the map emphasizes how socioeconomic disparities are not only present in the data but also manifest in the physical landscape of Philadelphia. The clustering results highlight the need for targeted policies and interventions to address the challenges faced by disadvantaged neighborhoods while also considering the dynamics of gentrification in more affluent areas.
cluster_assignments <- data.frame(Observation = 1:nrow(df), ClusterID = fit.km$cluster)
philly_shape_with_clusters <- philly_shape %>%
left_join(cluster_assignments, by = c("POLY_ID" = "Observation"))
ggplot(philly_shape_with_clusters) +
geom_sf(aes(fill = as.factor(ClusterID)), color = NA) + # Set color to NA to remove stroke
scale_fill_manual(values = c("#7dcfb6", "#f5641b")) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.4)
) +
labs(title = "K-Means Clustering of Philadelphia Census Blocks",
subtitle = "2 Clusters",
fill = "Cluster ID")
Building on the clustering and spatial analysis presented above, this section reflects on observed patterns and notable outcomes.
The K-means clustering analysis grouped Philadelphia’s block groups into two clusters based on five variables: median house value, median household income, percent of residents with a bachelor’s degree or more, percent of single-family detached housing units, and percent of vacant housing units. The summary statistics for each cluster revealed distinct patterns in how these indicators vary across neighborhoods.
The cluster with the higher median income also had higher housing values and higher levels of educational attainment, while the cluster with the lower median income had lower educational attainment and a higher vacancy rate. These differences suggest that income, education, and vacancy levels are correlated in the data and that K-means was effective in separating areas based on these relationships.
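As a quick check of those relationships, the pairwise correlations among the raw indicators can be inspected directly (a sketch, assuming `data` holds the five variables under the column names used in the tables above):

```r
# Pairwise correlations among the five socio-economic indicators
round(cor(data[, c("MEDHVAL", "PCTBACHMOR", "MEDHHINC",
                   "PCTVACANT", "PCTSINGLES")]), 2)
```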
The map of cluster assignments showed that one cluster occupies large contiguous areas of the city while the other appears in smaller, more fragmented patches. This suggests that certain neighborhood types are geographically concentrated. However, without neighborhood labels or more detailed geographic analysis, further interpretation of the spatial distribution remains limited.
One somewhat surprising finding was the wide gap in vacancy rates between the two clusters: the disadvantaged cluster had an average vacancy rate of 12.5 percent, more than two and a half times the 4.8 percent observed in the affluent cluster. This highlights vacancy as a strong differentiating factor across neighborhoods, more so than might have been expected. Another notable point is that a two-cluster solution leaves no room for a middle group: block groups with moderate values across the five indicators are necessarily absorbed into one of the two clusters. Solutions with more clusters, which several NbClust criteria also supported, might separate out such transitional or mixed-profile neighborhoods that do not fit neatly into a high or low category.
Overall, this exercise revealed meaningful distinctions in neighborhood characteristics across Philadelphia. The patterns in the cluster summaries were clear, and the spatial visualization provided additional context that enhanced interpretation. While the results largely aligned with expectations about how socioeconomic indicators interact, the question of transitional or mixed-profile neighborhoods, which a two-cluster solution cannot capture, points to a direction for further analysis. This analysis demonstrates how unsupervised learning methods like K-means can help uncover underlying structure in urban data and support a more nuanced understanding of neighborhood dynamics.