In this study, I intend to analysis the neighborhoods indicators in the city of Philadelphia at the census tract level. Census tracts closely align with neighborhood boundaries and effectively reflect economic and social dynamics. Compared to block-level data, census tracts have relatively lower margin of errors when using American Community Survey (ACS) data, as the sample size is larger.
However, there are several limitations to this approach. First, neighborhood boundaries don’t always align with the census tract boundaries and may cross multiple tracts. Second, census tracts level analysis fails to notice smaller trends within specific neighborhood blocks. For example, in one particular census tract, one block group may have well transit coverage, while another block group may have limited access to public transportation. This can lead to an overgeneralization of neighborhood trends.
The datasets I used to calculate the economic indicators including following:
Median household income at the census tract level across Philadelphia, gathered from the 2022 5-year American Communities Survey (ACS) data.
The violent crime number and locations across Philadelphia gathered from the Philadelphia Police Department.
The bus ridership data of each individual station across Philadelphia gathered from the SEPTA.
The bus coverage (walking distance)
The median household income is a key indicator for measuring the residents’ neighborhood financial well-being. A neighborhood with a relatively higher median income always indicates residents have sufficient funds to address their daily needs and enjoy a better quality of life. The number of violent crimes is a key indicator of public safety. A lower number of crimes suggests that the neighborhood is safe for everyone. The bus ridership data is a key indicator of public transportation availability. A higher number of bus ridership indicates that the neighborhood has a better public transportation connection and more vibrant economic activities, as visitors and residents frequently flow in and out.
There are several limitations in the datasets selection. First, the median income values cannot fully represent the neighborhoods socio-economic status. For example, a disparity neighborhood with a high median income may still have a high poverty rate. Property values and poverty rate are all important factors need to take into consideration.
Second, the violent crime data only includes the total number crimes of the neighborhoods and not coverage the crime types. For example, residents experience is much different between a neighborhood with a high number of thefts and a neighborhood with a high number of murder.
Third, the bus ridership and the bus coverage data may not fully reveals the neighborhood connectivity and accessibility as well. For instance, a neighborhood with well bus coverage may lack of sufficient infrastructure to support biking or driving.To calculate the neighborhood index, we need to coverage as much of data that could reflects the residents quality of life as possible.
Convert median income into the CRS 4326 and filter out the census tracts with the missing income data
Remove unnecessary columns and convert the data into the CRS 4326
Step 1: Based a latitude and longitude of each stop points, converted the csv into a sf object
Step 2: First, calculate a variable called the
workday ridership
by summing the on and off ridership
together; Second, calculate another new variable called the
weekend ridership
by summing the on and off ridership of
Saturday and Sunday together and divded by 2; Finally, calculate a new
variable called the total ridership
by summing the
workday ridership
and weekend ridership
together, which reflect the neighborhood dynamics for both weekend and
weekdays.
ridership <- ridership %>%
mutate(workday_ridership = Weekday_On + Weekday_Of,
weekend_ridership = (Saturday_O + Saturday_1 + Sunday_Ons + Sunday_Off)/2,
total_ridership = workday_ridership + weekend_ridership)
Step 3: cleaning the data only include the variable necessary for the analysis
income + bus ridership
income_ridership<-income%>%
st_join(ridership)%>%
group_by(GEOID)%>%
summarise(income = mean(incomeE), total_ridership = sum(total_ridership))
income+ riderhip+ violent crime
st_buffer
to calculate the bus
coveragefilter the bus stop points that within the Philadelphia boundary
clipped_index <- st_within(ridership, income, sparse = FALSE)
bus_stops <- ridership[apply(clipped_index, 1, any), ]
In this analysis, 200 meters is used as the walking distance to calculate the bus coverage. The bus coverage is defined as the area that is within 200 meters of the bus station.
intersect the bus coverage with the census tracts
calculate the census tracts area
Calculate coverage area in each census tracts
calculate the bus coverage percentage
fix 0 ridership census tract
income_ridership_crime_bus<-income_ridership_crime_bus%>%
mutate(total_ridership = ifelse(is.na(total_ridership), 0, total_ridership))
categorize each individual variable into 5 categories based on the quantile, and assign the number 1-5 to each of the category
income_ridership_crime_bus<-income_ridership_crime_bus%>%
mutate(income_cat= ntile(income, 5))%>%
mutate(crime_ca= ntile(crime, 5))%>%
mutate(crime_cat= ifelse(crime_ca==1,5,
ifelse(crime_ca==2,4,
ifelse(crime_ca==3,3,
ifelse(crime_ca==4,2,1)))))%>%
select(-crime_ca)%>%
mutate(total_ridership_cat= ntile(total_ridership, 5))
composite and sum up
income_ridership_crime_bus<-income_ridership_crime_bus%>%
mutate(composite = income_cat + crime_cat + total_ridership_cat + coverage)
See the distribution of the composite variable
median_val <- median(income_ridership_crime_bus$composite)
sd_val <- sd(income_ridership_crime_bus$composite)
ggplot(income_ridership_crime_bus, aes(x = composite))+
geom_histogram(binwidth = 1, fill = "skyblue", color = "black")+
labs(title = "Distribution of the composite variable", x = "Composite variable", y = "Frequency",
caption = paste("Standard Deviation:", round(sd_val, 2)))+
geom_vline(xintercept = median_val, color = "red", linetype = "dashed")
Roughly divided the composite variable into 5 categories based the composite varible distribution, median, and standard deviation.
ggplot(income_ridership_crime_bus, aes(fill = composite_cat))+
geom_sf()+
scale_fill_manual(values = c(
"Very Low" = "#0081a7",
"Low" = "#00afb9",
"Medium" = "#fdfcdc",
"High" = "#fed9b7",
"Very High"="#f07167"),
breaks = c("Very Low", "Low", "Medium", "High", "Very High")
)+
labs(title = "Neighborhood Index in Philadelphia",
subtitle= "Based on bus ridership and coverage, median income and violent crime number at census tract level", fill = "Neighborhood Index")+
theme(
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 13, face = "italic"),
plot.title = element_text(size = 35, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
longer<-income_ridership_crime_bus%>%
pivot_longer(cols = c(income_cat, total_ridership_cat, crime_cat, coverage), names_to = "Predictors", values_to = "Values")
ggplot(longer, aes(fill = Values))+
geom_sf()+
facet_wrap(~Predictors, ncol = 2)+
labs(title = "Neighborhood Indicators in Philadelphia",
subtitle= "Based on bus ridership and coverage, median income and violent crime number at census tract level",
fill = "Values")+
scale_fill_gradientn(
colors = c( "#0081a7", "#00afb9", "#fdfcdc", "#fed9b7", "#f07167"),
limits = c(1,5),
breaks = c(1,5), # Where to place tick marks
labels = c("Low", "High")) +
theme(
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 13, face = "italic"),
plot.title = element_text(size = 35, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
The calculating process went smoothly. The only downside is that it takes too long to calculate the buffer of the bus stops and bus coverage. Also, compare to desktop GIS, like ArcGIS and QGIS, the spatial analysis using R has pros and cons. Using R, we can wrangling data more quickly, especially when dealing with multiple datasets. However, we cannot visualize the layer directly in R, especially during the processing stage.
The spatial patterns of the composite neighborhood index generally make sense. The center city Philadelphia and South Philadelphia generally have better transit connectivity, more vibrant population flows during the day, and generally higher income. In comparison, the Northeastern Philadelphia generally have lower income, less transit connectivity, and higher crime numbers.
In general, I want the index to measure the overall quality of life in each neighborhood. I think the index capture several key factors contributing to people’s quality of life, including income, public transportation accessibility and connectivity, and public safety. It could be useful to people who are new to Philadelphia and want to choose a safe place to live.
However, the index may not fully capture the neighborhood’s quality of life. For example, the index does not include the housing price, poverty rate, and other factors.
But overall, I think the index did a great job in get a people a sense of Philadelphia.
As mentioned before, the index cannot get the detailed vibe within each census tracts. There may be some internal differences within each census tracts. Second, only four variable could not cover everything people need to know about the neighborhood. For example, the index does not include the housing price, poverty rate, and other factors. Lastly, some predictors or index variables are closely correlated. For example, the bus coverage and bus ridership are closely correlated, as a neighborhood with better bus coverage may have higher bus ridership. This may lean to issues of multicollinearity.