Select the Data

Neighborhood boundaries and analysis units/levels

In this study, I intend to analysis the neighborhoods indicators in the city of Philadelphia at the census tract level. Census tracts closely align with neighborhood boundaries and effectively reflect economic and social dynamics. Compared to block-level data, census tracts have relatively lower margin of errors when using American Community Survey (ACS) data, as the sample size is larger.

However, there are several limitations to this approach. First, neighborhood boundaries don’t always align with the census tract boundaries and may cross multiple tracts. Second, census tracts level analysis fails to notice smaller trends within specific neighborhood blocks. For example, in one particular census tract, one block group may have well transit coverage, while another block group may have limited access to public transportation. This can lead to an overgeneralization of neighborhood trends.

Datasets selection

The datasets I used to calculate the economic indicators including following:

  • Median household income at the census tract level across Philadelphia, gathered from the 2022 5-year American Communities Survey (ACS) data.

  • The violent crime number and locations across Philadelphia gathered from the Philadelphia Police Department.

  • The bus ridership data of each individual station across Philadelphia gathered from the SEPTA.

  • The bus coverage (walking distance)

Load ACS median income data

# Load the ACS data
income<- get_acs(
  geography = "tract",
  variables = c(income= "B19013_001"),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  survey = "acs5",
  output = "wide",
  geometry = TRUE
)

Load violent crime data and ridership data

# Load the violent crime data
incident<- st_read("data/incident/incidents_part1_part2.shp")
# load the ridership data
ridership <- read_csv("data/bus.csv")

Justification

The median household income is a key indicator for measuring the residents’ neighborhood financial well-being. A neighborhood with a relatively higher median income always indicates residents have sufficient funds to address their daily needs and enjoy a better quality of life. The number of violent crimes is a key indicator of public safety. A lower number of crimes suggests that the neighborhood is safe for everyone. The bus ridership data is a key indicator of public transportation availability. A higher number of bus ridership indicates that the neighborhood has a better public transportation connection and more vibrant economic activities, as visitors and residents frequently flow in and out.

Limitations

There are several limitations in the datasets selection. First, the median income values cannot fully represent the neighborhoods socio-economic status. For example, a disparity neighborhood with a high median income may still have a high poverty rate. Property values and poverty rate are all important factors need to take into consideration.

Second, the violent crime data only includes the total number crimes of the neighborhoods and not coverage the crime types. For example, residents experience is much different between a neighborhood with a high number of thefts and a neighborhood with a high number of murder.

Third, the bus ridership and the bus coverage data may not fully reveals the neighborhood connectivity and accessibility as well. For instance, a neighborhood with well bus coverage may lack of sufficient infrastructure to support biking or driving.To calculate the neighborhood index, we need to coverage as much of data that could reflects the residents quality of life as possible.

Process & clean the Data

Cleaning the income data

Convert median income into the CRS 4326 and filter out the census tracts with the missing income data

income <- income %>%
  st_transform(crs = 4326) %>%
  filter(!is.na(incomeE))%>%
  select(-NAME)

Cleaning the violent crime data

Remove unnecessary columns and convert the data into the CRS 4326

incident <- incident %>%
  st_transform(crs = 4326) %>%
  select(text_gener, geometry)

Cleaning the ridership data

Step 1: Based a latitude and longitude of each stop points, converted the csv into a sf object

ridership <- ridership %>%
  st_as_sf(coords = c("Lon", "Lat"), crs = 4326)

Step 2: First, calculate a variable called the workday ridership by summing the on and off ridership together; Second, calculate another new variable called the weekend ridership by summing the on and off ridership of Saturday and Sunday together and divded by 2; Finally, calculate a new variable called the total ridership by summing the workday ridership and weekend ridership together, which reflect the neighborhood dynamics for both weekend and weekdays.

ridership <- ridership %>%
  mutate(workday_ridership = Weekday_On + Weekday_Of,
         weekend_ridership = (Saturday_O + Saturday_1 + Sunday_Ons + Sunday_Off)/2,
         total_ridership = workday_ridership + weekend_ridership)

Step 3: cleaning the data only include the variable necessary for the analysis

ridership <- ridership %>%
  select(total_ridership, workday_ridership, weekend_ridership, geometry)

Perfrom Spatial Operations

Step 1: Sptial join all data together

income + bus ridership

income_ridership<-income%>%
  st_join(ridership)%>%
  group_by(GEOID)%>%
  summarise(income = mean(incomeE), total_ridership = sum(total_ridership))

income+ riderhip+ violent crime

income_ridership_crime<-income_ridership%>%
  st_join(incident)%>%
  group_by(GEOID)%>%
  summarise(income = mean(income), total_ridership = sum(total_ridership), crime = n())

Step 2: Using st_buffer to calculate the bus coverage

filter the bus stop points that within the Philadelphia boundary

clipped_index <- st_within(ridership, income, sparse = FALSE)

bus_stops <- ridership[apply(clipped_index, 1, any), ]

In this analysis, 200 meters is used as the walking distance to calculate the bus coverage. The bus coverage is defined as the area that is within 200 meters of the bus station.

bus_coverage <- bus_stops %>%
  st_buffer(dist = 200) %>%
  st_union()

Step 3: Calculate the bus coverage percentage

intersect the bus coverage with the census tracts

census_tracts <- st_make_valid(income)
bus_coverage <- st_make_valid(bus_coverage)
intersect <- st_intersection(census_tracts, bus_coverage)

calculate the census tracts area

census_tracts$tract_area <- st_area(census_tracts)

Calculate coverage area in each census tracts

intersect$coverage_area <- st_area(intersect)

calculate the bus coverage percentage

intersect$coverage_percentage <- intersect$coverage_area / census_tracts$tract_area *100

Step 4: Categorize the bus coverage

intersect$percentage<- as.numeric(intersect$coverage_percentage)

bus_area<- intersect%>%
  mutate(coverage = ntile(percentage, 5))%>%
  select(GEOID,percentage,coverage)

Step 5: Join the bus coverage percentage back

bus_area<- bus_area%>%
  st_drop_geometry()

income_ridership_crime_bus<-income_ridership_crime%>%
  left_join(bus_area, by = "GEOID")

Step 6: Composite and categorize all individual variable together

fix 0 ridership census tract

income_ridership_crime_bus<-income_ridership_crime_bus%>%
  mutate(total_ridership = ifelse(is.na(total_ridership), 0, total_ridership))

categorize each individual variable into 5 categories based on the quantile, and assign the number 1-5 to each of the category

income_ridership_crime_bus<-income_ridership_crime_bus%>%
  mutate(income_cat= ntile(income, 5))%>%
  mutate(crime_ca= ntile(crime, 5))%>%
  mutate(crime_cat= ifelse(crime_ca==1,5,
                           ifelse(crime_ca==2,4,
                                  ifelse(crime_ca==3,3,
                                         ifelse(crime_ca==4,2,1)))))%>%
  select(-crime_ca)%>%
  mutate(total_ridership_cat= ntile(total_ridership, 5))

composite and sum up

income_ridership_crime_bus<-income_ridership_crime_bus%>%
  mutate(composite = income_cat + crime_cat + total_ridership_cat + coverage)

See the distribution of the composite variable

median_val <- median(income_ridership_crime_bus$composite)

sd_val <- sd(income_ridership_crime_bus$composite)

ggplot(income_ridership_crime_bus, aes(x = composite))+
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black")+
  labs(title = "Distribution of the composite variable", x = "Composite variable", y = "Frequency",
       caption = paste("Standard Deviation:", round(sd_val, 2)))+
  geom_vline(xintercept = median_val, color = "red", linetype = "dashed")

Roughly divided the composite variable into 5 categories based the composite varible distribution, median, and standard deviation.

income_ridership_crime_bus<-income_ridership_crime_bus%>%
  mutate(composite_cat= case_when(composite <= 7 ~ "Very Low",
                                  composite > 7 & composite <= 10 ~ "Low",
                                  composite > 10 & composite <= 13 ~ "Medium",
                                  composite > 13 & composite <= 16 ~ "High",
                                  composite > 16 ~ "Very High"))

Visulization and Mapping

Composite Neighborhood Index Map

ggplot(income_ridership_crime_bus, aes(fill = composite_cat))+
  geom_sf()+
  scale_fill_manual(values = c(
    "Very Low" = "#0081a7",
    "Low" = "#00afb9",
    "Medium" = "#fdfcdc",
    "High" = "#fed9b7",
    "Very High"="#f07167"),
    breaks = c("Very Low", "Low", "Medium", "High", "Very High")
    )+
  labs(title = "Neighborhood Index in Philadelphia",
       subtitle= "Based on bus ridership and coverage, median income and violent crime number at census tract level", fill = "Neighborhood Index")+
  theme(
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 14),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 13, face = "italic"),
        plot.title = element_text(size = 35, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

Map series shwoing individual predictors

longer<-income_ridership_crime_bus%>%
  pivot_longer(cols = c(income_cat, total_ridership_cat, crime_cat, coverage), names_to = "Predictors", values_to = "Values")

ggplot(longer, aes(fill = Values))+
  geom_sf()+
  facet_wrap(~Predictors, ncol = 2)+
  labs(title = "Neighborhood Indicators in Philadelphia",
       subtitle= "Based on bus ridership and coverage, median income and violent crime number at census tract level",
       fill = "Values")+
  scale_fill_gradientn(
    colors = c( "#0081a7",  "#00afb9", "#fdfcdc", "#fed9b7", "#f07167"),
    limits = c(1,5),
    breaks = c(1,5),          # Where to place tick marks
    labels = c("Low", "High")) +
  theme(
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 14),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 13, face = "italic"),
        plot.title = element_text(size = 35, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

Discussion

Q1: How things went? What spatial patterns do you observe and do they make sense?

The calculating process went smoothly. The only downside is that it takes too long to calculate the buffer of the bus stops and bus coverage. Also, compare to desktop GIS, like ArcGIS and QGIS, the spatial analysis using R has pros and cons. Using R, we can wrangling data more quickly, especially when dealing with multiple datasets. However, we cannot visualize the layer directly in R, especially during the processing stage.

The spatial patterns of the composite neighborhood index generally make sense. The center city Philadelphia and South Philadelphia generally have better transit connectivity, more vibrant population flows during the day, and generally higher income. In comparison, the Northeastern Philadelphia generally have lower income, less transit connectivity, and higher crime numbers.

Q2: How well does your index capture what you are trying to measure and how useful do you think it is?

In general, I want the index to measure the overall quality of life in each neighborhood. I think the index capture several key factors contributing to people’s quality of life, including income, public transportation accessibility and connectivity, and public safety. It could be useful to people who are new to Philadelphia and want to choose a safe place to live.

However, the index may not fully capture the neighborhood’s quality of life. For example, the index does not include the housing price, poverty rate, and other factors.

But overall, I think the index did a great job in get a people a sense of Philadelphia.

Q3: What limitations are there to your index.

As mentioned before, the index cannot get the detailed vibe within each census tracts. There may be some internal differences within each census tracts. Second, only four variable could not cover everything people need to know about the neighborhood. For example, the index does not include the housing price, poverty rate, and other factors. Lastly, some predictors or index variables are closely correlated. For example, the bus coverage and bus ridership are closely correlated, as a neighborhood with better bus coverage may have higher bus ridership. This may lean to issues of multicollinearity.