Accurately predicting home prices is essential for stakeholders, such as buyers, investors, and policymakers, as it includes important financial decisions, urban planning, and policy making initiatives. However, home price predictions remain a difficult tasks, as they are influenced by a wide range of factors, including the housing characteristics, neighborhood amenities, and the overall neighborhood economic vibration. In this analysis, I build a predictive model for home prices in Charlotte, NC, using a dataset of housing information in Mecklenburg County, NC. The model will be based on the internal and external or spatial factors to predict the housing prices. The internal factors may include some housing basic characteristics, like footage, number of bedrooms, etc. The external factors may include the crime rate, median household income, the coverage of the public transportation,and access to the park and public spaces. By identifying and engineering such factors, our predictive model aims to minimize the prediction errors and provides deeper insights into the housing market in Charlotte, NC.
My analytical approach start with loading and cleaning the providing existing housing datasets. I first explored the datasets and identify the missing values and potential outliers. Then, utilizing the feature engineering process, I created new features from the existing features to improve the model performance, including the home ages and the dummy variables for the air conditioning and heating system. Additionally, spatial data and features were gathered from the American Community Survey, and open data portal, including crimes data, proximity to recreational spaces, and tree canopy. Following data preparation, I partitioned the data into training and testing subsets using an 80-20 split to reliably assess the model’s predictive performance. The model was developed using Ordinary Least Squares (OLS) regression and tested across various combinations of variables. The final model was evaluated using diagnostic metrics (MAE, RMSE, R-squared) and spatial residual analyses, including Moran’s I test, to detect spatial autocorrelation in residuals. The analysis concludes by addressing model limitations and potential to improvements.
In this section, I clean the existing housing datasets and explore other sources of data that are important for the model building process. Second, I check the correlation between the variables to identify the multicollinearity between the variables, the linearity relationship between key predictors and the dependent variable, and the visual sense of autocorrelation in the data.
The first step for this analysis start by loading the household data of the Charlotte, NC area. The data is in the geojson format with over 46,000 records of the housing information in Mecklenburg County, NC.
The original dataset contain many information that are not closely related with the housing prices, such as zip code, owner’s name, tax code. I filtered the datasets that only contains information about the price, bedrooms, yearbuilt, fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade, units, heatedfuel,and actype of the houses. The data is loaded and cleaned by removing the missing values and the houses with price less than 0. For prediction purposes, the price equal or below 0 is not useful.
house<- house %>%
dplyr::select(price, bedrooms,yearbuilt,fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade,units, heatedfuel, actype) %>%
filter(!is.na(price) & !is.na(bedrooms) & !is.na(yearbuilt) & !is.na(fullbaths) & !is.na(halfbaths) & !is.na(heatedarea) & !is.na(storyheigh) & !is.na(numfirepla) & !is.na(bldggrade) & !is.na(units) & !is.na(heatedfuel) & !is.na(actype))%>%
filter(price>0)
After checking housing prices, a small number of homes in Charlotte have exceptionally high values, representing clear outliers. These extreme values could significantly distort our model’s ability to accurately predict typical housing prices. Therefore, I have excluded all homes valued above 10 million dollars to ensure the robustness and accuracy of our predictive model.
After the primary filtering, the dataset contains about 35,000 records of housing information in Mecklenburg County, NC.
For the prediction purpose, the normal distribution of the variable matters because it helps ensure the accuracy and reliability of the model. When the variable is normally distributed, it usually means that the erroes or residuals are also normally distributed.
The histogram of the price shows that the price is right skewed. To make the price more normally distributed, I took the log transformation of the price. The histogram of the log price shows that the log price is more normally distributed than the original price. I used the log price as the dependent variable in the model building process.
ggplot(house, aes(x = price)) +
geom_histogram(fill = "lightblue", bins= 500) +
labs(title = "Histogram of Price", x = "Price", y = "Frequency")
house <- house %>%
mutate(logprice = log(price))
ggplot(house, aes(x = logprice)) +
geom_histogram(fill = "lightblue", bins= 500) +
labs(title = "Histogram of Log Price", x = "Log Price", y = "Frequency")
The feature engineering is the process of creating new features from
the existing features. In this analysis, I created the age of the house,
from the housing built year. I also created the dummy variables for the
air conditioning and heating system, and the categorical variable
housing type
. The dummy variables for the air conditioning
and heating system are created by checking if the house has air
conditioning or heating system, as housing with air conditioning and
heating system tends to have higher price. The type variable is created
by categorizing the house as mansion, castle and normal house. The house
has more than 5000 square feet of heated area, more than 3 bedrooms, and
more than 3 full baths ia categorized as mansion. The house has more
than 10000 square feet of heated area, more than 5 bedrooms, and more
than 5 full baths is categorized as castle. The rest of the houses are
categorized as normal house.
The building grade is converted to a numeric variable by assigning a number to each category based on factors. The new features are created to improve the model performance.
Grade | Description | Typical Examples | Quality & Value Level |
---|---|---|---|
Minimum | Very basic, minimal standards,limited durability. | Old sheds, simple outbuildings, deteriorating structures | Lowest |
Fair | Below-average condition, noticeable wear or deferred maintenance. | Older homes/apartments with deferred maintenance, structures needing renovation | Below Average |
Average | Standard construction quality, functional and maintained condition. | Typical single-family homes, standard apartment units | Moderate |
Good | Above-average quality construction, good condition. | Newer suburban homes, renovated units, higher-quality commercial buildings | Above Average |
Excellent | Superior quality, exceptional finishes, and meticulous attention to detail. | High-end residences, luxury condominiums, upscale commercial properties | High |
Custom | Unique or specialized construction, highly personalized design,typically luxury or architecturally significant. | Luxury custom homes, unique historical restorations, architect-designed high-end properties | Highest or Specialized |
Source: https://localdocs.charlotte.edu/Tax_Collections/Reports_Studies/
house <- house %>%
mutate(
age = 2022 - yearbuilt,
storyheigh = as.numeric(gsub("[^0-9.]", "", storyheigh)),
bldggrade = case_when(
bldggrade == "MINIMUM" ~ 1,
bldggrade == "FAIR" ~ 2,
bldggrade == "AVERAGE" ~ 3,
bldggrade == "GOOD" ~ 4,
bldggrade == "VERY GOOD" ~ 5,
bldggrade == "EXCELLENT" ~ 6,
bldggrade == "CUSTOM" ~ 7),
ac_dummy = ifelse(actype == "AC-NONE", 0, 1),
heat_dummy = ifelse(heatedfuel == "NONE", 0, 1),
mansion= ifelse(heatedarea>5000 & heatedarea<10001 & bedrooms>2 & bedrooms<6 & fullbaths>2, 1, 0),
castle= ifelse(heatedarea>10000 & bedrooms>5 & fullbaths>5, 1, 0),
house= ifelse(mansion==0 & castle==0, 1, 0),
heatfuel = ifelse(heatedfuel == "NONE", NA, heatedfuel)
)
house <- house %>%
dplyr::select(-yearbuilt, -actype, -heatedfuel)%>%
mutate(type= case_when(
house==1 ~ "House",
mansion==1 ~ "Mansion",
castle==1 ~ "Castle"
))
Many other factors can affect the housing price that not contains in the given datasets. Many other spatial variables affect the housing price and value as well. For example, the housing values are closely correlate with the crime rates in the neighborhood, median household income and whether the house is close to the public transportation and access to the park and public spaces. All those variables should be included in the prediction model to improve the model performance.
The additional spatial data is collected from American Community Survey, City of Charlotte open data portal and other sources. The additional spatial data includes the crime rate, the distance to the public transportation, the distance to the park. The additional spatial data is collected and cleaned to be used in the model building process.
Median household income data was collected from the American Community Survey at the census block group level. The median household income values were then assigned to individual housing data points based on their geographic location.
income<- get_acs(
geography = "block group",
variables = c("income"="B19013_001"),
state = "NC",
county = "Mecklenburg",
geometry = TRUE
)%>%
st_transform(crs =st_crs(house))%>%
filter(!is.na(estimate))%>%
dplyr::select(estimate, GEOID, geometry)%>%
rename(income=estimate)
#spatial join the income data to the house data
house <- st_join(house, income)%>%
filter(!is.na(income))
Crime data was collected from the City of Charlotte open data portal from police station. And then, the crime data is aggregated at the census block group level same as the household income data. Finally, the crime data was spatially joined to the house data based on the house geolocation.
crime<- read.csv("data/CMPD.csv")%>%
filter(YEAR>=2021)
crime<- crime %>%
dplyr::select(LATITUDE_PUBLIC, LONGITUDE_PUBLIC, YEAR)%>%
st_as_sf(coords = c("LONGITUDE_PUBLIC", "LATITUDE_PUBLIC"), crs = 4326)%>%
st_transform(crime, crs =st_crs(house))
crime_tract<- st_join(house, crime)%>%
group_by(GEOID)%>%
summarise(crime=n())
house <- st_join(house, crime_tract)%>%
dplyr::select(-GEOID.x, -GEOID.y)%>%
filter(!is.na(crime))
Public transportation coverage data was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset measures the percentage of housing units located within 0.5 miles of a transit stop at the neighborhood level as defined by Mecklenburg County. The neighborhood-level transit coverage values were then assigned to individual housing data points based on their geographic location.
transit <- st_read("data/transportation.geojson")%>%
dplyr::select(X2023, geometry)%>%
rename(transit=X2023)
#spatial join the transit data to the house data
house <- st_join(house, transit)%>%
filter(!is.na(transit))
Tree Canopy data was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset represents the percentage of tree cover in each neighborhood. Like the transit coverage data, these values are linked to the housing data based on each property’s geolocation.
tree <- st_read("data/tree_canopy.geojson")%>%
dplyr::select(X2023, geometry)%>%
rename(canopy=X2023)
#spatial join the tree canopy data to the house data
house <- st_join(house, tree)
Approximate distance to the park data was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset measures the percentage of housing units within 0.5 miles of an outdoor public recreation area at the neighborhood level. Similar to other spatial datasets, the park accessibility data was assigned to individual housing records based on their geographic locations.
recreation<- st_read("data/recreation.geojson")%>%
dplyr::select(X2023, geometry)%>%
rename(recreation=X2023)
#spatial join the recreation data to the house data
house <- st_join(house, recreation)%>%
filter(!is.na(recreation))
The potential drawback of these approaches is that the data are aggregated at large geographic units rather than at the individual household level. Households within the same census block group or neighborhood may differ in their access to transit and other services. Additionally, households within the same neighborhood may vary significantly in income levels. Despite these limitations, the aggregated data represent the best available source for use in the model-building process.
The summary statistics and correlation matrix are calculated to understand the relationship between the variables. The summary statistics show the mean and standard deviation (SD) of the variables. The correlation matrix shows the correlation between the variables. The correlation matrix is used to understand the relationship between the variables and to identify the multicollinearity between the variables. Lastly, various scatter plots are created to visualize the relationship between the key predictors and the dependent variable to ensure the linearity relationship between the predictors and the dependent variable.
For the summary statistics, I calculated the mean and standard deviation of the variables. The summary statistics are calculated to understand the distribution of the variables. The summary statistics are used to understand the distribution of the variables and to identify the outliers in the data.
dependent_var <- "logprice"
predictors <- c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "income", "crime", "transit", "canopy", "recreation")
summary_stats <- house %>%
st_drop_geometry() %>%
dplyr::select(all_of(c(dependent_var, predictors))) %>%
summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
pivot_wider(names_from = Stat, values_from = Value)
summary_stats <- summary_stats %>%
mutate(Variable = case_when(
Variable == "logprice" ~ "Log transformed House Value",
Variable == "bedrooms" ~ "Number of bedrooms",
Variable == "fullbaths" ~ "Number of full baths",
Variable == "halfbaths" ~ "Number of half baths",
Variable == "heatedarea" ~ "House Square footage",
Variable == "storyheigh" ~ "Number of stories",
Variable == "numfirepla" ~ "Number of fireplaces",
Variable == "bldggrade" ~ "Building Grade",
Variable == "units" ~ "Number of units",
Variable == "age" ~ "Age of the house",
Variable == "income" ~ "Median household income",
Variable == "crime" ~ "Number of Crime in Census Block group",
Variable == "transit" ~ "% Transit Coverage",
Variable == "canopy" ~ "% Tree Canopy Coverage",
Variable == "recreation" ~ "% Recreation Coverage",
TRUE ~ Variable
))
summary_stats <- summary_stats %>%
mutate(
Mean = round(Mean, 2),
SD = round(SD, 2)
)
summary_stats <- summary_stats %>%
arrange(Variable == "Log transformed House Value")
predictor_rows <- which(summary_stats$Variable != "Log transformed House Value")
dependent_rows <- which(summary_stats$Variable == "Log transformed House Value")
# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred <- max(predictor_rows)
start_dep <- min(dependent_rows)
end_dep <- max(dependent_rows)
# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
add_header_above(c(" " = 1, "Statistics" = 2)) %>%
kable_styling(full_width = FALSE) %>%
group_rows("Predictors", start_pred, end_pred) %>%
group_rows("Dependent Variable", start_dep, end_dep)%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)
Variable | Mean | SD |
---|---|---|
Predictors | ||
Number of bedrooms | 3.52 | 0.91 |
Number of full baths | 2.29 | 0.83 |
Number of half baths | 0.60 | 0.53 |
House Square footage | 2366.73 | 1057.49 |
Number of stories | 1.65 | 0.48 |
Number of fireplaces | 0.79 | 0.47 |
Building Grade | 3.43 | 0.78 |
Number of units | 0.98 | 0.19 |
Age of the house | 28.26 | 24.04 |
Median household income | 111240.12 | 50146.79 |
Number of Crime in Census Block group | 156.21 | 144.64 |
% Transit Coverage | 52.13 | 39.83 |
% Tree Canopy Coverage | 50.31 | 14.57 |
% Recreation Coverage | 55.12 | 33.92 |
Dependent Variable | ||
Log transformed House Value | 12.88 | 0.54 |
The correlation matrix is computed to assess the relationships among the predictors. A high correlation between predictors indicates multicollinearity, which violates one of the key assumptions of the linear regression model.
predictor_vars <- house[, c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#a4133c", "white", "#a4133c"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))
Since there is a high correlation between “heated area” and “full baths,” I removed “full baths” from the model. For the same reason, I also removed “number of bedrooms” and “Storyheight”.
After removing, the final correlation matrix is following:
predictor_vars <- house[, c( "halfbaths", "heatedarea", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#a4133c", "white", "#a4133c"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))
Check the VIF
To ensure the multicollinearity between the variables, I checked the Variance Inflation Factor (VIF) of the variables. The VIF values are calculated to further check the multicollinearity between the variables. The VIF values are used to identify the multicollinearity between the variables. The VIF values above 5 indicate high multicollinearity between the variables. The VIF values are calculated to ensure the multicollinearity between the variables. The VIF values for all the variables in the model are below 3, which indicates that there is no multicollinearity between the variables.
model<- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = house)
# Compute VIF values
vif_values <- vif(model)
# Convert the VIF values to a data frame
vif_df <- data.frame(Variable = names(vif_values),
VIF = as.vector(vif_values))
# Display the VIF table using knitr::kable
knitr::kable(vif_df, caption = "Variance Inflation Factors for the Model")
Variable | VIF |
---|---|
halfbaths | 1.199884 |
heatedarea | 2.448252 |
numfirepla | 1.249302 |
bldggrade | 2.023239 |
units | 1.063245 |
age | 1.727478 |
ac_dummy | 1.176752 |
heat_dummy | 1.037838 |
income | 1.784013 |
crime | 1.604203 |
transit | 1.942375 |
canopy | 1.382799 |
recreation | 1.364279 |
Since two spatial data, number of crimes and transit coverage, are aggregated at census block group and neighborhood level and then assigned to individual household based on their geographic location, the house within the same neighborhood have been assigned same number of crime and transit coverage location. Based on the scatterplot, housing prices within the same neighborhood exhibit significant variation, although they share the same transit coverage and crime rates. In this way, there are only weak linear relationship between those two predictors and housing prices.
The other two variable, age of house and heated area, are collected at individual household level. The scatterplot shows that the relationship betwee housing price and age are pretty weak, especially in the lower-left portion, indicating that newer houses show substantial variation in their prices. In addition, there are also many housing at the same age shows variety of different prices. There is pretty clear positive linear relationship between the heated area (square footage of the house) and the housing prices. As the square footage of the house increases, the housing prices also tend to increase.
longer<-house %>%
pivot_longer(cols = c(
"heatedarea",
"age",
"crime",
"transit"),
names_to = "Variable",
values_to = "Value")%>%
st_drop_geometry()
ggplot(longer,aes(x = Value, y = price)) +
geom_point(color = "black", size= 0.4) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c("heatedarea" = "Heated Area",
"age" = "Age of the House",
"crime" = "Number of Crime",
"transit" = "Transit Coverage"
))) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8)) +
labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
x = "Value",
y = "House price")
The housing pattern of Charlotte, NC is visualized using the housing price data. I aggregatd the housing price data by mean housing price at the census block group level
The map illustrates distinct spatial disparities in housing prices across different census block groups in Charlotte, NC. Higher housing prices are concentrated in the central and southern parts of the city, as well as in northern suburban neighborhoods. Conversely, lower housing prices are more prevalent in the northern and western areas of the central city. The map also reveals spatial autocorrelation patterns, where higher housing price and lower housing prices are clustered together.
map<-st_join(house, income)%>%
group_by(GEOID)%>%
summarise(price=mean(price))%>%
st_drop_geometry()
map<- left_join(income, map, by="GEOID")%>%
filter(!is.na(price))
ggplot(map) +
geom_sf(aes(fill= q5(price)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(map, 'price')), 0),
name = 'Housing price') +
labs(title = "Average House Price by Census Block Group",
fill = "Price") +
theme(
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 10, face = "italic"),
plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
Average household income follows a similar spatial pattern to housing prices, with higher income levels concentrated in the central and southern parts of the city, while lower income levels are more prevalent in the northern and western areas of the central city.
Tree canopy coverage is higher in suburban neighborhoods compared to the central city, as suburban areas typically have more green spaces and room for trees.
Recreation facilities coverage is greater in central city areas, where public spaces and recreational facilities, such as parks, are more common. In contrast, suburban areas tend to have fewer public spaces and recreational facilities.
Transit coverage follows a similar pattern to recreational facilities, with central city areas having higher transit availability than suburban areas.
Q1<-ggplot(map) +
geom_sf(aes(fill= q5(income)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(map, 'income')), 0),
name = 'Income') +
labs(title = "Average Household Income by Census Block Group",
fill = "Income") +
theme(
legend.text = element_text(size = 4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
Q2<-ggplot(tree) +
geom_sf(aes(fill= q5(canopy)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
name = '% of Tree Coverage') +
labs(title = "Tree Canopy by Neighborhood",
fill = "% of Tree Coverage") +
theme(
legend.text = element_text(size =4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
recreation<-recreation%>% filter(!is.na(recreation))
Q3<-ggplot(recreation) +
geom_sf(aes(fill= q5(recreation)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
name = '% of Recreation Coverage') +
labs(title = "Recreation facilities coverage by Neighborhood",
fill = "% of Recreational facilities Coverage") +
theme(
legend.text = element_text(size = 4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
transit<-transit%>% filter(!is.na(transit))
Q4<-ggplot(transit) +
geom_sf(aes(fill= q5(transit)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(transit, 'transit')), 0),
name = '% of Transit Coverage') +
labs(title = "Public Transit coverage by Neighborhood",
fill = "% of Public Transit Coverage") +
theme(
legend.text = element_text(size = 4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
# Arrange the plots in a 2x2 grid
grid.arrange(Q1, Q2, Q3, Q4, ncol = 2)
In this section, I outline the model building and evaluation process. The modeling process involves iteratively refining an Ordinary Least Squares (OLS) regression model to predict housing prices in Charlotte, NC by assessing variable significance and diagnostic measures. Finally, I evaluate the model fit and predictive reliability using key metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.
In this model building process, I split the training and testing datasets based on 80% training and 20% testing to ensure the model have a better predictability.
I plug in different type of variables to the model to see which variables are significant, which variables are not, and the overfit of the model including the R-squared, MAE, and RMSE.
The initial model includes the key predictors such as the number of half baths, heated area, number of fireplaces, building grade, number of units, age of the house, median household income, number of crimes, transit coverage, tree canopy coverage, and recreation facilities coverage. The initial model is used to establish a baseline for comparison with subsequent models, and do not include any dummy variables or categorical variables.
model1 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age+ income + crime + transit + canopy + recreation, data = train)
summary_reg1 <- summary(model1)
coefficients_table <- as.data.frame(summary_reg1$coefficients)
coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|t|)` < 0.001, '***',
ifelse(coefficients_table$`Pr(>|t|)` < 0.01, '**',
ifelse(coefficients_table$`Pr(>|t|)` < 0.05, '*',
ifelse(coefficients_table$`Pr(>|t|)` < 0.1, '.', ''))))
coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|t|)`, digits = 3), coefficients_table$significance)
coefficients_table %>%
dplyr::select(Estimate,`Std. Error`,`t value`, p_value) %>%
kable(align = "r") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Estimate | Std. Error | t value | p_value | |
---|---|---|---|---|
(Intercept) | 11.3910827 | 0.0163744 | 695.666042 | 0*** |
halfbaths | -0.0111008 | 0.0039054 | -2.842400 | 0.004** |
heatedarea | 0.0002129 | 0.0000028 | 76.676260 | 0*** |
numfirepla | 0.0660415 | 0.0045020 | 14.669293 | 0*** |
bldggrade | 0.2364367 | 0.0034436 | 68.660365 | 0*** |
units | -0.0875891 | 0.0099904 | -8.767333 | 0*** |
age | 0.0011608 | 0.0000996 | 11.654407 | 0*** |
income | 0.0000024 | 0.0000001 | 47.187691 | 0*** |
crime | -0.0001382 | 0.0000164 | -8.413689 | 0*** |
transit | 0.0008452 | 0.0000659 | 12.833195 | 0*** |
canopy | -0.0024029 | 0.0001519 | -15.818608 | 0*** |
recreation | 0.0002657 | 0.0000648 | 4.102723 | 0*** |
test<-test%>%
mutate(predictions = predict(model1, newdata = test),
actual= logprice,
error= actual - predictions,
abserror= abs(error),
ape=(abs(actual-predictions)) / actual)
# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))
# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst
# Print the results
results <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)
# Print results using kable
kable(results, caption = "Model Performance Metrics")
Metric | Value |
---|---|
Mean Absolute Error (MAE) | 0.21552 |
Root Mean Squared Error (RMSE) | 0.30249 |
Test R-squared | 0.68026 |
Then, I added the two dummy variables, air conditioning and heating system, to the model to see if the dummy variables are significant to the model. The dummy variables are created to check if the house has air conditioning or heating system, as housing with air conditioning and heating system tends to have higher price.
As the results of the model of mean absolute error (MAE), root mean
squared error (RMSE), and R-squared are improved. However, the
heat_dummy
variable is not statistics significant,
according to the higher p-value.
model2 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = train)
summary_reg2 <- summary(model2)
coefficients_table <- as.data.frame(summary_reg2$coefficients)
coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|t|)` < 0.001, '***',
ifelse(coefficients_table$`Pr(>|t|)` < 0.01, '**',
ifelse(coefficients_table$`Pr(>|t|)` < 0.05, '*',
ifelse(coefficients_table$`Pr(>|t|)` < 0.1, '.', ''))))
coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|t|)`, digits = 3), coefficients_table$significance)
coefficients_table %>%
dplyr::select(Estimate,`Std. Error`,`t value`, p_value) %>%
kable(align = "r") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Estimate | Std. Error | t value | p_value | |
---|---|---|---|---|
(Intercept) | 11.2400971 | 0.0525498 | 213.8943796 | 0*** |
halfbaths | -0.0122958 | 0.0038851 | -3.1648794 | 0.002** |
heatedarea | 0.0002132 | 0.0000028 | 77.1854338 | 0*** |
numfirepla | 0.0571747 | 0.0045076 | 12.6840512 | 0*** |
bldggrade | 0.2356714 | 0.0034258 | 68.7926373 | 0*** |
units | -0.0826566 | 0.0099406 | -8.3150422 | 0*** |
age | 0.0015906 | 0.0001021 | 15.5714104 | 0*** |
ac_dummy | 0.1971968 | 0.0113131 | 17.4308814 | 0*** |
heat_dummy | -0.0386824 | 0.0507381 | -0.7623936 | 0.446 |
income | 0.0000023 | 0.0000000 | 46.5651561 | 0*** |
crime | -0.0001220 | 0.0000164 | -7.4529941 | 0*** |
transit | 0.0008427 | 0.0000655 | 12.8640985 | 0*** |
canopy | -0.0025510 | 0.0001513 | -16.8572208 | 0*** |
recreation | 0.0002895 | 0.0000644 | 4.4941101 | 0*** |
test<-test%>%
mutate(predictions = predict(model2, newdata = test),
actual= logprice,
error= actual - predictions,
abserror= abs(error),
ape=(abs(actual-predictions)) / actual)
# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))
# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst
# Print the results
results <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)
# Print results using kable
kable(results, caption = "Model Performance Metrics")
Metric | Value |
---|---|
Mean Absolute Error (MAE) | 0.21518 |
Root Mean Squared Error (RMSE) | 0.30156 |
Test R-squared | 0.68221 |
After that, I tested a new model only include the key predictors and
a categorical variable type
. The type variable categorize
the house into mansion and castle and normal home.
The new model’s mean absolute error (MAE), root mean squared error (RMSE), and R-squared are decreased from the initial model, but are higher than the dummy variable model. The new categorical variable is also statistics significant as indicated by lower p-value.
model3 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age+ income + crime + transit + canopy + recreation + type, data = train)
summary_reg3 <- summary(model3)
coefficients_table <- as.data.frame(summary_reg3$coefficients)
coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|t|)` < 0.001, '***',
ifelse(coefficients_table$`Pr(>|t|)` < 0.01, '**',
ifelse(coefficients_table$`Pr(>|t|)` < 0.05, '*',
ifelse(coefficients_table$`Pr(>|t|)` < 0.1, '.', ''))))
coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|t|)`, digits = 3), coefficients_table$significance)
coefficients_table %>%
dplyr::select(Estimate,`Std. Error`,`t value`, p_value) %>%
kable(align = "r") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Estimate | Std. Error | t value | p_value | |
---|---|---|---|---|
(Intercept) | 10.4827183 | 0.1604957 | 65.314631 | 0*** |
halfbaths | -0.0109420 | 0.0039005 | -2.805245 | 0.005** |
heatedarea | 0.0002217 | 0.0000030 | 75.129583 | 0*** |
numfirepla | 0.0646007 | 0.0044984 | 14.360823 | 0*** |
bldggrade | 0.2380297 | 0.0034497 | 68.999869 | 0*** |
units | -0.0876315 | 0.0099761 | -8.784129 | 0*** |
age | 0.0012400 | 0.0000999 | 12.414693 | 0*** |
income | 0.0000023 | 0.0000001 | 46.353011 | 0*** |
crime | -0.0001402 | 0.0000164 | -8.545983 | 0*** |
transit | 0.0008590 | 0.0000658 | 13.055373 | 0*** |
canopy | -0.0024351 | 0.0001517 | -16.049080 | 0*** |
recreation | 0.0002581 | 0.0000647 | 3.990978 | 0*** |
typeHouse | 0.8888463 | 0.1593939 | 5.576415 | 0*** |
typeMansion | 0.7627356 | 0.1592928 | 4.788262 | 0*** |
test<-test%>%
mutate(predictions = predict(model3, newdata = test),
actual= logprice,
error= actual - predictions,
abserror= abs(error),
ape=(abs(actual-predictions)) / actual)
# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))
# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst
# Print the results
results <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)
# Print results using kable
kable(results, caption = "Model Performance Metrics")
Metric | Value |
---|---|
Mean Absolute Error (MAE) | 0.21529 |
Root Mean Squared Error (RMSE) | 0.30205 |
Test R-squared | 0.68118 |
After twisting the variables, I built the final model showing as follow. The final model include some of the essential key predictors, dummy vairbles, and the categorical variables I tested earlier.
In this case, the coefficient indicate 1 unit increase of one
specific predictors, on average the house price change by \((e^{\beta} - 1) \times 100\%\), \({\beta}\) is the coefficient. The p-value
for all the predictors except heat_dummy
is less than
0.0001, which indicate all other variable are statistics significant in
explaining the variance of the housing price.
The adjusted R-squared of the model is approximately 0.6644, indicating that the model explains over 66.44% of the variance in housing prices. This suggests a strong fit for the training dataset.
model4 <- lm(logprice ~ heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = train)
summary_reg4 <- summary(model4)
coefficients_table <- as.data.frame(summary_reg4$coefficients)
coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|t|)` < 0.001, '***',
ifelse(coefficients_table$`Pr(>|t|)` < 0.01, '**',
ifelse(coefficients_table$`Pr(>|t|)` < 0.05, '*',
ifelse(coefficients_table$`Pr(>|t|)` < 0.1, '.', ''))))
coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|t|)`, digits = 3), coefficients_table$significance)
coefficients_table %>%
dplyr::select(Estimate,`Std. Error`,`t value`, p_value) %>%
kable(align = "r") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Estimate | Std. Error | t value | p_value | |
---|---|---|---|---|
(Intercept) | 10.3634828 | 0.1672683 | 61.9572404 | 0*** |
heatedarea | 0.0002195 | 0.0000029 | 76.1288020 | 0*** |
numfirepla | 0.0554717 | 0.0045051 | 12.3129810 | 0*** |
bldggrade | 0.2368023 | 0.0034308 | 69.0220888 | 0*** |
units | -0.0828568 | 0.0099315 | -8.3428651 | 0*** |
age | 0.0017495 | 0.0001003 | 17.4370498 | 0*** |
income | 0.0000023 | 0.0000000 | 46.1760057 | 0*** |
crime | -0.0001240 | 0.0000162 | -7.6649619 | 0*** |
transit | 0.0009508 | 0.0000618 | 15.3803487 | 0*** |
canopy | -0.0026327 | 0.0001507 | -17.4710248 | 0*** |
typeHouse | 0.8718119 | 0.1586232 | 5.4961179 | 0*** |
typeMansion | 0.7526765 | 0.1585185 | 4.7481924 | 0*** |
ac_dummy | 0.1930015 | 0.0113043 | 17.0732484 | 0*** |
heat_dummy | -0.0396974 | 0.0506951 | -0.7830625 | 0.434 |
The model also performs well on the testing dataset, as indicated by its error metrics. The mean absolute error (MAE) of 0.1745 indicates on average, the model’s predictions are off by about 0.17 units from the actual values. Similarly, the root mean squared error (RMSE) of 0.2285. The R-squared value of 0.683 indicates that the model explains 68.3% of the variation in the testing dataset. Overall, these results suggest that the model makes reasonably accurate predictions with relatively low errors and a strong fit to the data.
test<-test%>%
mutate(predictions = predict(model4, newdata = test),
actual= logprice,
error= actual - predictions,
abserror= abs(error),
ape=(abs(actual-predictions)) / actual)
# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))
# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst
# Print the results
results <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)
# Print results using kable
kable(results, caption = "Model Performance Metrics")
Metric | Value |
---|---|
Mean Absolute Error (MAE) | 0.21505 |
Root Mean Squared Error (RMSE) | 0.30126 |
Test R-squared | 0.68285 |
Based on the four standard diagnostic plots, the residuals vs fitted values plot shows a random pattern around 0. Although the smoother is not perfectly flat, but it does not show a strng systematic curve. The Q-Q plot shows that the residuals are normally distributed, as the points are close to the straight line. There are some points deviate from the diagonal at both tails indicating several potential outliers in the model. The Scale-Location plot shows that the residuals are homoscedastic, as the residuals are evenly distributed. For the residuals vs leverage plot, the red line does not show a strong slope, and the vertical spread is not drastically increasing or decreasing, which met the OLS assumptions of homoscadasticity. Overall, the OLS regressions’ assumptions are met.
Compare with the model ehich the dependent variable is not log transformed
I also check the diagnostics chart for non log transformed model. From following charts, we see that the Residual vs Fitted values plot shows a curved trend in the non-log transformed model, indicating that there may be a non linearity relationship or heteroskedasticity. The Q-Q plot shows that the residuals are not normally distributed, as the points deviate from the straight line. The Scale-Location plot shows that the residuals are not homoscedastic, as the residuals are not evenly distributed. The assumptions of the OLS regressions assumptions tend to violate much more significantly in the non-log transformed model than the log transformed model.
model5 <- lm(price ~ heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = train)
par(mfrow = c(2, 2))
plot(model5)
In this section, I conduct spatial analysis and residual diagnostics to evaluate the spatial patterns and autocorrelation of the residuals in the final model. I use Moran’s I to assess the spatial autocorrelation of the residuals and visualize the spatial distribution of the residuals on a map.
The scatter plot of predicted values vs. observed values shows a strong linear relationship between the two, indicating that the model’s predictions are close to the actual values.
model_final<-lm(logprice ~ heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = house)
fitted_values <- fitted(model_final)
residuals <- residuals(model_final)
house<- house %>%
mutate(fitted = fitted_values,
residuals = residuals)
ggplot(house, aes(x = fitted, y = logprice)) +
geom_point(color = "black", size = 0.5) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Predicted Values vs. Observed Values",
x = "Observed Values",
y = "Predicted Values") +
theme_light() +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))
The residuals exhibit a distinct spatial pattern. Higher residual values are predominantly concentrated in the central areas of the city. In contrast, the northern sections of the city tend to display comparatively lower residuals. This spatial distribution indicate that underlying factors that are not included in the model, also play roles in housing price.
residual_map<-st_join(income, house)%>%
group_by(GEOID)%>%
summarise(residual=mean(residuals))%>%
filter(!is.na(residual))
ggplot(residual_map) +
geom_sf(aes(fill= q5(residual)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(residual_map, 'residual')), 6),
name = 'Residuals') +
labs(title = "Residuals Spatial Distribution",
fill = "Residuals") +
theme(
legend.text = element_text(size =8),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 8, face = "italic"),
plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
To further assess the autocorrelation of the residuals, I perform a Moran’s I test on the residuals of the final model. Moran’s I quantifies spatial autocorrelation, checking whether the residuals display clustering patterns. This test helps determine the presence of spatial autocorrelation in the residuals of the final model.
# Extract the coordinates from the spatial dataframe
coords <- st_coordinates(test)
# Create a neighbor list using k-nearest neighbors (KNN) with k=5
neighborList <- knn2nb(knearneigh(coords, 5))
# Convert the neighbor list to a spatial weights matrix
spatialWeights <- nb2listw(neighborList, style="W")
moran <- moran.mc(test$error, spatialWeights, nsim = 999)
ggplot(as.data.frame(moran$res[c(1:999)]), aes(moran$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moran$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
theme_minimal()
# Extract the coordinates from the spatial dataframe
coords <- st_coordinates(house)
# Create a neighbor list using k-nearest neighbors (KNN) with k=5
neighborList <- knn2nb(knearneigh(coords, 5))
# Convert the neighbor list to a spatial weights matrix
spatialWeights <- nb2listw(neighborList, style="W")
moran <- moran.mc(house$residuals, spatialWeights, nsim = 999)
ggplot(as.data.frame(moran$res[c(1:999)]), aes(moran$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moran$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
theme_minimal()
The Moran’s I test results for both the whole and testing datasets show that the observed Moran’s I value is significantly higher than the permuted values, which indicate the presence of spatial autocorrelation in the residuals. This suggests that the residuals are not randomly distributed across the city, but rather display clustering patterns. The presence of spatial autocorrelation in the residuals indicates that there are spatial factors influencing housing prices that are not captured by the model.
The final model perfrom relative well to predict the house price with lower values in the northern part of the cities and the household with extreme higher values categorized as castle or mansions.
The model doesn’t perform well with house have relative higher values, mainly concentrated in the southern part of cities. There are many other factors not included in the model that may influence the housing prices, such as the quality of the house, the neighborhood, and the local amenities. The model also does not account for the impact of external factors such as economic conditions, market trends, and policy changes. The model may also be affected by the spatial autocorrelation of the residuals, indicating that there are spatial factors influencing housing prices that are not captured by the model.
To further improve the model, first, it is important to have a more comprehensive datasets that contains more variables such as income, structure at the individual household level. Second, aSecond, it is important to address the spatial autocorrelation of the residuals by incorporating spatial factors that influence housing prices. Third, it is important to consider the impact of external factors such as market trends, new development projects, like TOD, and policy.
In conclusion, my model is effective in predicting property values in the city of Charlotte only to a certain extent. While I am able to fit a model that account for variations of sale price in terms of their internal characteristics, spatial locations, as well as nearby amenities, there are still many limitations in this prediction model. First, many spatial data is aggregated at the census block group level rather than at individual household. Second, the model does not account for the impact of external factors such as economic conditions, market trends, and policy changes. Third, the model may be affected by the spatial autocorrelation of the residuals, indicating that there are spatial factors influencing housing prices that are not captured by the model. Despite these limitations, the model provides a useful framework for understanding the factors that influence housing prices in Charlotte, NC, and can be used as a starting point for further research and analysis.