Utilizing Spatial Statistics Algorithms for Spatial Predictions
In our world today, we have access to enormous amounts of geo-tagged data. Instead of letting it sit in a database or text file, we have the ability to utilize that information in various ways enabling us to create new information for better decision making.
One of the benefits to spatial statistics models is being able to analyze any point is space by calculating the correlation between nearby points and the point of interest. There are a number of spatial models that can be used, but my favorite is Kriging. While other methods such as Geographically Weighted Regression (GWR) or Inverse Distance Weighting (IDW) can be useful exploratory tools, in my experience, Kriging tends to scale better as the data increases and, in many cases, makes more accurate predictions.
In this article, I show how Kriging in R can be utilized to make predictions and provide Tip & Tricks using the police station data from the City of Chicago (data found here). The goal is to make interpolations of the frequency of crime in the city of Chicago at any given point. This spatial statistics technique allows us to harness the data we have, build a model to explain the spatial structure of the area, and make predictions at any other latitudinal/longitudinal point in space based on that spatial structure.
Before we begin, I’d like to discuss the data set so that we understand what is being predicted. The data can be found here and I downloaded the csv onto my local drive. The file has 22 variables and over 7 million rows, but for this tutorial we will only be utilizing the coordinate points (Latitude and Longitude), the Primary Type of crime, and the Description of the crime.
[Tips & Tricks] It’s important to Krige only the value of a specific category type. This means that we can interpolate points for only one type of crime at a time, otherwise the interpolated points won’t make any sense. Thus, the specific crime that will be focused on is Criminal Damage To Vehicle and we will use Kriging to interpolate the frequency of this type of crime in the city of Chicago.
R Code for Data Prep:
library(data.table)# fast read/write function
df_1 <- reported_crimes %>% filter(PrimaryType=='CRIMINAL DAMAGE') %>% filter(Description=='TO VEHICLE')
df_2 <- df_1 %>%
group_by(Latitude, Longitude) %>%
inner_join(df_1, by=c('Latitude', 'Longitude')) %>%
distinct(Latitude, Longitude, .keep_all = TRUE) %>%
select(Latitude, Longitude, n) %>%
mutate(location_id = group_indices(., Latitude, Longitude))
Once we filter the data, count the frequency of that crime (with variable name n) at every unique location (Latitude and Longitude), and add an index for each unique location (location_id), the data will look similar to Figure 1.
The idea behind kriging is to use a limited set of data points to predict other nearby points in a given area. This method allows scientists in the field to only sample of small number of data points and interpolate the rest which saves time and money. Therefore, to simulate this approach, we can randomly take 10,000 points as the training set to build the model (shown in Figure 2(a)) .
R Code for Creating Training Set:
random_generator <- sample(1:nrow(df_2), round(nrow(df_2)*.3), replace=F)
train_df <- df_2 %>%
filter(!location_id %in% random_generator) %>%
We then can use the model on a grid of points around the area of interest as shown in Figure 2(b). The grid of points were autogenerated by building a polygon and creating over 76,000 latitudinal/longitudinal points within the polygon (the code has been excluded from this article for brevity).
Building the Variogram
Once the data is prepped, the first step is to build a variogram and fit a curve function to it which can then be used to interpolate values for the grid of points. Thankfully, with the gstat package in R, this can be done easily using the variogram function.
R Code for Building Variogram Model:
coordinates(train_df) <- c("Longitude", "Latitude")
proj4string(train_df) <- CRS("+proj=longlat +datum=WGS84")
lzn.vgm <- variogram(log(n) ~ Latitude + Longitude, train_df, width=0.1)
lzn.fit = fit.variogram(lzn.vgm, vgm(c("Gau", "Sph", "Mat", "Exp")), fit.kappa = TRUE)
[Tips & Tricks] The log function was used to ensure that all interpolated values remain positive. The exp function will be used to normalize the results in the ‘Let’s Krige’ section of this article.
The variogram and the curve function can be plotted to understand the spatial structure of our area and can be shown in Figure 3. Visually speaking, we can see that the curve function (Matern) can describe the autocorrelation between points relatively well. We can see that around 1.1 km, the point pairs are no longer spatially correlated.
We can now use this information to krige the grid of points using the krige function from the gstat package. As mentioned previously, the grid of points have already been generated and have been omitted from this article for brevity. However, any of the other points in the data that have not been included in the training set can be used to interpolate points and test the accuracy.
R Code for Kriging:
coordinates(grid) <- c("Longitude", "Latitude")
proj4string(grid) <- CRS("+proj=longlat +datum=WGS84")
lzn.kriged <-krige(log(n) ~ Latitude + Longitude, train_df, grid, model = lzn.fit, maxdist=10, nmax=50)
# Retrieve interpolated values
predicted <- lzn.kriged@data$var1.pred %>%
rename(krige_pred = 1) %>%
variance <- lzn.kriged@data$var1.var %>%
mutate(variance = exp(variance))
Figure 4(a) shows the points that were used to build the model and is zoomed into uptown Chicago to see the data points better. Once the code has been applied and the grid points have been kriged, we can see the predictions shown in Figure 4(b). Displayed are the hot spots of where the model calculates the highest frequencies for Criminal Damage To Vehicle as well as the areas that may have lower frequencies of that crime. Kriging fills in the blanks of the data we don’t have. While we may have many data points to choose from in Figure 4(a), not every point in this area is covered. For all those points that we don’t have, Kriging comes into play and fills in the holes adding coverage to the area as shown in Figure 4(b).
Eloquently stated by Esri,
Spatial analysis allows you to solve complex location-oriented problems and better understand where and what is occurring in your world. It goes beyond mere mapping to let you study the characteristics of places and the relationships between them. Spatial analysis lends new perspectives to your decision-making.
Understanding our data spatially is an important step in making predictions. This is one of the reasons I like kriging. When we build our variogram (shown in Figure 3), it is clear when the autocorrelation between points stops which tells us at what distance, the data is no longer relevant. The variogram is then utilized to make interpolations on new data.
While the visuals are always exciting, testing the accuracy of the model is vital before implementing it in the real world. Although it was not discussed in this article, tests were conducted to measure the accuracy of the model using the train/test split method on the data (14% training and 86% testing) with an RMSE of 1.81 and MAE of 0.80.
With spatial statistics, we can turn our intuition of how spatially correlated the data might be and mathematically calculate exactly at what point that correlation ends. I hope this article helps you to brainstorm what other data sets might be spatially correlated and how Kriging can help you understand it better!