Exploratory Spatial Data Analysis

Introduction

Following Anselin (1998), exploratory spatial data analysis (ESDA) is a collection of techniques to describe and visualize spatial distributions; identify atypical locations or spatial outliers; discover patterns of spatial association, clusters or hot-spots; and suggest spatial regimes or other forms of spatial heterogeneity. Central to this conceptualization is the notion of spatial autocorrelation or spatial association, i.e., the phenomenon where locational similarity (observations in spatial proximity) is matched by value similarity (attribute correlation). True ESDA pays attention to both spatial and attribute association.

ESDA is a subset of exploratory data analysis (EDA) methods that focus on the distinguishing characteristics of geographical data and, specifically, on spatial autocorrelation and spatial heterogeneity. EDA uses graphical and visual methods to identify properties of data for the purposes of pattern detection, hypothesis formulation, and aspects of model assessment (e.g., goodness-of-fit). With EDA the emphasis is on descriptive methods rather than formal hypothesis testing. Tukey (1977) is the classic citation on EDA. ESDA techniques can help detect spatial patterns in data, formulate hypotheses based on the geography of the data, and assess spatial models. ESDA requires that numerical and graphical procedures be linked with a map. This linkage allows a researcher to answer questions such as “Where are those observations/outliers?”

Visual Inspection of Mapped Data

Before estimating models with data that has been aggregated to geographic levels (e.g., census tract, neighborhood, county, state), it is best to use ESDA. To do so, a researcher can begin by examining various forms of dynamically linked windows. GeoDa (Legacy GeoDa 2005), one of several software packages for conducting ESDA (others include STARS and ESTAT within the GeoVista Studio), can generate box plots, histograms, and scatterplots in addition to its mapping capabilities. The researcher maps the variables that will be used in the analysis and visually inspects each map for spatial patterns. It is most important to do this for the dependent variable. Here, we present an example of ESDA using the county-level teenage birth rate.

Figure 1 shows the distribution of teenage birth rates across counties throughout the United States using the quantile classification. Counties with low birth rates are displayed in ivory and, as the birth rate increases, the shading becomes progressively darker. As shown in the map, the counties in the southern half of the United States tend to have higher teenage birth rates, with additional pockets of high teenage births in Kentucky and southern West Virginia.

Figure 1. Teenage birth rates by county, United States (quantile classification).
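To make the quantile classification concrete, the following is a minimal sketch of how class breaks for such a map could be computed. The function names `quantile_breaks` and `classify` and the sample rates are hypothetical illustrations, not GeoDa's implementation and not the data mapped in Figure 1.

```python
def quantile_breaks(values, k=5):
    """Return the k-1 upper bounds that split values into k roughly equal-count classes."""
    ordered = sorted(values)
    n = len(ordered)
    # The j-th break is the value ranked at ceil(j * n / k), using integer arithmetic.
    return [ordered[(j * n + k - 1) // k - 1] for j in range(1, k)]


def classify(value, breaks):
    """Return the 0-based class of value: the number of breaks it exceeds."""
    return sum(value > b for b in breaks)


# Hypothetical county rates (births per 1,000); not the data mapped in Figure 1.
rates = [12.4, 55.1, 30.8, 41.7, 22.3, 67.9, 38.2, 18.6, 49.5, 27.0]
breaks = quantile_breaks(rates, k=5)
classes = [classify(r, breaks) for r in rates]
print(breaks)
print(classes)
```

Each class receives about the same number of counties, so the shading reflects rank rather than absolute differences in the rate. Tied values can make the classes uneven.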

Spatial Autocorrelation

Since we have identified an uneven spatial distribution of the teenage birth rate across counties (Figure 1), we anticipate that counties will not be independent of each other. Therefore, we need to test for spatial autocorrelation. Spatial autocorrelation is similarity in values among nearby locations: counties near each other are more similar. For example, if either high or low teenage birth rates are found in a county and its adjacent counties, it would be an instance of positive spatial autocorrelation. Spatial clustering is said to occur with positive spatial autocorrelation (Anselin et al., 2000). In contrast, negative spatial autocorrelation would occur if a county with a high teenage birth rate is surrounded by counties with low teenage birth rates, or a county with a low teenage birth rate is surrounded by counties with high teenage birth rates. In that case, negative spatial autocorrelation is present in the form of spatial outliers (Anselin et al., 2000). For those counties where no correlation exists between teenage birth rates and their locations, the spatial pattern is considered to exhibit zero spatial autocorrelation (Holt, 2007). Spatial randomness, where any grouping of high or low values would be just as likely to occur as any other arrangement, is the point of reference in spatial autocorrelation (Anselin et al., 2000).

In order to define the neighbors of counties, a contiguity spatial weights matrix was created in GeoDa from the county boundary shapefile of the continental United States. For this example, a first-order Queen contiguity (Queen’s 1) spatial weights matrix was created and used. A Queen’s 1 spatial weight includes all counties that have any point in common with a given county, that is, the counties that share a boundary with it as well as those that touch it only at a corner (Anselin, 2005). This is a contiguity-based spatial weight, in contrast to spatial weights based on distance thresholds. A spatial weights matrix is used to formalize locational similarity (Anselin et al., 2000).
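The Queen criterion can be illustrated with a minimal sketch on a regular grid, where each cell stands in for a county. This is an assumption made for illustration only: real counties are irregular polygons, and GeoDa derives contiguity from the shapefile geometry. The function names and the row-standardization step are likewise our own choices, not something stated above.

```python
def queen_neighbors(n_rows, n_cols):
    """Map each grid cell index to the cells that share an edge or a corner with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            neighbors[cell] = [
                rr * n_cols + cc
                for rr in range(max(r - 1, 0), min(r + 2, n_rows))
                for cc in range(max(c - 1, 0), min(c + 2, n_cols))
                if (rr, cc) != (r, c)  # a cell is not its own neighbor
            ]
    return neighbors


def row_standardize(neighbors):
    """Give each neighbor an equal weight so that every row of weights sums to one."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}


# Hypothetical 3 x 3 grid: the center cell touches all eight surrounding cells.
neighbors = queen_neighbors(3, 3)
weights = row_standardize(neighbors)
print(neighbors[4])
print(weights[0])
```

A rook criterion would keep only the cells that share an edge, so corner cells of the grid would have two neighbors instead of three.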

Moran’s I
The Moran’s I statistic was calculated for the dependent variable, the teenage birth rate, as well as for all of the independent variables that were to be used in the analysis. Nine hundred ninety-nine random permutations were run for each Moran’s I statistic calculated. These random permutations recalculate the statistic many times to generate a reference distribution and a pseudo significance level (Anselin, 2005). The reference distribution simulates spatial randomness by randomly rearranging the observed values over the available locations and recalculating the statistic for each random arrangement (Anselin et al., 2000).
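The permutation approach can be sketched as follows. This is a minimal illustration under stated assumptions, not GeoDa's code: the weights are row-standardized, the locations form a hypothetical chain (`chain_weights`), the values are made up, and the pseudo p-value is one-sided (upper tail), computed as (number of permuted statistics at least as large as the observed one + 1) / (permutations + 1).

```python
import random


def chain_weights(n):
    """Hypothetical layout: n locations in a line, each neighboring the adjacent ones."""
    weights = {}
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        weights[i] = {j: 1.0 / len(nbrs) for j in nbrs}  # row-standardized
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * z[i] * z[j] for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in z)


def permutation_test(values, weights, permutations=999, seed=0):
    """Return the observed Moran's I and its one-sided pseudo p-value."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign the observed values to random locations
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Hypothetical values that rise steadily along the chain (strong positive autocorrelation).
values = [float(v) for v in range(20)]
observed, p = permutation_test(values, chain_weights(20))
print(observed, p)
```

With 999 permutations the smallest attainable pseudo p-value is 0.001, which is why results from this procedure are often reported as p<0.001 or p=0.001.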

For this example, spatial autocorrelation was identified for each of the independent variables that were to be used in the analysis as well as for the dependent variable, the teenage birth rate. For the sake of brevity, we will only present the Moran’s I statistic using the Queen’s 1 spatial weight for the dependent variable. A Moran scatterplot is displayed in Figure 2. A Moran scatterplot is centered on the mean and has the variable of interest on the x-axis and its spatial lag on the y-axis (Anselin, 2005; Anselin et al., 2000). Each quadrant in the scatterplot corresponds to a different type of spatial autocorrelation (Anselin, 2005). High-high (upper right) and low-low (lower left) represent positive spatial autocorrelation (spatial clusters), while low-high (upper left) and high-low (lower right) represent negative spatial autocorrelation (spatial outliers) (Anselin, 2005; Anselin et al., 2000). The slope of the linear regression line that runs through the scatterplot is the Moran’s I coefficient (Anselin et al., 2000).

Figure 2. Moran scatterplot of the teenage birth rate (Queen’s 1 spatial weight).
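The construction of a Moran scatterplot can be sketched in a few lines. The example below assumes row-standardized weights (under which the regression slope equals Moran's I) and uses a hypothetical four-location chain with made-up values; the function names are ours, and values exactly at the mean are arbitrarily treated as "high".

```python
def spatial_lag(z, weights):
    """Weighted average of the neighbors' centered values for each location."""
    return [sum(w * z[j] for j, w in weights[i].items()) for i in range(len(z))]


def moran_scatter(values, weights):
    """Return centered values, their spatial lags, quadrant labels, and the slope."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    lag = spatial_lag(z, weights)
    # First letter: own value relative to the mean; second letter: neighbors' average.
    quadrants = [
        ("H" if zi >= 0 else "L") + ("H" if li >= 0 else "L")
        for zi, li in zip(z, lag)
    ]
    # OLS slope of lag on z; because z has mean zero, the intercept drops out.
    slope = sum(zi * li for zi, li in zip(z, lag)) / sum(zi * zi for zi in z)
    return z, lag, quadrants, slope


# Hypothetical chain of four locations with row-standardized weights.
weights = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5}, 3: {2: 1.0}}
values = [2.0, 4.0, 6.0, 8.0]
z, lag, quadrants, slope = moran_scatter(values, weights)
print(quadrants, slope)
```

Here "HL" and "LH" mark locations whose own value and neighborhood average fall on opposite sides of the mean, which is the outlier pattern described above.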

From examining the Moran scatterplot (Figure 2) and identifying a significant Moran’s I statistic (0.370; p<0.001), we find clear evidence of spatial autocorrelation in the teenage birth rate. However, the question remains, “What causes spatial autocorrelation in teenage birth rates?” Voss et al. (2006), in a study of spatial autocorrelation in county-level child poverty rates, thoroughly discuss four mechanisms that can cause spatial autocorrelation. Briefly, these mechanisms are: (1) feedback, which is when individuals and households with residential proximity frequently interact and influence each other; (2) grouping forces, where individuals and households with common characteristics are clustered together either by choice or because they are constrained to live together because of social, economic, or political forces; (3) grouping responses, which is when individuals or households that share common characteristics respond similarly to external forces; and (4) nuisance autocorrelation, which occurs when the underlying spatial process creates regions of attribute value clusters that are larger than the unit of analysis used (375–376). Any one of these mechanisms, or a combination of them, could be the reason why spatial autocorrelation exists in the teenage birth rate.

Local Indicator of Spatial Association (LISA)
We have identified that spatial autocorrelation exists in the teenage birth rate. However, the Moran’s I statistic only indicates the presence of spatial autocorrelation globally; it does not provide information on the specific locations of spatial patterns (Holt 2007). In order to determine the location and magnitude of spatial autocorrelation, Anselin’s local indicator of spatial association (LISA) is necessary. Anselin (1994) defines LISA as any statistic that satisfies the following two requirements:

  1. “the LISA for each observation gives an indication of the extent of significant spatial clustering of similar values around that observation;
  2. the sum of LISAs for all observations is proportional to a global indicator of spatial association” (2).

For this example, a LISA cluster map was created for the teenage birth rate using 999 random permutations. In addition, the significance level was changed from 0.05 to 0.01, because Anselin (1995) suggests that 0.05 may not be an appropriate significance cut-off for LISA cluster maps. Figure 3 displays the LISA cluster map for the teenage birth rate. The high-high and low-low locations are referred to as spatial clusters, and the high-low and low-high locations are spatial outliers (Anselin 2005). Spatial clusters and spatial outliers of teenage birth rates are visually apparent in Figure 3. In general, clusters of high teenage birth rates occur in the South, while clusters of low teenage birth rates are apparent in the Northeast and the eastern part of the Midwest. Spatial outliers are located in the West.

Figure 3. LISA cluster map of the teenage birth rate (999 permutations; significance level 0.01).
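A local Moran statistic, one common LISA, can be sketched as follows. This is an illustration under stated assumptions rather than GeoDa's implementation: weights are row-standardized, locations form a hypothetical chain, the values are made up, and significance uses a conditional permutation (the value at location i is held fixed while the remaining values are randomly assigned to its neighbors) with a folded pseudo p-value, which is one convention among several.

```python
import random


def chain_weights(n):
    """Hypothetical layout: n locations in a line, row-standardized weights."""
    weights = {}
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        weights[i] = {j: 1.0 / len(nbrs) for j in nbrs}
    return weights


def local_moran(values, weights, permutations=999, alpha=0.01, seed=0):
    """Return local Moran's I, a pseudo p-value, and a cluster label per location."""
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # second moment used to scale the statistic
    results = []
    for i in range(n):
        nbrs = list(weights[i].items())
        lag = sum(w * z[j] for j, w in nbrs)
        local_i = z[i] / m2 * lag
        others = z[:i] + z[i + 1:]  # values eligible to be placed at the neighbors
        larger = 0
        for _ in range(permutations):
            draw = rng.sample(others, len(nbrs))
            simulated = z[i] / m2 * sum(w * d for (_, w), d in zip(nbrs, draw))
            if simulated >= local_i:
                larger += 1
        larger = min(larger, permutations - larger)  # fold to the smaller tail
        p = (larger + 1) / (permutations + 1)
        quadrant = ("H" if z[i] >= 0 else "L") + ("H" if lag >= 0 else "L")
        results.append({"local_i": local_i, "p": p,
                        "label": quadrant if p <= alpha else "ns"})
    return results


# Hypothetical values: a run of low rates followed by a run of high rates.
values = [3, 4, 2, 5, 3, 4, 2, 3, 4, 5, 20, 22, 21, 23, 20, 22, 21, 23, 22, 20]
results = local_moran(values, chain_weights(20))
print([r["label"] for r in results])
```

With row-standardized weights the local statistics sum to n times the global Moran's I, which is the proportionality named in the second requirement above. Locations labeled "ns" would be left unshaded on a cluster map.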

Why use ESDA?

When estimating models with data that has been aggregated to geographic levels, in this case counties, it is common to find spatially autocorrelated residuals (Voss et al., 2006). It is important to test for spatial autocorrelation when using county-level data because, when spatial autocorrelation exists, regression analysis of spatially distributed variables can lead to incorrect statistical inference unless proper corrections for spatial effects are incorporated in the model specification (Voss et al., 2006).
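One simple way to look for spatially autocorrelated residuals is to compute Moran's I on the residuals of a fitted model. The sketch below does this for a one-predictor ordinary least squares fit; the data, the four-location chain of weights, and the function names are hypothetical. It reports the statistic only: formal inference for regression residuals requires a test designed for residuals, which is not shown here.

```python
def ols_residuals(x, y):
    """Fit y = a + b * x by ordinary least squares and return the residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * z[i] * z[j] for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in z)


# Hypothetical data: the predictor alternates, but the outcome also rises along the chain.
weights = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5}, 3: {2: 1.0}}
x = [0.0, 1.0, 0.0, 1.0]
y = [0.0, 1.0, 2.0, 3.0]
residuals = ols_residuals(x, y)
residual_i = morans_i(residuals, weights)
print(residuals, residual_i)
```

A clearly positive value suggests that neighboring locations are over- or under-predicted together, which is the situation in which the independence assumption for the errors is in doubt.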

When spatial autocorrelation exists in regression models, the independence assumption for errors is violated, and statistical inference is unreliable (Voss et al., 2006). According to Voss et al. (2006), statistical inference is unreliable because (1) “the estimated regression parameters are biased and inconsistent, or (2) standard errors of the parameter estimates are biased” (377). If spatial autocorrelation exists in an ordinary least squares regression model, the estimated standard errors are smaller than the actual standard errors of the estimated coefficients (Voss et al., 2006). This can result in claiming that some coefficients are statistically significant when they are not (Voss et al., 2006). These effects can be corrected by using more properly specified models.

Because the teenage birth rate data are spatially autocorrelated, caution must be used with analytic techniques, such as ordinary least squares regression, that rely upon the assumption that observations are independent (Holt, 2007). The teenage birth rate data violate this assumption; therefore, further analyses will need to be conducted to determine whether the spatial autocorrelation is affecting the results of traditional ordinary least squares regression models. Spatial regression modeling techniques are described elsewhere on this website.

Additional Resources

Andrienko, N. and G. Andrienko. 2005. Exploratory analysis of spatial and temporal data: A systematic approach. New York: Springer.

Exploratory Spatio-Temporal Analysis Toolkit (ESTAT)

GeoDa Center for Geospatial Analysis and Computation

Space-Time Analysis of Regional Systems (STARS)

References

Anselin, L. 1994. Local indicators of spatial association - LISA. Research Paper 9331: 1–25.
Anselin, L. 1995. Local indicators of spatial association - LISA. Geographical Analysis 27: 93–115.
Anselin, L. 1998. Exploratory spatial data analysis in a geocomputational environment. Pp. 77–94 in Geocomputation: A Primer, edited by P.A. Longley, S.M. Brooks, R. McDonnell, and W. Macmillan. New York: Wiley and Sons.
Anselin, L. 2005. Exploring spatial data with GeoDa: A workbook. Urbana-Champaign: Center for Spatially Integrated Social Science.
Anselin, L., J. Cohen, D. Cook, W. Gorr, and G. Tita. 2000. Spatial analysis of crime. Criminal Justice 4: 213–262.
Holt, J.B. 2007. The topography of poverty in the United States: A spatial analysis using county-level data from the community health status indicators project. Preventing Chronic Disease 4(4): 1–9.

Legacy GeoDa. 2005. Open GeoDa. Edited by L. Anselin, I. Syabri, and Y. Kho. Tempe, AZ: GeoDa Center for Geospatial Analysis and Computation.

Tukey, J.W. 1977. Exploratory data analysis. Reading: Addison-Wesley.

Voss, P.R., D.D. Long, and R.B. Hammer. 2006. County child poverty rates in the U.S.: A spatial regression approach. Population Research and Policy Review 25: 369–391.

by Carla Shoff, Ph.D.
Research Associate
Population Research Institute
The Pennsylvania State University