Working with GIS data manually

Let's take a brief look at the process of working with GIS data manually. Before we can begin, there are two things you need to do:

  • Obtain some GIS data
  • Install the GDAL library and its Python bindings so that you can read the necessary data files

Obtaining the data

Let's use the US Census Bureau's web site to download a set of vector maps for the various US states. The main site for obtaining GIS data from the US Census Bureau can be found at http://www.census.gov/geo/www/tiger.

To make things simpler, though, let's bypass the web site and directly download the file we need from http://www2.census.gov/geo/tiger/TIGER2014/STATE/tl_2014_us_state.zip.

The file, tl_2014_us_state.zip, should be a ZIP-format archive. After uncompressing the archive, you should have a directory containing the following files:

  • tl_2014_us_state.dbf
  • tl_2014_us_state.prj
  • tl_2014_us_state.shp
  • tl_2014_us_state.shx

These files make up a shapefile containing the outlines of all the US states. Place these files together in convenient directory.

Note

There will be a few extra files with a suffix of .xml. These contain additional information about the downloaded data but are not part of the shapefile itself.

Installing GDAL

We next want to download and install the GDAL Python library. The main web site for GDAL can be found at http://gdal.org.

How you install GDAL depends on which operating system you are running.

Installing GDAL on Linux

To install GDAL on a Linux machine, you would typically use your computer's package manager to install the GDAL package itself. For example, on Fedora Linux, you would use the following command:

yum install gdal gdal-devel

Note that you need to install the gdal-devel package as well as gdal itself. The gdal-devel package includes the files needed to compile the Python bindings for GDAL.

Once you have installed GDAL itself, you'll need to install and build the Python bindings. These Python bindings can be found at https://pypi.python.org/pypi/GDAL. Typically, you would use a command such as the following:

sudo easy_install GDAL

Obviously, you will need to have Python installed before you can install the Python bindings for GDAL.

Installing GDAL on Mac OS X

To install GDAL on a Mac, you need to install the GDAL library itself, and then compile the Python bindings to use the version of GDAL you installed. Let's work through the various steps one at a time:

  1. To download GDAL, go to http://www.kyngchaos.com/software/frameworks. You will need to install the GDAL Complete package. This package includes a precompiled version of GDAL, which you should install onto your computer.
  2. Once you have installed GDAL, you next need to compile the Python bindings. To do this, you first need to install Xcode, which is Apple's development environment. Xcode can be downloaded for free from the Mac App store.

    Note

    Note that if your computer runs a version of Mac OS X lower than 10.9 (Yosemite), you will need to separately install the command-line tools for Xcode. To do this, start up Xcode and choose the Preferences… command from the Xcode menu. In the Downloads tab will be an option to install the command-line tools; enable this option and wait for the required tools to be installed.

  3. Once you have Xcode installed, you can download and compile the Python bindings for GDAL. The easiest way to do this is to use pip, the Python package manager. If you don't already have pip, you can install it by following the instructions at https://pip.pypa.io/en/latest/installing.html.
  4. Before you can compile the GDAL Python bindings, you will need to set a few environment variables so that Python can find the version of GDAL you installed. To do this, type the following in a Terminal window:
    export CPLUS_INCLUDE_PATH=/Library/Frameworks/GDAL.framework/Headers
    export C_INCLUDE_PATH=/Library/Frameworks/GDAL.framework/Headers
    export LIBRARY_PATH=/Library/Frameworks/GDAL.framework/Versions/1.11/unix/lib
    

    Make sure the version of GDAL in the LIBRARY_PATH variable definition matches the version you installed.

  5. You can now install the Python bindings by typing the following:
    pip install gdal==1.11
    

    Once again, make sure that the version number matches the version of GDAL you installed.

All going well, the Python bindings for GDAL should compile successfully. You will see a few compiler warnings, but hopefully no errors.

Installing GDAL on MS Windows

Unfortunately, there is currently no prebuilt installer for GDAL that uses Python 3. You can either attempt to build it yourself, or you will have to use Python 2. You can download a binary installer for GDAL that includes Python 2 support from http://www.gisinternals.com.

Because you won't be able to use Python 3, you will need to adjust the code you write to match the syntax for Python 2. For the code samples for this chapter, the only difference is that you don't use parentheses after the print keyword.

Testing your GDAL installation

After installing GDAL, you can check whether it works by typing import osgeo into the Python command line; if the Python command prompt reappears with no error message, it means GDAL was successfully installed, and you are all set to go:

>>> import osgeo
>>>

Examining the Downloaded Shapefile

Let's now use GDAL to look at the contents of the shapefile you downloaded earlier. You can either type the following directly into the command prompt, or else save it as a Python script so that you can run it whenever you wish to (let's call this script analyze.py):

import osgeo.ogr

shapefile = osgeo.ogr.Open("tl_2014_us_state.shp")
numLayers = shapefile.GetLayerCount()

print("Shapefile contains {} layers".format(numLayers))
print()

for layerNum in range(numLayers):
  layer = shapefile.GetLayer(layerNum)
  spatialRef = layer.GetSpatialRef().ExportToProj4()
  numFeatures = layer.GetFeatureCount()
  print("Layer {} has spatial reference {}".format(
        layerNum, spatialRef))
  print("Layer {} has {} features:".format(
        layerNum, numFeatures))
  print()

  for featureNum in range(numFeatures):
    feature = layer.GetFeature(featureNum)
    featureName = feature.GetField("NAME")

    print("Feature {} has name {}".
          format(featureNum, featureName))

Tip

This example assumes you've placed this script in the same directory as the tl_2014_us_state.shp file. If you've put it in a different directory, change the osgeo.ogr.Open() command to include the path to your shapefile. If you are running this on MS Windows, don't forget to use double backslash characters (\\) as directory separators.

Running this script will give us a quick summary of how the shapefile's data is structured:

Shapefile contains 1 layers

Layer 0 has spatial reference +proj=longlat +datum=NAD83 +no_defs
Layer 0 has 56 features:

Feature 0 has name West Virginia
Feature 1 has name Florida
Feature 2 has name Illinois
Feature 3 has name Minnesota
...
Feature 53 has name District of Columbia
Feature 54 has name Iowa
Feature 55 has name Arizona

This shows us that the data we downloaded consists of one layer, with 56 individual features corresponding to the various states and protectorates in the USA. It also tells us the "spatial reference" for this layer, which tells us that the coordinates are represented as latitude and longitude values using the NAD83 datum.

As you can see from the example, using GDAL to extract data from shapefiles is quite straightforward. Let's continue with another example. This time, we'll look at the details of feature number 12, New Mexico. Let's call this program analyze2.py:

import osgeo.ogr

shapefile = osgeo.ogr.Open("tl_2014_us_state.shp")
layer = shapefile.GetLayer(0)
feature = layer.GetFeature(12)

print("Feature 12 has the following attributes:")
print()

attributes = feature.items()

for key,value in attributes.items():
  print("  {} = {}".format(key, value))

geometry = feature.GetGeometryRef()
geometryName = geometry.GetGeometryName()

print()
print("Feature's geometry data consists of a {}".format(
      geometryName))

Running this produces the following output:

Feature 12 has the following attributes:

 STUSPS = NM
 GEOID = 35
 REGION = 4
 ALAND = 314161410324.0
 MTFCC = G4000
 FUNCSTAT = A
 NAME = New Mexico
 STATEFP = 35
 LSAD = 00
 STATENS = 00897535
 AWATER = 755712514.0
 DIVISION = 8
 INTPTLON = -106.1316181
 INTPTLAT = +34.4346843

Feature's geometry data consists of a POLYGON

The meaning of the various attributes is described on the US Census Bureau's web site, but what interests us right now is the feature's geometry. A geometry object is a complex structure that holds some geospatial data, often using nested geometry objects to reflect the way the geospatial data is organized. So far, we've discovered that New Mexico's geometry consists of a polygon. Let's now take a closer look at this polygon, using a program we'll call analyze3.py:

import osgeo.ogr

def analyzeGeometry(geometry, indent=0):
  s = []
  s.append("  " * indent)
  s.append(geometry.GetGeometryName())
  if geometry.GetPointCount() > 0:
    s.append(" with {} data points".format(geometry.GetPointCount()))
  if geometry.GetGeometryCount() > 0:
    s.append(" containing:")

  print("".join(s))

  for i in range(geometry.GetGeometryCount()):
    analyzeGeometry(geometry.GetGeometryRef(i), indent+1)

shapefile = osgeo.ogr.Open("tl_2014_us_state.shp")
layer = shapefile.GetLayer(0)
feature = layer.GetFeature(12)
geometry = feature.GetGeometryRef()

analyzeGeometry(geometry)

The recursive analyzeGeometry() function reveals the structure for this geometry:

POLYGON containing:
 LINEARRING with 7554 data points

As you can see, this geometry has a single linear ring, defining the outline of New Mexico. If we ran the same program over California (feature 13 in our shapefile), the output would be somewhat more complicated:

MULTIPOLYGON containing:
 POLYGON containing:
 LINEARRING with 121 data points
 POLYGON containing:
 LINEARRING with 191 data points
 POLYGON containing:
 LINEARRING with 77 data points
 POLYGON containing:
 LINEARRING with 152 data points
 POLYGON containing:
 LINEARRING with 392 data points
 POLYGON containing:
 LINEARRING with 93 data points
 POLYGON containing:
 LINEARRING with 10105 data points

As you can see, California is made up of seven distinct polygons, each defined by a single linear ring. This is because California is on the coast and includes six outlying islands as well as the main inland body of the state.

Let's finish this analysis of the US state shapefile by answering a simple question: what is the distance from the northernmost point to the southernmost point in California? There are various ways we could answer this question, but for now, we'll do it by hand. Let's start by identifying the northernmost and southernmost points in California:

import osgeo.ogr

def findPoints(geometry, results):
  for i in range(geometry.GetPointCount()):
    x,y,z = geometry.GetPoint(i)
    if results['north'] == None or results['north'][1] < y:
      results['north'] = (x,y)
    if results['south'] == None or results['south'][1] > y:
      results['south'] = (x,y)

  for i in range(geometry.GetGeometryCount()):
    findPoints(geometry.GetGeometryRef(i), results)

shapefile = osgeo.ogr.Open("tl_2014_us_state.shp")
layer = shapefile.GetLayer(0)
feature = layer.GetFeature(13)
geometry = feature.GetGeometryRef()

results = {'north' : None,
           'south' : None}

findPoints(geometry, results)

print("Northernmost point is ({:.4f}, {:.4f})".format(
      results['north'][0], results['north'][1]))
print("Southernmost point is ({:.4f}, {:.4f})".format(
      results['south'][0], results['south'][1]))

The findPoints() function recursively scans through a geometry, extracting the individual points and identifying the points with the highest and lowest y (latitude) values, which are then stored in the results dictionary so that the main program can use these values.

As you can see, GDAL makes it easy to work with the complex geometry data structure. The code does require recursion, but it is still trivial compared with trying to read the data directly. If you run the above program, the following output will be displayed:

Northernmost point is (-122.3782, 42.0095)
Southernmost point is (-117.2049, 32.5288)

Now that we have these two points, we want to calculate the distance between them. As described earlier, we have to use a great-circle distance calculation here to allow for the curvature of the Earth's surface. We'll do this manually, using the haversine formula:

import math

lat1 = 42.0095
long1 = -122.3782

lat2 = 32.5288
long2 = -117.2049

rLat1 = math.radians(lat1)
rLong1 = math.radians(long1)
rLat2 = math.radians(lat2)
rLong2 = math.radians(long2)

dLat = rLat2 - rLat1
dLong = rLong2 - rLong1
a = math.sin(dLat/2)**2 + math.cos(rLat1) * math.cos(rLat2) \
                        * math.sin(dLong/2)**2
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
distance = 6371 * c

print("Great circle distance is {:0.0f} kilometers".format(
      distance))

Don't worry about the complex maths involved here; basically, we are converting the latitude and longitude values to radians, calculating the difference in latitude/longitude values between the two points, and then passing the results through some trigonometric functions to obtain the great-circle distance. The value of 6371 is the radius of the Earth, in kilometers.

More details about the haversine formula can be found at http://mathforum.org/library/drmath/view/51879.html.

If you run the program, your computer will tell you the distance from the northernmost to the southernmost point in California:

Great circle distance is 1149 kilometers

There are, of course, other ways of calculating this. You wouldn't normally type the haversine formula directly into your program, as there are libraries that will do this for you. But we deliberately did the calculation this way to show how it can be done.

If you would like to explore this further, you might like to try writing programs to calculate the following:

  • The easternmost and westernmost points in California
  • The midpoint of California

    Tip

    Hint: You can calculate the midpoint's longitude by taking the average of the easternmost and westernmost longitudes.

  • The midpoint of Arizona
  • The distance between the middle of California and the middle of Arizona

As you can see, working with GIS data manually isn't too onerous. While the data structures and maths involved can be rather complex, and you do have to be aware of the underlying GIS concepts, using tools such as GDAL makes your data accessible and easy to work with.