Reading and writing geospatial data

While you could, in theory, write your own parser to read a particular geospatial data format, it is much easier to use an existing Python library to do this. We will look at two popular libraries for reading and writing geospatial data: GDAL and OGR.

GDAL/OGR

Unfortunately, the naming of these two libraries is rather confusing. GDAL (short for Geospatial Data Abstraction Library), was originally just a library for working with raster-format geospatial data, while the separate OGR library was intended to work with vector-format data. However, the two libraries are now partially merged and generally downloaded and installed together under the combined name of GDAL. To avoid confusion, we will call this combined library GDAL/OGR and use GDAL to refer to just the raster translation library.

A default installation of GDAL allows you to read data in 100 different raster file formats, and write data in 71 different formats. OGR, by default, supports reading data in 42 different vector file formats and writing data in 39 different formats. This makes GDAL/OGR one of the most powerful geospatial data translators available, and certainly the most useful freely-available library for reading and writing geospatial data.

Installing GDAL/OGR

We saw how to install GDAL/OGR in the Working with GIS data manually section in the previous chapter. Refer to the instructions in that section for more information on installing GDAL and OGR onto your computer.

Understanding GDAL

To read and write raster-format data, the GDAL library makes use of a dataset object. The dataset object, an instance of gdal.Dataset, represents a file containing raster-format data.

Each dataset is broken down into multiple bands, where each band contains a single piece of raster data. For example, a raster dataset representing a basemap image might consist of three bands, representing the red, green, and blue color components of the overall image:

Understanding GDAL

Of course, bands aren't limited to storing RGB color values; they can hold any information you like. For example, one band might store elevation, a second band might store soil moisture levels, and a third band might hold a code indicating the type of vegetation found at that location. Each band within a raster dataset is represented by a gdal.Band object.

For each band, the raster data consists of a number of cells arranged into rows and columns:

Understanding GDAL

Each cell is then georeferenced onto a portion of the Earth's surface:

Understanding GDAL

Within a band, each cell consists of a simple integer or floating-point value.

While raster datasets cover a rectangular area of the Earth's surface, there may not be a value for every cell. For example, a dataset that contains information for a particular country will not have values for the cells that lie beyond that country's borders, as shown in the following figure:

Understanding GDAL

Each X in this figure represents a cell that doesn't have a value. To allow this, GDAL supports the concept of a no data value. This is a special value which is stored in the cell when the cell doesn't have an actual value associated with it.

Because each raster band typically has thousands (or even millions) of cells, you don't normally access the cells one at a time. Instead, you can use one of two possible techniques to access the band's data en masse:

  • You can use the band.ReadRaster() method to read some or all of the band's data into a binary string and then use the built-in struct library to convert that string into an array of values
  • You can use the band.ReadArray() method to read some of the band's data, storing the cell values directly into a NumPy array object

Both approaches have their advantages and disadvantages. Using struct allows you to access cell values as integers or floating-point values directly, but you can only easily extract one row of data at a time. Using the NumPy array approach makes it easier to read multiple rows of data at once but requires you to install the third-party NumPy library. NumPy also has the disadvantage that it stores each value as a custom NumPy object. For example, 16-bit integer values are stored as numpy.uint16 objects; NumPy automatically converts these to integers as required, but doing so can slow your program down, so you may need to do this conversion yourself.

Note that the gdal.Band object has corresponding WriterRaster() and WriteArray() methods, allowing you to use either approach for writing raster data as well as reading it.

GDAL example code

Let's take a look at how you can use GDAL to write raster-format data into a GeoTIFF formatted file. For our example, we'll divide the entire Earth's surface into 360 cells horizontally and 180 cells vertically, so that each cell covers one degree of latitude and longitude. For each cell, we'll store a single random number between 1 and 100.

We'll start by creating the file itself. To do this, we first ask GDAL to select a driver that can create the type of file we want, and then ask the driver to create the file:

from osgeo import gdal
driver = gdal.GetDriverByName("GTIFF")
dstFile = driver.Create("Example Raster.tiff", 360, 180, 1,
                        gdal.GDT_INT16)

Note

While this program is presented in sections so that that each step can be explained, the complete source code for the program is included in the example programs that can be downloaded for this chapter. Look for a file named writeRaster.py.

The gdal.Driver.Create() method takes the following parameters:

  • The name of the raster file.
  • The desired number of cells across.
  • The desired number of cells down.
  • The desired number of bands in the file.
  • A constant defining the type of information to store in each cell. In our case, each cell will hold a 16-bit integer value.

Our next task is to set the projection that will be used to position our cells onto the surface of the earth. In our case, we will use WGS84, allowing us to refer to each cell's position using latitude and longitude values:

from gdal import osr
spatialReference = osr.SpatialReference()
spatialReference.SetWellKnownGeogCS("WGS84")
dstFile.SetProjection(spatialReference.ExportToWkt())

We next have to define the georeferencing transform to use for our data. This transform tells GDAL how to map each cell onto the surface of the Earth. The georeferencing transform is defined as a list of six numbers, known as an affine transformation matrix. Fortunately, we can ignore the mathematics of affine transformations and define our matrix using just four values:

  • The X and Y position of the top-left corner of the top-left cell
  • The width of each cell, measured in degrees of longitude
  • The height of each cell, measured in degrees of latitude

Here is the Python code to set our file's georeferencing transform:

originX    = -180
originY    = 90
cellWidth  = 1.0
cellHeight = 1.0

dstFile.SetGeoTransform([originX, cellWidth, 0,
                         originY, 0, -cellHeight])

While we're at it, let's get a reference to the gdal.Band object we'll need to store data into our file's one and only raster band:

band = dstFile.GetRasterBand(1)

Now that we have set up our raster-format file, let's create some data to store into it. We'll use the built-in random module to create a list-of-lists holding one random number for each cell:

import random

values = []
for row in range(180):
    row_data = []
    for col in range(360):
        row_data.append(random.randint(1, 100))
    values.append(row_data)

To save these values to the file using the built-in struct module, we have to write them to disk one row at a time. Here's the code for doing this:

import struct

fmt = "<" + ("h" * band.XSize)

for row in range(180):
    scanline = struct.pack(fmt, *values[row])
    band.WriteRaster(0, row, 360, 1, scanline)

Alternatively, if you have NumPy installed, you can write the values array directly to the file as a NumPy array:

import numpy

array = numpy.array(values, dtype=numpy.int16)
band.WriteArray(array)

This completes our program. If you run it, you will create a file named Example Raster.tiff, which contains 360 x 180 random numbers. Let's now write a separate program that can read the contents of this file.

Fortunately, reading raster data is more straightforward than writing it. We'll start by opening the dataset and getting a reference to our (one and only) raster band:

from osgeo import gdal

srcFile = gdal.Open("Example Raster.tiff")
band = srcFile.GetRasterBand(1)

Note

This program is called readRaster.py and is included in the example programs that can be downloaded for this chapter.

We're now ready to read the contents of the file. Here is how we can do this using the built-in struct module:

import struct

fmt = "<" + ("h" * band.XSize)

for row in range(band.YSize):
    scanline = band.ReadRaster(0, row, band.XSize, 1,
                               band.XSize, 1,
                               band.DataType)
    row_data = struct.unpack(fmt, scanline)
    print(row_data)

Note

Don't worry about all the parameters to the ReadRaster method; because this method can scale the data as it is being read, we have to supply the number of cells across and down twice, as well as the starting cell position. You can look up the definition of this method in the GDAL documentation if you want to know the details.

As you can see, we're simply printing out the returned raster values so that you can see them. Let's do the same thing again, this time using NumPy:

values = band.ReadAsArray()
for row in range(band.XSize):
    print(values[row])

Note

Note that the ReadAsArray() method is only available if you have NumPy installed.

As you can see, reading raster data using GDAL is quite straightforward once you get past the complexity of converting the raw binary data back into Python using either struct or NumPy.

Understanding OGR

As we have seen, OGR is used to read and write vector-format geospatial data. OGR uses the term data source to represent a file or other source of vector-format data. A data source consists of one or more layers, where each layer represents a set of related information. While many data sources consist of only one layer, other data sources may be more complex. For example, a single complex data source might store county outlines, roads, city boundaries, and the location of hospitals, each as a separate layer within the data source.

Each layer is made up of a list of features. A feature represents a single item that is being described. For example, for a "cities" layer, each feature would represent a single city.

For each feature, the layer has a geometry , which is the actual piece of geospatial data representing the feature's shape or position in space, along with a list of attributes that provide additional meta-information about the feature.

Finally, in addition to the features themselves, each layer has a spatial reference that identifies the projection and datum used by the layer's features.

The following illustration shows how all these elements relate to each other:

Understanding OGR

While this may seem like quite a complex design, the versatility of this approach allows OGR to work with all sorts of different vector data formats. For example, the shapefile format does not support multiple layers, but OGR uses a consistent model that represents a shapefile as a data source consisting of only one layer.

OGR example code

Since you learned how to read from a shapefile using OGR in the previous chapter, let's now take a closer look at how you can write vector-format data using OGR. We'll write a simple Python program that generates a set of random Point geometries and stores them, along with some attribute values, in a shapefile.

Let's start by importing the required modules and creating a directory to hold the generated shapefile:

import os, os.path, shutil, random
from osgeo import ogr, osr

if os.path.exists("test-shapefile"):
    shutil.rmtree("test-shapefile")
os.mkdir("test-shapefile")

Note

The complete source code for this program is included in the example files for this chapter, in the file named writeVector.py.

Notice that we delete our shapefile's directory if it already exists. This allows us to run the program multiple times without having to remember to delete the shapefile each time.

We can now set up our shapefile as an OGR data source. As with GDAL, we first select a driver for the type of data source we want to create and then ask the driver to create the data source. Here is the relevant code:

driver = ogr.GetDriverByName("ESRI Shapefile")
path = os.path.join("test-shapefile", "shapefile.shp")
datasource = driver.CreateDataSource(path)

Next, we want to define our layer. To do this, we first have to define the spatial reference we will use for our data:

spatialReference = osr.SpatialReference()
spatialReference.SetWellKnownGeogCS('WGS84')

We can then create the layer itself:

layer = datasource.CreateLayer("layer", spatialReference)

Since we want to store attribute information into our file, we next need to tell OGR what those attributes will be and what type of data to store. This information is added directly to the map layer using the ogr.FieldDefn class:

field = ogr.FieldDefn("ID", ogr.OFTInteger)
field.SetWidth(4)
layer.CreateField(field)

field = ogr.FieldDefn("NAME", ogr.OFTString)
field.SetWidth(20)
layer.CreateField(field)

Now that we've created our shapefile, let's store some information in it. We're going to generate 100 Point geometries and calculate an ID and name to associate with each feature. We'll use the built-in random module to do this:

for i in range(100):
    id   = 1000 + i
    lat  = random.uniform(-90, +90)
    long = random.uniform(-180, +180)
    name = "point-{}".format(i)

    wkt = "POINT({} {})".format(long, lat)
    geometry = ogr.CreateGeometryFromWkt(wkt)

As you can see, we define the Point geometry as a WKT formatted string, which we then convert into an ogr.Geometry object. We next need to create an ogr.Feature object for each of our points and store the geometry and attribute values into the feature:

    feature = ogr.Feature(layer.GetLayerDefn())
    feature.SetGeometry(geometry)
    feature.SetField("ID", id)
    feature.SetField("NAME", name)

Finally, we have to tell the layer to create the feature and store it into the file:

    layer.CreateFeature(feature)

This completes our program. If you run it, a directory named test-shapefile will be created, containing the various files that hold the shapefile's data.

To read through the contents of our generated shapefile, we use the same technique we covered in the previous chapter. Here is a simple program which will load the various Point geometries back into memory and display the coordinates as well as the attributes we stored earlier:

from osgeo import ogr

shapefile = ogr.Open("test-shapefile/shapefile.shp")
layer = shapefile.GetLayer(0)

for i in range(layer.GetFeatureCount()):
    feature  = layer.GetFeature(i)
    id       = feature.GetField("ID")
    name     = feature.GetField("NAME")
    geometry = feature.GetGeometryRef()
    print(i, name, geometry.GetX(), geometry.GetY())

Note

This program is available in the example code under the name readVector.py.

As you can see, using OGR to work with vector data sources is not difficult at all.

GDAL/OGR documentation

GDAL and OGR are well documented, but there is a catch for Python programmers. The GDAL/OGR library and associated command-line tools are all written in C and C++. Bindings are available that allow access from a variety of other languages, including Python, but the documentation is all written for the C++ version of the libraries. This can make reading the documentation rather challenging—not only are all the method signatures written in C++, but the Python bindings have changed many of the method and class names to make them more "Pythonic".

Fortunately, the Python libraries are largely self-documenting, thanks to all the docstrings embedded in the Python bindings themselves. This means you can explore the documentation using tools such as Python's built-in pydoc utility, which can be run from the command line like this:

pydoc -g osgeo

This will open up a GUI window allowing you to read the documentation using a web browser. Alternatively, if you want to find out about a single method or class, you can use Python's built-in help() command from the Python command line, like this:

>>> from osgeo import ogr
>>> help(ogr.DataSource.CopyLayer)

Not all the methods are documented, so you may need to refer to the C++ docs on the GDAL web site for more information, and some of the docstrings that are there are copied directly from the C++ documentation—but the documentation for GDAL/OGR is excellent in general and should allow you to quickly come up to speed with using this library.