- Python Geospatial Development(Third Edition)
- Erik Westra
- 1314字
- 2025-02-17 16:51:50
Reading and writing geospatial data
While you could, in theory, write your own parser to read a particular geospatial data format, it is much easier to use an existing Python library to do this. We will look at two popular libraries for reading and writing geospatial data: GDAL and OGR.
GDAL/OGR
Unfortunately, the naming of these two libraries is rather confusing. GDAL (short for Geospatial Data Abstraction Library), was originally just a library for working with raster-format geospatial data, while the separate OGR library was intended to work with vector-format data. However, the two libraries are now partially merged and generally downloaded and installed together under the combined name of GDAL. To avoid confusion, we will call this combined library GDAL/OGR and use GDAL to refer to just the raster translation library.
A default installation of GDAL allows you to read data in 100 different raster file formats, and write data in 71 different formats. OGR, by default, supports reading data in 42 different vector file formats and writing data in 39 different formats. This makes GDAL/OGR one of the most powerful geospatial data translators available, and certainly the most useful freely-available library for reading and writing geospatial data.
Installing GDAL/OGR
We saw how to install GDAL/OGR in the Working with GIS data manually section in the previous chapter. Refer to the instructions in that section for more information on installing GDAL and OGR onto your computer.
Understanding GDAL
To read and write raster-format data, the GDAL library makes use of a dataset object. The dataset object, an instance of gdal.Dataset
, represents a file containing raster-format data.
Each dataset is broken down into multiple bands, where each band contains a single piece of raster data. For example, a raster dataset representing a basemap image might consist of three bands, representing the red, green, and blue color components of the overall image:
data:image/s3,"s3://crabby-images/4409d/4409dc2d516bd418b850ef0a5879ee3815857083" alt="Understanding GDAL"
Of course, bands aren't limited to storing RGB color values; they can hold any information you like. For example, one band might store elevation, a second band might store soil moisture levels, and a third band might hold a code indicating the type of vegetation found at that location. Each band within a raster dataset is represented by a gdal.Band
object.
For each band, the raster data consists of a number of cells arranged into rows and columns:
data:image/s3,"s3://crabby-images/e1c15/e1c153cd5065ee03bff40c7b8aaf9d8e8ea4b7dc" alt="Understanding GDAL"
Each cell is then georeferenced onto a portion of the Earth's surface:
data:image/s3,"s3://crabby-images/1e941/1e941c0b03020b89951bbcb9064420ee4eeb221f" alt="Understanding GDAL"
Within a band, each cell consists of a simple integer or floating-point value.
While raster datasets cover a rectangular area of the Earth's surface, there may not be a value for every cell. For example, a dataset that contains information for a particular country will not have values for the cells that lie beyond that country's borders, as shown in the following figure:
data:image/s3,"s3://crabby-images/99095/9909581284b4998b602cbf1b5190c2054392b3a2" alt="Understanding GDAL"
Each X in this figure represents a cell that doesn't have a value. To allow this, GDAL supports the concept of a no data value. This is a special value which is stored in the cell when the cell doesn't have an actual value associated with it.
Because each raster band typically has thousands (or even millions) of cells, you don't normally access the cells one at a time. Instead, you can use one of two possible techniques to access the band's data en masse:
- You can use the
band.ReadRaster()
method to read some or all of the band's data into a binary string and then use the built-instruct
library to convert that string into an array of values - You can use the
band.ReadArray()
method to read some of the band's data, storing the cell values directly into aNumPy
array object
Both approaches have their advantages and disadvantages. Using struct
allows you to access cell values as integers or floating-point values directly, but you can only easily extract one row of data at a time. Using the NumPy array approach makes it easier to read multiple rows of data at once but requires you to install the third-party NumPy library. NumPy also has the disadvantage that it stores each value as a custom NumPy object. For example, 16-bit integer values are stored as numpy.uint16
objects; NumPy automatically converts these to integers as required, but doing so can slow your program down, so you may need to do this conversion yourself.
Note that the gdal.Band
object has corresponding WriterRaster()
and WriteArray()
methods, allowing you to use either approach for writing raster data as well as reading it.
GDAL example code
Let's take a look at how you can use GDAL to write raster-format data into a GeoTIFF formatted file. For our example, we'll divide the entire Earth's surface into 360 cells horizontally and 180 cells vertically, so that each cell covers one degree of latitude and longitude. For each cell, we'll store a single random number between 1 and 100.
We'll start by creating the file itself. To do this, we first ask GDAL to select a driver that can create the type of file we want, and then ask the driver to create the file:
from osgeo import gdal driver = gdal.GetDriverByName("GTIFF") dstFile = driver.Create("Example Raster.tiff", 360, 180, 1, gdal.GDT_INT16)
The gdal.Driver.Create()
method takes the following parameters:
- The name of the raster file.
- The desired number of cells across.
- The desired number of cells down.
- The desired number of bands in the file.
- A constant defining the type of information to store in each cell. In our case, each cell will hold a 16-bit integer value.
Our next task is to set the projection that will be used to position our cells onto the surface of the earth. In our case, we will use WGS84, allowing us to refer to each cell's position using latitude and longitude values:
from gdal import osr spatialReference = osr.SpatialReference() spatialReference.SetWellKnownGeogCS("WGS84") dstFile.SetProjection(spatialReference.ExportToWkt())
We next have to define the georeferencing transform to use for our data. This transform tells GDAL how to map each cell onto the surface of the Earth. The georeferencing transform is defined as a list of six numbers, known as an affine transformation matrix. Fortunately, we can ignore the mathematics of affine transformations and define our matrix using just four values:
- The X and Y position of the top-left corner of the top-left cell
- The width of each cell, measured in degrees of longitude
- The height of each cell, measured in degrees of latitude
Here is the Python code to set our file's georeferencing transform:
originX = -180 originY = 90 cellWidth = 1.0 cellHeight = 1.0 dstFile.SetGeoTransform([originX, cellWidth, 0, originY, 0, -cellHeight])
While we're at it, let's get a reference to the gdal.Band
object we'll need to store data into our file's one and only raster band:
band = dstFile.GetRasterBand(1)
Now that we have set up our raster-format file, let's create some data to store into it. We'll use the built-in random
module to create a list-of-lists holding one random number for each cell:
import random values = [] for row in range(180): row_data = [] for col in range(360): row_data.append(random.randint(1, 100)) values.append(row_data)
To save these values to the file using the built-in struct
module, we have to write them to disk one row at a time. Here's the code for doing this:
import struct fmt = "<" + ("h" * band.XSize) for row in range(180): scanline = struct.pack(fmt, *values[row]) band.WriteRaster(0, row, 360, 1, scanline)
Alternatively, if you have NumPy installed, you can write the values
array directly to the file as a NumPy array:
import numpy array = numpy.array(values, dtype=numpy.int16) band.WriteArray(array)
This completes our program. If you run it, you will create a file named Example Raster.tiff
, which contains 360 x 180 random numbers. Let's now write a separate program that can read the contents of this file.
Fortunately, reading raster data is more straightforward than writing it. We'll start by opening the dataset and getting a reference to our (one and only) raster band:
from osgeo import gdal srcFile = gdal.Open("Example Raster.tiff") band = srcFile.GetRasterBand(1)
We're now ready to read the contents of the file. Here is how we can do this using the built-in struct
module:
import struct fmt = "<" + ("h" * band.XSize) for row in range(band.YSize): scanline = band.ReadRaster(0, row, band.XSize, 1, band.XSize, 1, band.DataType) row_data = struct.unpack(fmt, scanline) print(row_data)
Note
Don't worry about all the parameters to the ReadRaster
method; because this method can scale the data as it is being read, we have to supply the number of cells across and down twice, as well as the starting cell position. You can look up the definition of this method in the GDAL documentation if you want to know the details.
As you can see, we're simply printing out the returned raster values so that you can see them. Let's do the same thing again, this time using NumPy:
values = band.ReadAsArray() for row in range(band.XSize): print(values[row])
As you can see, reading raster data using GDAL is quite straightforward once you get past the complexity of converting the raw binary data back into Python using either struct
or NumPy.
Understanding OGR
As we have seen, OGR is used to read and write vector-format geospatial data. OGR uses the term data source to represent a file or other source of vector-format data. A data source consists of one or more layers, where each layer represents a set of related information. While many data sources consist of only one layer, other data sources may be more complex. For example, a single complex data source might store county outlines, roads, city boundaries, and the location of hospitals, each as a separate layer within the data source.
Each layer is made up of a list of features. A feature represents a single item that is being described. For example, for a "cities" layer, each feature would represent a single city.
For each feature, the layer has a geometry , which is the actual piece of geospatial data representing the feature's shape or position in space, along with a list of attributes that provide additional meta-information about the feature.
Finally, in addition to the features themselves, each layer has a spatial reference that identifies the projection and datum used by the layer's features.
The following illustration shows how all these elements relate to each other:
data:image/s3,"s3://crabby-images/267cb/267cbaeda8899ea6d22a72f3799d0d7d7b4dbe73" alt="Understanding OGR"
While this may seem like quite a complex design, the versatility of this approach allows OGR to work with all sorts of different vector data formats. For example, the shapefile format does not support multiple layers, but OGR uses a consistent model that represents a shapefile as a data source consisting of only one layer.
OGR example code
Since you learned how to read from a shapefile using OGR in the previous chapter, let's now take a closer look at how you can write vector-format data using OGR. We'll write a simple Python program that generates a set of random Point geometries and stores them, along with some attribute values, in a shapefile.
Let's start by importing the required modules and creating a directory to hold the generated shapefile:
import os, os.path, shutil, random from osgeo import ogr, osr if os.path.exists("test-shapefile"): shutil.rmtree("test-shapefile") os.mkdir("test-shapefile")
Notice that we delete our shapefile's directory if it already exists. This allows us to run the program multiple times without having to remember to delete the shapefile each time.
We can now set up our shapefile as an OGR data source. As with GDAL, we first select a driver for the type of data source we want to create and then ask the driver to create the data source. Here is the relevant code:
driver = ogr.GetDriverByName("ESRI Shapefile") path = os.path.join("test-shapefile", "shapefile.shp") datasource = driver.CreateDataSource(path)
Next, we want to define our layer. To do this, we first have to define the spatial reference we will use for our data:
spatialReference = osr.SpatialReference() spatialReference.SetWellKnownGeogCS('WGS84')
We can then create the layer itself:
layer = datasource.CreateLayer("layer", spatialReference)
Since we want to store attribute information into our file, we next need to tell OGR what those attributes will be and what type of data to store. This information is added directly to the map layer using the ogr.FieldDefn
class:
field = ogr.FieldDefn("ID", ogr.OFTInteger) field.SetWidth(4) layer.CreateField(field) field = ogr.FieldDefn("NAME", ogr.OFTString) field.SetWidth(20) layer.CreateField(field)
Now that we've created our shapefile, let's store some information in it. We're going to generate 100 Point geometries and calculate an ID and name to associate with each feature. We'll use the built-in random
module to do this:
for i in range(100): id = 1000 + i lat = random.uniform(-90, +90) long = random.uniform(-180, +180) name = "point-{}".format(i) wkt = "POINT({} {})".format(long, lat) geometry = ogr.CreateGeometryFromWkt(wkt)
As you can see, we define the Point geometry as a WKT formatted string, which we then convert into an ogr.Geometry
object. We next need to create an ogr.Feature
object for each of our points and store the geometry and attribute values into the feature:
feature = ogr.Feature(layer.GetLayerDefn()) feature.SetGeometry(geometry) feature.SetField("ID", id) feature.SetField("NAME", name)
Finally, we have to tell the layer to create the feature and store it into the file:
layer.CreateFeature(feature)
This completes our program. If you run it, a directory named test-shapefile
will be created, containing the various files that hold the shapefile's data.
To read through the contents of our generated shapefile, we use the same technique we covered in the previous chapter. Here is a simple program which will load the various Point geometries back into memory and display the coordinates as well as the attributes we stored earlier:
from osgeo import ogr shapefile = ogr.Open("test-shapefile/shapefile.shp") layer = shapefile.GetLayer(0) for i in range(layer.GetFeatureCount()): feature = layer.GetFeature(i) id = feature.GetField("ID") name = feature.GetField("NAME") geometry = feature.GetGeometryRef() print(i, name, geometry.GetX(), geometry.GetY())
As you can see, using OGR to work with vector data sources is not difficult at all.
GDAL/OGR documentation
GDAL and OGR are well documented, but there is a catch for Python programmers. The GDAL/OGR library and associated command-line tools are all written in C and C++. Bindings are available that allow access from a variety of other languages, including Python, but the documentation is all written for the C++ version of the libraries. This can make reading the documentation rather challenging—not only are all the method signatures written in C++, but the Python bindings have changed many of the method and class names to make them more "Pythonic".
Fortunately, the Python libraries are largely self-documenting, thanks to all the docstrings embedded in the Python bindings themselves. This means you can explore the documentation using tools such as Python's built-in pydoc
utility, which can be run from the command line like this:
pydoc -g osgeo
This will open up a GUI window allowing you to read the documentation using a web browser. Alternatively, if you want to find out about a single method or class, you can use Python's built-in help()
command from the Python command line, like this:
>>> from osgeo import ogr >>> help(ogr.DataSource.CopyLayer)
Not all the methods are documented, so you may need to refer to the C++ docs on the GDAL web site for more information, and some of the docstrings that are there are copied directly from the C++ documentation—but the documentation for GDAL/OGR is excellent in general and should allow you to quickly come up to speed with using this library.