Up-to-date Region of OpenStreetMap Data

One of the things about OpenStreetMap that I find fascinating is the prospect of maintaining a routinely updated copy of the map data in your own environment. Below I’ve documented the steps I took to use readily available tools to prepare an updatable database of OSM’s San Francisco data. These steps are inspired by the book “OpenStreetMap: Using and Enhancing the Free Map of the World” by Frederik Ramm, Jochen Topf, and Steve Chilton, by the wiki pages for osmosis and other tools, and by a bit of mailing list correspondence. I work on an Ubuntu 12.04 machine, typically at the command line. You may have to adapt these steps to work for you.

  1. Install dependencies.
    sudo apt-get install postgresql-9.1 postgresql-9.1-postgis postgresql-contrib
  2. Pick a working directory. I use ~/osmsf in this example.
    mkdir ~/osmsf
    cd ~/osmsf
  3. Install osmosis, the OSM Swiss Army knife. Here are the steps to install the latest stable release:
    cd ~/osmsf
    wget http://bretth.dev.openstreetmap.org/osmosis-build/osmosis-latest.tgz
    tar xvf osmosis-latest.tgz
    cp -ra bin lib ~

    For this last bit to do the right thing, you’ll need ~/bin in your $PATH. Now check that you can successfully run osmosis:

    osmosis --help

    You may also wish to configure osmosis to run with more memory by default. I set mine to use 16GB.

    echo 'JAVACMD_OPTIONS="-Xmx16G"' > ~/.osmosis
  4. Grab some OSM data. In order to keep download and processing time manageable, I start with the Geofabrik California extract.
    cd ~/osmsf
    wget http://download.geofabrik.de/openstreetmap/north-america/us/california.osm.pbf
    

    Note that I’ve pulled the protocol-buffer binary format, which is more compact than the bzipped XML. You should also poke around the directory you downloaded the file from and note the file’s timestamp. For me, the California extract was created at 26-Feb-2013 01:03. This will be important later.
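
    If you’d rather check from the shell, the server’s Last-Modified header normally carries the same timestamp (wget’s -S prints the response headers, and --spider skips the download):

    wget -S --spider http://download.geofabrik.de/openstreetmap/north-america/us/california.osm.pbf 2>&1 | grep -i Last-Modified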

  5. Set up postgres, if you haven’t already. I usually create a user for myself that has ample rights. Later, if you want an automated process, a web application, or anything else to access the database with more restrictive permissions, you can add a suitable user.

    sudo su postgres
    createuser -P -r -d $USER    # if $USER expands to postgres inside the su session, type your own username instead
    

    I also create a postgis template DB so that I can create postgis databases as my preferred user.

    createdb -E UTF8 template_postgis
    psql -d postgres -c "UPDATE pg_database SET datistemplate='true' WHERE datname='template_postgis';"
    psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-1.5/postgis.sql
    psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-1.5/spatial_ref_sys.sql
    psql -d template_postgis -c "CREATE EXTENSION hstore;"
    psql -d template_postgis -c "GRANT ALL ON geometry_columns TO PUBLIC;" 
    psql -d template_postgis -c "GRANT ALL ON geography_columns TO PUBLIC;" 
    psql -d template_postgis -c "GRANT ALL ON spatial_ref_sys TO PUBLIC;" 
    exit
    

    Thanks to Miguel Araujo for this tip.

  6. Create a suitable database. The pgsnapshot schema scripts below ship with osmosis, in the script directory we unpacked in step 3:
    createdb -T template_postgis osmsf
    psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6.sql
    psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6_action.sql
    psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6_bbox.sql
    psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6_linestring.sql
    psql -d osmsf -f ~/osmsf/script/pgsnapshot_load_0.6.sql
    

    You might as well create a database auth file for osmosis now too. It’s just a text file with the following format:

    host=localhost
    database=osmsf
    user=yourusername
    password=yourpassword
    dbType=postgresql
    

    I call my file osmsf.auth. You can also supply these options on the osmosis command line if you prefer. And as always, ponder the implications of passwords floating around your computer in plain text.
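
    At minimum, you may want to make the file readable by your user alone:

    chmod 600 osmsf.auth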

  7. Fetch the ogr2poly utility, which we will use to create poly files from shapefiles.
    wget http://svn.openstreetmap.org/applications/utils/osm-extract/polygons/ogr2poly.py -O ~/bin/ogr2poly.py
    chmod +x ~/bin/ogr2poly.py
    

    Again, for this to work, ~/bin must be in your path.

  8. Create a poly file to cut out the region of interest. Note that I pull the San Francisco County file from the Census Bureau website. If you are adapting these instructions to a different region, you can poke around there for suitable shapefiles. Note that these are administrative boundaries, not cartographic boundaries, so in the particular case of San Francisco, the sole polygon in this file ends up including some water.
    wget http://www2.census.gov/geo/tiger/TIGER2009/06_CALIFORNIA/06075_San_Francisco_County/tl_2009_06075_cousub.zip
    unzip tl_2009_06075_cousub.zip
    ogr2poly.py tl_2009_06075_cousub.shp
    mv tl_2009_06075_cousub_0.poly sf.poly
    
  9. Write the San Francisco data to the postgres database:
    osmosis --rb california.osm.pbf --bp file=sf.poly --wp authFile=osmsf.auth
    

    If you’re happy with a static snapshot, you’re done. You can go and fire up your favorite postgres client, or maybe poke around using QGIS.
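
    For a quick sanity check, you can count rows in a couple of the tables created by the pgsnapshot schema scripts from step 6:

    psql -d osmsf -c "SELECT COUNT(*) FROM nodes;"
    psql -d osmsf -c "SELECT COUNT(*) FROM ways;"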

  10. Create a configuration file for the replication update procedure.
    cd ~/osmsf
    osmosis --rrii
    

    That --rrii stands for “read replication interval init”. This command creates a file called configuration.txt in your current working directory, which tells osmosis the URL from which updates come. You can also set the maximum number of seconds of OSM data that should be fetched per run; the comments in the autogenerated config file explain this. Here are the contents of the configuration file I prepared:

    baseUrl=http://planet.openstreetmap.org/replication/minute/
    maxInterval = 120
    
  11. Fetch the state file that osmosis needs in order to know the state of the extract you started from. Remember earlier when I told you to note the timestamp on the extract file? This is where it comes in handy. Go to the replicate-sequences tool and plug in the timestamp value. Paste the text that you get back into a file called state.txt in your ~/osmsf directory.
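
    If you’d rather fetch it from the shell, something like the following may work. I’m assuming here that the replicate-sequences tool lives at the URL below and takes the timestamp as a query string; double-check both against the tool’s current documentation before relying on it.

    wget -O ~/osmsf/state.txt "http://replicate-sequences.osm.mazdermind.de/?2013-02-26T01:03:00Z"
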
  12. Fetch and apply updates:
    osmosis --rri --simc --wpc authFile=osmsf.auth
    

    Note the --simc, which stands for --simplify-change. This option condenses all of the changes to a particular entity into one change; without it, you’ll get errors regarding duplicate keys. Also note that the state.txt file will be updated by this command, so that subsequent invocations will fetch the next minutely diffs. Finally, this command applies all of the changes in the diff, even though we only want the ones that fall within San Francisco, so we’ll have to trim the DB back down to the desired region. The strategy is to create a change set containing all of the entities outside of our polygon of interest, then subtract these changes from the database.

  13. Prepare an empty osm file. I call mine empty.osm; here are its contents:
    <?xml version="1.0" encoding="UTF-8"?>
    <osm version="0.6" generator="CGImap 0.0.2">
    </osm>
    
  14. Now prepare an inverted version of the sf.poly polygon file. To do this, make a copy of sf.poly called sf-inv.poly and open it in a text editor. Insert a polygon covering the entire earth at the top of the file, then mark each of the existing polygons as subtractive (prefix its id with !) and renumber them. The modified file looks like this:
    tl_2009_06075_cousub_0
    1
       -1.800000E+02   1.800000E+02
        1.800000E+02   1.800000E+02
        1.800000E+02  -1.800000E+02
       -1.800000E+02  -1.800000E+02
    END
    !2
       -1.230011E+02   3.777205E+01
       -1.229975E+02   3.777078E+01
    ...
    

    The resulting file describes a polygon covering the entire planet except for San Francisco. You may wish to go study the poly file format if any of this is unclear.
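
    If you’d rather not do the inversion by hand, here is a rough awk sketch of the same transformation. It assumes, as holds for ogr2poly output, that every ring header is the line immediately following the name line or an END line:

    awk 'NR == 1 {
           print                  # keep the name line
           print "1"              # new first ring: the whole planet
           print "   -1.800000E+02   1.800000E+02"
           print "    1.800000E+02   1.800000E+02"
           print "    1.800000E+02  -1.800000E+02"
           print "   -1.800000E+02  -1.800000E+02"
           print "END"
           n = 2; header = 1; next
         }
         $1 == "END" { print; header = 1; next }
         header      { print "!" n++; header = 0; next }
                     { print }' sf.poly > sf-inv.poly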

  15. Now you can trim the database so that it contains only the San Francisco data:
    osmosis --rx empty.osm --rp authFile=osmsf.auth --dd --bp file=sf-inv.poly --dc --wpc authFile=osmsf.auth
    

    Now some explanation. The --rx reads empty.osm and turns it into an entity stream. The --rp reads the database as a dataset, then the --dd turns the dataset into an entity stream. So now we have two input entity streams. The --bp (for “bounding polygon”) cuts anything not within sf-inv.poly out of the entity stream coming from the database. So now we have two entity streams: an empty one, and one with all of the entities from the database that lie outside of San Francisco. The --dc (for “derive change”) creates a single change stream from these two entity streams. Finally, --wpc applies the changes back to the database. If you don’t really understand the difference between datasets, entity streams, and change streams, have a look at the osmosis documentation.

  16. And repeat. You can now repeat steps 12 and 15 over and over to fetch the latest updates and trim the DB back down to size. These steps can also be automated using cron or similar; a sketch follows.
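
    Here’s a minimal sketch of what that automation might look like. The script name update-osmsf.sh, the log file, and the ten-minute schedule are my own choices; adjust to taste.

    #!/bin/sh
    # update-osmsf.sh -- fetch and apply the latest diffs (step 12),
    # then trim the database back down to San Francisco (step 15).
    set -e
    cd ~/osmsf
    osmosis --rri --simc --wpc authFile=osmsf.auth
    osmosis --rx empty.osm --rp authFile=osmsf.auth --dd \
        --bp file=sf-inv.poly --dc --wpc authFile=osmsf.auth

    Make it executable with chmod +x, then schedule it with a crontab entry like:

    */10 * * * * /home/yourusername/osmsf/update-osmsf.sh >> /home/yourusername/osmsf/update.log 2>&1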

Now, this method has some imperfections. Specifically, the result may not be identical to a database built by running the full download-extract-and-cut-with-osmosis process from scratch, because entities on the border will fall into both the regular and inverted poly files. Whether that matters depends on your application, but the approach dramatically reduces the amount of data you must download and process.
