The OpenParcelMap is made from lots of pieces, most of which come directly from the OpenStreetMap ecosystem. To help understand how we expect OPM to work, I prepared this Component Overview inspired by OpenStreetMap’s Component Overview.
Now that we have the rails port up and running, it’s time to get some data in. This was my first foray into imports, and I ended up creating some tools…and learning a lot about why imports are difficult to get right. The following is a description of what I ended up doing. Note that I work on the command line in Ubuntu Linux. If you want to try something similar to this, you may have to adapt these instructions for your platform.
A critical note before we proceed: These steps upload parcel data to the OpenParcelMap. Please please please do not accidentally upload parcel boundaries to OpenStreetMap. In JOSM, you should navigate to Edit->Preferences->Connection Settings and change the OSM Server URL to point to http://opm.ual.berkeley.edu/api. Mac users will find Preferences under the JOSM menu instead.
- Set some useful environment variables and fetch some source code:
export SRC=$HOME/src  # or wherever you like to keep your source code
cd $SRC
git clone git://github.com/ual/ogr2osm.git
git clone git://github.com/ual/bulkyosm.git
git clone git://github.com/ual/parceltools.git
export PATH=$SRC/ogr2osm:$SRC/bulkyosm:$PATH
export PYTHONPATH=$SRC/parceltools/translations  # so ogr2osm.py can find our filter
Note that I’m advising you to pull the UAL fork of ogr2osm. I’ve got a patch pending Paul Norman’s approval that you’ll need for this stuff to work. Hopefully you’ll be able to pull his repo instead.
- Fetch the public domain parcel boundary file for the jurisdiction of interest. I pulled a shapefile of San Francisco from https://data.sfgov.org/download/3vyz-qy9p/ZIP.
- Split the shapefile into many smaller shapefiles so that they are easier to process and upload. You should perform this step wherever you prefer to keep your geodata. It creates a subdirectory (called blocks in this case) and populates it with many small shapefiles.
parcelsplit.py -l citylots citylots.shp -b BLOCK_NUM -o blocks
Note that we use the BLOCK_NUM to divide the shapefiles into groups so that we get geographically contiguous globs of parcels. This is important because the raw data has lots of overlapping points and polygons that we want to reduce. So we can’t arbitrarily divide the parcels into globs. Note also that we use the default glob size of 1000. There are a few blocks with more than 1000 parcels, and we get warnings about that.
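To make the grouping idea concrete, here is a minimal Python sketch of packing whole blocks into globs of bounded size. This is not the code in parcelsplit.py; the function name, the dict-based parcel records, and the assumption that sorting by block number keeps neighboring blocks together are all mine, for illustration only.

```python
from collections import defaultdict


def group_into_globs(parcels, key="BLOCK_NUM", max_size=1000):
    """Pack whole blocks into globs of at most max_size parcels.

    parcels: iterable of dicts that each carry a block id under `key`.
    Returns (globs, oversized): globs is a list of lists of parcels, and
    oversized lists the block ids that exceed max_size on their own.
    """
    by_block = defaultdict(list)
    for parcel in parcels:
        by_block[parcel[key]].append(parcel)

    globs, current, oversized = [], [], []
    # Assumption: adjacent block ids are geographically close.
    for block_id in sorted(by_block):
        block = by_block[block_id]
        if len(block) > max_size:
            oversized.append(block_id)  # worth a warning, never split
        if current and len(current) + len(block) > max_size:
            globs.append(current)
            current = []
        current.extend(block)
    if current:
        globs.append(current)
    return globs, oversized
```

A block is never split across globs, so a block that is too large on its own simply ends up in a glob of its own and is reported in the second return value.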
- Convert the small shapefiles to osm files:
cd blocks/
for f in citylots_*.shp; do ogr2osm.py "$f" -f -t sf.py --no-upload-false; done
Note that if you are uploading data from a different jurisdiction, you will have to prepare a different translator. Consider using sf.py as a model. There are a number of general parcel data geometry fixes (that we should surely factor out).
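For orientation, a translator for another jurisdiction might start out something like the sketch below. ogr2osm translation modules can define a filterTags hook that maps a feature's source attributes to output tags; everything else here is hypothetical. In particular, the tag key "parcel:block" is a made-up example, and this is not a copy of what sf.py does.

```python
def filterTags(attrs):
    """Map source attributes of one feature to OSM-style tags.

    attrs is a dict of field name -> string value from the shapefile.
    """
    tags = {}
    if not attrs:
        return tags
    # BLOCK_NUM is the field used in the San Francisco data; another
    # jurisdiction will almost certainly use different field names.
    block = attrs.get("BLOCK_NUM", "").strip()
    if block:
        tags["parcel:block"] = block  # hypothetical tag key
    return tags
```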
- Open each file in JOSM and ensure that it passes all the validation tests. To learn how to do this, check the validator documentation. I was able to open about 30 files at a time, each with 500-1000 parcels on my laptop (8GB RAM) without too much pain. If you get warnings or errors, this may mean that the parcelsplit.py script did not catch something that it should have. You can either fix the problem and proceed, or skip that file. In the latter case, you should consider filing a bug against parceltools so we can try to fix it.
- Upload to OpenParcelMap. Please double check that your tools are not pointing to openstreetmap.org for these steps! Be sure to specify a suitable source=XYZ tag for the change set. Also, you may want to keep the file name of the fragment in the comment so it’s easier to go back later if there are problems. For the first few files, I just uploaded the layer to OpenParcelMap from File->Upload Data after validating. This was taking a couple of minutes per file. I switched to using bulkyosm (a fork of bulk_upload_06) using this command:
for f in *.osm; do
  bulk_upload.py -H opm.ual.berkeley.edu \
    -u yourusername -p yourpassword \
    -c "initial import of san francisco parcels ($f)" \
    -i "$f" -s "https://data.sfgov.org/Geography/City-Lots-Zipped-Shapefile-Format-/3vyz-qy9p"
done
A few critical notes:
- This command uploads ALL of the .osm files in the current directory. So if you already uploaded some of them, or if you don’t wish to upload some of them, you should move them out of the way.
- If you interrupt this process, you’ll probably end up interrupting an upload. The bulk_upload.py script can resume the upload because it tracks what has successfully been sent to the server. But things can get tricky. You should probably head over to opm.ual.berkeley.edu, inspect the history, and decide what to do. You may need to manually close the change set and manually clean it up.
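Since pointing at the wrong server is the mistake I worry about most, a wrapper script could refuse to proceed when the target looks like OpenStreetMap. Here is a minimal Python sketch of such a guard. The function name is hypothetical and nothing like this exists in bulkyosm as far as I know; it only inspects the host name you give it.

```python
from urllib.parse import urlparse

FORBIDDEN_DOMAIN = "openstreetmap.org"


def is_safe_upload_host(api_url_or_host):
    """Return False if the target is openstreetmap.org or a subdomain."""
    text = api_url_or_host.strip()
    # urlparse only fills in hostname when a scheme is present.
    if "://" not in text:
        text = "http://" + text
    host = (urlparse(text).hostname or "").lower()
    if not host:
        return False  # nothing recognizable, so do not upload
    if host == FORBIDDEN_DOMAIN or host.endswith("." + FORBIDDEN_DOMAIN):
        return False
    return True
```

You would call it with the same value you pass to -H (or the server URL from JOSM) before starting the upload loop, and bail out if it returns False.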
This upload was still terribly slow. I left it running overnight and it only got through about 90 of my 186 osm files. All of the osm files together comprise about 60MB of data. I poked around the server side a bit and discovered that ruby was consuming 90% of a CPU and postgres (presumably on behalf of ruby) was consuming another 10%. So perhaps we are CPU bound on the server side somehow. I also noticed that all of the raw xml data ends up in the log files. These issues will have to be resolved before this can be viable.
We are going to experiment with using OSM infrastructure to gather, consolidate, and standardize parcel data. So we need a sandbox where it’s safe to experimentally import data. I brought up an instance of the rails port in order to do this. I did this mostly following the general rails port instructions and the Ubuntu-specific instructions. Then I fussed around with the templates and content in a completely non-upstreamable fashion. Ta da. OpenParcelMap. Many integration steps remain. Like getting planet dumps and tiles in place, for example. A few higher-level issues have also come up as I’ve gone through this process.
One thing is this: I’m starting to believe that the OpenParcelMap will rely heavily on imports of government datasets. This belief has emerged from some chats on firstname.lastname@example.org where people point out that parcel boundaries cannot really be observed by on-the-ground mappers without special information (e.g., a legal description of the parcel), special expertise, and possibly permission to access the property. Also, much parcel data (such as boundaries and zoning) changes administratively with no change on the ground at all. So for this parcel data project, OSM’s utility as a crowd-sourcing platform is perhaps less compelling than its utility as a version control system and distribution scheme for geospatial data.
Another issue that is looming on the horizon is the prospect of keeping parcel data up to date. Honestly, I haven’t given this much thought. I’m sure other OSM users have given this sort of problem much more thought than me. But that’s what this rails port is for: experimenting with these issues, trial-by-fire style.
There’s also the matter of coherency with OSM. Parcel data often comes with addresses and building info, which, unlike parcel boundaries, are currently acceptable in OSM. Should we even bother uploading address info into the OpenParcelMap? Or should this just go into OSM? This which-data-goes-where issue has come up in discussions about introducing the notion of layers in the OSM data model. I’m really not sure what to do here.
Finally, there’s the matter of licensing. One of the things I changed in the content is to make it clear that all of the data that goes into the open parcel map must be public domain. This is not a final decision. At this time we’re mulling over the pros and cons of using the ODbL, sticking with public domain, or doing something else. And by all means, if you are reading this and have some thoughts on the matter, please comment. The only evaluation criteria I’ve come up with for this choice so far are to maximize the likelihood that government agencies who maintain parcel data will participate, to maximize compatibility with existing OSM data, and/or to maximize the participation from mappers interested in managing regions of parcel data. I suppose my next post will explore this issue in more detail.
One of the things about OpenStreetMap that I find fascinating is the prospect of maintaining a routinely updated copy of the map data in your own environment. Below I’ve documented the steps I took to use readily available tools to prepare an updatable database of OSM’s San Francisco data. These steps are inspired by the book “OpenStreetMap: Using and Enhancing the Free Map of the World” by Frederik Ramm, Jochen Topf, and Steve Chilton, by the wiki pages for osmosis and other tools, and by a bit of mailing list correspondence. I work on an Ubuntu 12.04 machine, typically at the command line. You may have to adapt these steps to work for you.
- Install dependencies.
sudo apt-get install postgresql-9.1 postgresql-9.1-postgis postgresql-contrib
- Pick a working directory. I use ~/osmsf in this example.
mkdir ~/osmsf
cd ~/osmsf
- Install osmosis, the OSM swiss army knife. Here are the steps to install the latest stable release:
cd ~/osmsf
wget http://bretth.dev.openstreetmap.org/osmosis-build/osmosis-latest.tgz
tar xvf osmosis-latest.tgz
cp -ra bin lib ~
For this last bit to do the right thing, you’ll need ~/bin in your $PATH. Now check that you can successfully run osmosis:
You also may wish to configure osmosis to run with more memory by default. I set mine to use 16GB.
echo 'JAVACMD_OPTIONS="-Xmx16G"' > ~/.osmosis
- Grab some OSM data. In order to keep download and processing time manageable, I start with the geofabrik california extract.
cd ~/osmsf
wget http://download.geofabrik.de/openstreetmap/north-america/us/california.osm.pbf
Note that I’ve pulled the protocol-buffer binary format, which is more compact than the bz2 xml. You should also go poke around the place from which you downloaded the file and check the timestamp of the file. For me, the california extract was created at 26-Feb-2013 01:03. This is important for later.
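If you want to record that timestamp in a less ambiguous form, something like the following Python sketch works. It assumes the directory listing prints times in the same day-month-year form I quoted above and that your locale uses English month abbreviations. I don't know which time zone the listing uses, so the function deliberately does not attach one; check that yourself.

```python
from datetime import datetime


def listing_timestamp_to_iso(stamp):
    """Convert a listing time like '26-Feb-2013 01:03' to ISO 8601.

    No time zone is attached because the listing does not state one.
    """
    parsed = datetime.strptime(stamp.strip(), "%d-%b-%Y %H:%M")
    return parsed.isoformat()


print(listing_timestamp_to_iso("26-Feb-2013 01:03"))
```

Write the result down somewhere (a text file next to the extract is fine) so you have it when you need the matching replication state later.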
- Set up postgres, if you haven’t already. I usually create a user for myself that has ample rights. Later, if you want an automated process, or a web application, or whatever to be able to access the database with more restrictive permissions, you can add a suitable user.
sudo su postgres
createuser -P -r -d $USER
I also create a postgis template DB so that I can create postgis databases as my preferred user.
createdb -E UTF8 template_postgis
psql -d postgres -c "UPDATE pg_database SET datistemplate='true' WHERE datname='template_postgis';"
psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-1.5/postgis.sql
psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-1.5/spatial_ref_sys.sql
psql -d template_postgis -c "CREATE EXTENSION hstore;"
psql -d template_postgis -c "GRANT ALL ON geometry_columns TO PUBLIC;"
psql -d template_postgis -c "GRANT ALL ON geography_columns TO PUBLIC;"
psql -d template_postgis -c "GRANT ALL ON spatial_ref_sys TO PUBLIC;"
exit
Thanks to Miguel Araujo for this tip.
- Create a suitable database:
createdb -T template_postgis osmsf
psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6.sql
psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6_action.sql
psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6_bbox.sql
psql -d osmsf -f ~/osmsf/script/pgsnapshot_schema_0.6_linestring.sql
psql -d osmsf -f ~/osmsf/script/pgsnapshot_load_0.6.sql
You might as well create a database auth file for osmosis now too. It’s just a text file with the following format:
host=localhost
database=osmsf
user=yourusername
password=yourpassword
dbType=postgresql
I call my file osmsf.auth. You can also supply these options on the osmosis command line if you prefer. And as always, ponder the implications of passwords floating around your computer in plain text.
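One small mitigation for the plain-text password is to make sure the auth file is readable only by you. Here is a Python sketch that writes the file with owner-only permissions. The function name is mine, and the 0600 mode is my suggestion rather than anything osmosis requires.

```python
import os


def write_auth_file(path, host, database, user, password):
    """Write an osmosis-style auth file readable only by its owner."""
    lines = [
        "host=" + host,
        "database=" + database,
        "user=" + user,
        "password=" + password,
        "dbType=postgresql",
    ]
    # Create with mode 0600 so the password is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    # The mode passed to os.open is ignored if the file already existed.
    os.chmod(path, 0o600)
```

For example, write_auth_file("osmsf.auth", "localhost", "osmsf", "yourusername", "yourpassword") produces the file described above. The password still sits on disk in the clear, so this only narrows who can read it.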
- Fetch the ogr2poly utility, which we will use to create poly files from shape files.
wget http://svn.openstreetmap.org/applications/utils/osm-extract/polygons/ogr2poly.py -O ~/bin/ogr2poly.py
chmod +x ~/bin/ogr2poly.py
Again, for this to work, ~/bin must be in your path.
- Create a poly file to cut out the region of interest. Note that I pull the san francisco county file from the census website. If you are adapting these instructions to a different region, you can poke around there for suitable shapefiles. Note that these are administrative boundaries, not cartographic boundaries. So in the particular case of San Francisco, the sole polygon in this file ends up including some water.
wget http://www2.census.gov/geo/tiger/TIGER2009/06_CALIFORNIA/06075_San_Francisco_County/tl_2009_06075_cousub.zip
unzip tl_2009_06075_cousub.zip
ogr2poly.py tl_2009_06075_cousub.shp
mv tl_2009_06075_cousub_0.poly sf.poly
- Write the San Francisco data to the postgres database
osmosis --rb california.osm.pbf --bp file=sf.poly --wp authFile=osmsf.auth
If you’re happy with a static snapshot, you’re done. You can go and fire up your favorite postgres client, or maybe poke around using qgis.
- Create a configuration file for the replication update procedure.
cd ~/osmsf
osmosis --rrii
That --rrii stands for “read replication interval init”. This command creates a file in your current working directory called configuration.txt that will be used by osmosis to know the URL where the updates come from. You can also set the maximum number of seconds of OSM data that should be fetched. The comments in the autogenerated config file explain this. Here are the contents of the configuration file I prepared to do this:
baseUrl=http://planet.openstreetmap.org/replication/minute/
maxInterval = 120
- Fetch the state file needed by osmosis to know the state of the extract that you used. Remember earlier when I told you to note the timestamp on the extract file? This is where it comes in handy. Go to the replicate-sequences tool and plug in the timestamp value. Paste the text that you get back into a file called state.txt in your ~/osmsf directory.
- Fetch and apply updates
osmosis --rri --simc --wpc authFile=osmsf.auth
Note the --simc, which stands for --simplify-change. This option condenses all of the changes to a particular entity into one change. Without this option, you’ll get errors regarding duplicate keys. Also note that the state.txt file will be updated by this command such that subsequent invocations will fetch the next minutely diffs. Finally, this command applies all of the changes in the changeset, even though we only want the ones that fall within San Francisco. So we’ll have to trim the DB back down to the desired region. The strategy is to create a change set containing all of the points outside of our polygon of interest, then subtract these changes from the database.
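If the condensing step is unclear, here is a toy Python model of the idea: when several changes touch the same entity, only the last one survives. This is just an illustration of the concept, not how osmosis implements it, and the dict-based change records are my own invention.

```python
def simplify_changes(changes):
    """Keep only the last change for each (entity type, entity id).

    changes: list of dicts with at least 'type' and 'id' keys, in the
    order the changes occurred.
    """
    latest = {}
    for change in changes:
        key = (change["type"], change["id"])
        # Drop any earlier change to this entity, then record this one
        # so the result stays ordered by each entity's final change.
        latest.pop(key, None)
        latest[key] = change
    return list(latest.values())
```

With a node that is created and then modified within the same batch of diffs, only the modified version comes out the other end, so the database sees one row for that node instead of two.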
- Prepare an empty osm file. I call mine empty.osm, and here are its contents:
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.0.2">
</osm>
- Now prepare an inverted version of the sf.poly polygon file. To do this, make a copy of sf.poly called sf-inv.poly and open that file in a text editor. Prepend a polygon of the entire earth at the beginning of the file, then invert and renumber all of the existing polygons. The modified file looks like this:
tl_2009_06075_cousub_0
1
-1.800000E+02 1.800000E+02
1.800000E+02 1.800000E+02
1.800000E+02 -1.800000E+02
-1.800000E+02 -1.800000E+02
END
!2
-1.230011E+02 3.777205E+01
-1.229975E+02 3.777078E+01
...
The resulting file is a polygon representing the entire planet except for san francisco. You may wish to go study poly files if any of this is unclear.
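If you would rather not do the inversion by hand, the edit is mechanical enough to script. The following Python sketch is my own helper, not part of osmosis or ogr2poly. It assumes the usual poly layout (a name line, then numbered sections each closed by END, then a final END), reuses the whole-earth corner values from my hand-edited file, and assumes leading whitespace on coordinate lines does not matter.

```python
WORLD_RING = [
    "-1.800000E+02 1.800000E+02",
    "1.800000E+02 1.800000E+02",
    "1.800000E+02 -1.800000E+02",
    "-1.800000E+02 -1.800000E+02",
]


def invert_poly(text, world_ring=WORLD_RING, indent="   "):
    """Return poly text covering everything except the input polygons."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2 or lines[-1] != "END":
        raise ValueError("poly text must end with a closing END line")
    name, body = lines[0], lines[1:-1]  # drop the file-level END

    out = [name, "1"] + [indent + point for point in world_ring] + ["END"]
    number = 2
    expect_header = True
    for line in body:
        if expect_header:
            # Flip the section: outer rings become holes ("!n"), and
            # existing holes become ordinary rings.
            was_hole = line.startswith("!")
            out.append(("" if was_hole else "!") + str(number))
            number += 1
            expect_header = False
        elif line == "END":
            out.append("END")
            expect_header = True
        else:
            out.append(indent + line)
    out.append("END")
    return "\n".join(out) + "\n"
```

Read sf.poly, pass its contents through invert_poly, and write the result to sf-inv.poly. It is worth opening the output in a text editor once to confirm it looks like the hand-edited version.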
- Now you can trim the database to only contain the san francisco data:
osmosis --rx empty.osm --rp authFile=osmsf.auth --dd --bp file=sf-inv.poly --dc --wpc authFile=osmsf.auth
Now some explanation. The --rx reads the empty.osm and turns it into an entity stream. The --rp reads the database as a dataset, then the --dd turns the dataset into an entity stream. So now we have two input entity streams. The --bp (for “bounding polygon”) cuts anything not within sf-inv.poly out of the entity stream from the database. So now we have two entity streams, an empty one, and one with all of the entities from the database that lie outside of san francisco. --dc (for “derive change”) creates a single change stream from these two entity streams. Finally, --wpc applies the changes back to the database. If you don’t really understand the difference between datasets, entity streams, and change streams, have a look at the osmosis documentation.
- And repeat. You can now repeat steps 12 and 15 (the fetch-and-apply-updates command and the trim command) over and over to fetch the latest updates and trim the db back down to size. These steps can also be automated using cron or similar.
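As a starting point for automation, here is a small Python sketch that runs the update command and then the trim command from the working directory. The two osmosis command lines are the ones from the steps above; the function name and the idea of wrapping them in Python (rather than a shell script) are just my choice for illustration.

```python
import subprocess

AUTH = "authFile=osmsf.auth"

# Fetch and apply the next batch of replication diffs.
UPDATE_CMD = ["osmosis", "--rri", "--simc", "--wpc", AUTH]

# Remove everything that falls outside San Francisco again.
TRIM_CMD = [
    "osmosis", "--rx", "empty.osm", "--rp", AUTH, "--dd",
    "--bp", "file=sf-inv.poly", "--dc", "--wpc", AUTH,
]


def update_and_trim(workdir, run=subprocess.check_call):
    """Run the update, then the trim, inside workdir.

    check_call raises if a command fails, so the trim is skipped when
    the update does not succeed.
    """
    for command in (UPDATE_CMD, TRIM_CMD):
        run(command, cwd=workdir)
```

Calling update_and_trim with the path to your osmsf directory from a cron job would be one way to keep the database current. The working directory matters because osmosis looks for configuration.txt and state.txt there.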
Now, this method has some imperfections. Specifically, it may not be identical to the database that could be created by going through the entire download-extract-and-cut-with-osmosis process because entities on the border will fall into both the regular and inverted poly files. But it dramatically reduces the amount of data you must download and process. This may or may not be an issue depending on your application.
We really want a nationwide consolidated, standard parcel database to build upon. Such products are available from numerous proprietary data vendors who make it their business to routinely gather and consolidate data from local government agencies around the country. Of course these are often expensive and have restrictions on redistribution. Our federal government has a clearly stated and persistent vision of creating a nationwide public domain parcel database, and has made notable albeit slow progress towards this goal. Many states have managed to consolidate parcel data (e.g., Massachusetts, Montana). This is very helpful, but plenty of work is required to adapt tools or research from one state to another. And many states have no such offering. As a result, parcel data users for whom proprietary sources are too restrictive or expensive go about manually gathering the data from county agencies. If the application doesn’t span county lines, and if the county is open with their data, this may not be a problem. But these two conditions are often not both met, driving a more intensive data gathering effort. Such efforts are often duplicated for different projects.
Even when parcel data is made available openly, it often varies dramatically in quality and consistency. Some of these defects require local knowledge to be corrected. For example, if the number of dwelling units for a specific parcel is absent or implausible, this information could be corrected by an observer on the ground if a suitable interface were available. The same interface could be used by multiple organizations and individuals who use parcel data to integrate whole county datasets. These users could benefit from any tools or processes that grow around this open data. Obviously (or maybe not obviously), what I am proposing is inspired by OpenStreetMap (OSM), the wiki street map of the world that has been built on these principles. Some digging through the OSM mailing lists reveals some often controversial cases in which whole parcel datasets have been contributed to OSM. The concerns include some limitations on the existing OSM tool chain, reconciling future bulk updates from a jurisdiction with user-edited data, and the practical limitations of on-the-ground users improving or validating parcel boundaries. A subsequent correspondence that we initiated revealed strong interest in open parcel data from the OSM community, but mixed opinions about whether OSM was a suitable repository for such an effort. In light of this, we have decided to pursue the development of a separate open parcel data repository. So it begins.