Importing San Francisco Parcel Boundaries

Now that we have the rails port up and running, it’s time to get some data in. This was my first foray into imports, and I ended up creating some tools…and learning a lot about why imports are difficult to get right. The following is a description of what I ended up doing. Note that I work on the command line in Ubuntu Linux. If you want to try something similar to this, you may have to adapt these instructions for your platform.

A critical note before we proceed: these steps upload parcel data to OpenParcelMap. Please please please do not accidentally upload parcel boundaries to OpenStreetMap. In JOSM, navigate to Edit->Preferences->Connection Settings and change the OSM Server URL to point to http://opm.ual.berkeley.edu/api. Mac users will find Preferences under the JOSM application menu.
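
If you manage JOSM settings by hand, the same URL lives in JOSM’s advanced preferences; the key below is the standard one, but double check it against your JOSM version:

    osm-server.url=http://opm.ual.berkeley.edu/api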

  • Set some useful environment variables and fetch some source code:
    export SRC=$HOME/src # or wherever you like to keep your source code
    cd $SRC
    git clone git://github.com/ual/ogr2osm.git
    git clone git://github.com/ual/bulkyosm.git
    git clone git://github.com/ual/parceltools.git
    export PATH=$HOME/src/ogr2osm:$HOME/src/bulkyosm:$PATH
    export PYTHONPATH=$HOME/src/parceltools/translations # so ogr2osm.py can find our filter
    

    Note that I’m advising you to pull the UAL fork of ogr2osm. I’ve got a patch pending Paul Norman’s approval that you’ll need for this to work. Hopefully you’ll soon be able to pull his repo instead.

  • Fetch the public domain parcel boundary file for the jurisdiction of interest. I pulled a shapefile of San Francisco from https://data.sfgov.org/download/3vyz-qy9p/ZIP.
  • Split the shapefile into many smaller shapefiles so that they are easier to process and upload. You should perform this step wherever you prefer to keep your geodata. It creates a subdirectory (called blocks in this case) and populates it with many small shapefiles.
    parcelsplit.py -l citylots citylots.shp -b BLOCK_NUM -o blocks
    

    Note that we split on BLOCK_NUM so that each glob of parcels is geographically contiguous. This is important because the raw data has lots of overlapping points and polygons that we want to reduce, so we can’t divide the parcels into globs arbitrarily. Note also that we use the default glob size of 1000; a few blocks contain more than 1000 parcels, and we get warnings about those. The sketch below illustrates the grouping idea.
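
    Here is a minimal sketch of that grouping logic, not the actual parcelsplit.py implementation; it assumes GDAL’s Python bindings and the citylots shapefile from above, and it only counts the globs rather than writing them out:

    from collections import defaultdict
    from osgeo import ogr  # GDAL Python bindings

    MAX_GLOB = 1000  # the default glob size mentioned above

    src = ogr.Open('citylots.shp')
    layer = src.GetLayer(0)  # the citylots layer

    # Bucket parcel feature ids by block so globs stay contiguous.
    blocks = defaultdict(list)
    for feature in layer:
        blocks[feature.GetField('BLOCK_NUM')].append(feature.GetFID())

    # Pack whole blocks into globs of at most MAX_GLOB parcels.
    globs, current = [], []
    for block, fids in blocks.items():
        if len(fids) > MAX_GLOB:
            print('warning: block %s has %d parcels' % (block, len(fids)))
        if current and len(current) + len(fids) > MAX_GLOB:
            globs.append(current)
            current = []
        current.extend(fids)
    if current:
        globs.append(current)
    print('%d globs from %d blocks' % (len(globs), len(blocks)))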

  • Convert the small shapefiles to osm files:

    cd blocks/
    for f in citylots_*.shp; do ogr2osm.py $f -f -t sf.py --no-upload-false; done
    

    Note that if you are uploading data from a different jurisdiction, you will have to prepare a different translation file; consider using sf.py as a model. It also applies a number of general parcel geometry fixes (which we should surely factor out into a shared module). A minimal example of what a translation file looks like follows.
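
    For reference, an ogr2osm translation file is just a Python module that defines a filterTags hook. The sketch below is hypothetical (the attribute and tag names are made up for illustration; sf.py does considerably more):

    def filterTags(attrs):
        """Map source shapefile attributes to OpenParcelMap tags."""
        if not attrs:
            return None
        tags = {}
        # Hypothetical field names; inspect your shapefile's attributes.
        if 'BLOCK_NUM' in attrs:
            tags['sfgov:block_num'] = attrs['BLOCK_NUM']
        if 'LOT_NUM' in attrs:
            tags['sfgov:lot_num'] = attrs['LOT_NUM']
        return tags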

  • Open each file in JOSM and ensure that it passes all the validation tests; to learn how, check the validator documentation. I was able to open about 30 files at a time, each with 500-1000 parcels, on my laptop (8GB RAM) without too much pain. If you get warnings or errors, it may mean that the parcelsplit.py script did not catch something that it should have. You can either fix the problem and proceed, or skip that file; in the latter case, please consider filing a bug against parceltools so we can try to fix it.
  • Upload to OpenParcelMap. Please double check that your tools are not pointing to openstreetmap.org for these steps! Be sure to specify a suitable source=XYZ tag for the changeset. Also, you may want to keep the file name of the fragment in the changeset comment so it’s easier to go back later if there are problems. For the first few files, I just uploaded the layer to OpenParcelMap from File->Upload Data after validating, but this was taking a couple of minutes per file, so I switched to bulkyosm (a fork of bulk_upload_06) using this command:
    for f in *.osm; do bulk_upload.py -H opm.ual.berkeley.edu \
         -u yourusername -p yourpassword \
         -c "initial import of san francisco parcels ($f)" \
         -i $f -s "https://data.sfgov.org/Geography/City-Lots-Zipped-Shapefile-Format-/3vyz-qy9p"; \
    done
    

    A few critical notes:

    • This command uploads ALL of the .osm files in the current directory, so if you have already uploaded some of them, or don’t wish to upload some of them, move them out of the way first.
    • If you interrupt this process, you’ll probably end up interrupting an upload. The bulk_upload.py script can resume an upload because it tracks what has successfully been sent to the server, but things can get tricky. You should probably head over to opm.ual.berkeley.edu, inspect the history, and decide what to do; you may need to manually close the changeset and clean it up, as in the sketch below.
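
    Closing an abandoned changeset by hand can be done through the standard OSM API 0.6, which the rails port exposes. This is a sketch; the changeset id is hypothetical, and you’d look up the real one in the server’s history:

    import requests

    SERVER = 'http://opm.ual.berkeley.edu'
    CHANGESET_ID = 12345  # hypothetical; find the real id in the history

    # PUT /api/0.6/changeset/<id>/close marks the changeset as closed.
    r = requests.put('%s/api/0.6/changeset/%d/close' % (SERVER, CHANGESET_ID),
                     auth=('yourusername', 'yourpassword'))
    r.raise_for_status()
    print('changeset %d closed' % CHANGESET_ID)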

This upload was still terribly slow. I left it running overnight and it only got through about 90 of my 186 osm files; all of the osm files together comprise about 60MB of data. I poked around on the server side and discovered that ruby was consuming 90% of a CPU and postgres (presumably on behalf of ruby) was consuming another 10%, so perhaps we are CPU bound on the server side somehow. I also noticed that all of the raw XML data ends up in the log files. These issues will have to be resolved before this can be viable.
