Transform OSM (history) PBF files into GeoParquet. Enrich with OSM changeset metadata and country information.

GIScience/ohsome-planet

ohsome-planet

The ohsome-planet tool can be used to:

  1. Transform OSM (history) PBF files into Parquet format with native GEO support.
  2. Turn an OSM changeset file (osm.bz2) into a PostgreSQL database table.
  3. Keep both datasets up-to-date by ingesting OSM planet replication files.

ohsome-planet builds the actual geometries of OSM elements (nodes, ways and relations). It enriches each element with changeset data such as hashtags, OSM editor or username. Additionally, it is possible to add country ISO codes to each element by providing a boundary dataset.

The output of ohsome-planet can be used to perform a wide range of geospatial analyses with tools such as DuckDB, Python GeoPandas or QGIS. It is also possible to display the data directly on a map and explore it.

Installation

Installation requires Java 21.

First, clone the repository and its submodules. Then, build it with Maven.

git clone --recurse-submodules https://github.com/GIScience/ohsome-planet.git
cd ohsome-planet
./mvnw clean package -DskipTests

Usage

To see the help page of the ohsome-planet CLI run:

java -jar ohsome-planet-cli/target/ohsome-planet.jar --help

Tutorial

In this tutorial we will run the three main modes of ohsome-planet:

  1. Contributions: OSM Extract (.pbf) --> Parquet
  2. Changesets: OSM Changesets (.bz2) --> PostgreSQL
  3. Replication: OSM Replication Files (.osc) --> Parquet / PostgreSQL

Contributions (Parquet)

Transform OSM (history/latest) .pbf file into Parquet format.

You can download the latest or history OSM extract (osm.pbf) for the whole planet from the OSM Planet server or for small regions from Geofabrik.

Throughout this tutorial we are going to use a small extract of OSM for Karlsruhe from Geofabrik (karlsruhe-regbez-latest.osm.pbf). Karlsruhe is a city in Germany.

To process any given .pbf file, run ohsome-planet with the contributions command and at least the --pbf (input file) and --data (output directory) arguments:

java -jar ohsome-planet-cli/target/ohsome-planet.jar \
    contributions \
    --data data/ \
    --pbf karlsruhe-regbez-latest.osm.pbf

Additional arguments like --parallel, --country-file, --changeset-db and --overwrite are optional. Find out more about these on the documentation site of the CLI or the help text of the CLI:

java -jar ohsome-planet-cli/target/ohsome-planet.jar \
    contributions \
    --help

When using a history PBF file, the output files are split into history and latest contributions. All contributions which are a) not deleted and b) visible in OSM at the timestamp of the extract are considered as latest. The remaining contributions (deleted or old versions) are considered as history.

The number of threads (--parallel parameter) defines the number of files which will be created.

To see the files created and the directory structure run:

tree data/contributions
data/contributions/
├── history
│   ├── relation-0-history-contribs.parquet
│   └── ...
└── latest
    ├── node-0-latest-contribs.parquet
    ├── relation-0-latest-contribs.parquet
    ├── way-0-latest-contribs.parquet
    └── ...
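
To check how many files a run produced per OSM type, the listing above can be summarised with a short script. This is a minimal sketch: the file name pattern is inferred from the example listing only, and the function names are hypothetical helpers, not part of ohsome-planet.

```python
import re
from collections import Counter
from pathlib import Path

# Pattern inferred from the example listing, e.g. "node-0-latest-contribs.parquet".
# This is an assumption; adjust it if your output files are named differently.
FILE_PATTERN = re.compile(
    r"^(node|way|relation)-(\d+)-(latest|history)-contribs\.parquet$"
)


def count_output_files(names):
    """Count files per (status, osm_type); names not matching the pattern are ignored."""
    counts = Counter()
    for name in names:
        match = FILE_PATTERN.match(name)
        if match:
            osm_type, _index, status = match.groups()
            counts[(status, osm_type)] += 1
    return counts


def count_in_directory(root="data/contributions"):
    """Collect all Parquet file names below root and count them."""
    return count_output_files(p.name for p in Path(root).rglob("*.parquet"))


if __name__ == "__main__":
    for (status, osm_type), n in sorted(count_in_directory().items()):
        print(f"{status:8} {osm_type:9} {n}")
```

Running the script from the directory that contains data/ prints one line per status and OSM type with the number of files found.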

To explore the data with DuckDB run:

duckdb -s "DESCRIBE FROM read_parquet('data/contributions/*/*.parquet');"
┌───────────────────┬─────────────────────────────────────────────────────────┐
│    column_name    │                       column_type                       │
│      varchar      │                         varchar                         │
├───────────────────┼─────────────────────────────────────────────────────────┤
│ status            │ VARCHAR                                                 │
│ valid_from        │ TIMESTAMP WITH TIME ZONE                                │
│ valid_to          │ TIMESTAMP WITH TIME ZONE                                │
│ osm_type          │ VARCHAR                                                 │
│ osm_id            │ BIGINT                                                  │
│ osm_version       │ INTEGER                                                 │
│ osm_minor_version │ INTEGER                                                 │
│ osm_edits         │ INTEGER                                                 │
│ osm_last_edit     │ TIMESTAMP WITH TIME ZONE                                │
│ user              │ STRUCT(id INTEGER, "name" VARCHAR)                      │
│ tags              │ MAP(VARCHAR, VARCHAR)                                   │
│ tags_before       │ MAP(VARCHAR, VARCHAR)                                   │
│ changeset         │ STRUCT(id BIGINT, created_at TIMESTAMP WITH TIME ZONE…  │
│ bbox              │ STRUCT(xmin DOUBLE, ymin DOUBLE, xmax DOUBLE, ymax DO…  │
│ centroid          │ STRUCT(x DOUBLE, y DOUBLE)                              │
│ xzcode            │ STRUCT("level" INTEGER, code BIGINT)                    │
│ geometry_type     │ VARCHAR                                                 │
│ geometry          │ BLOB                                                    │
│ area              │ DOUBLE                                                  │
│ area_delta        │ DOUBLE                                                  │
│ length            │ DOUBLE                                                  │
│ length_delta      │ DOUBLE                                                  │
│ contrib_type      │ VARCHAR                                                 │
│ refs_count        │ INTEGER                                                 │
│ refs              │ BIGINT[]                                                │
│ members_count     │ INTEGER                                                 │
│ members           │ STRUCT("type" VARCHAR, id BIGINT, "timestamp" TIMESTA…  │
│ countries         │ VARCHAR[]                                               │
│ build_time        │ BIGINT                                                  │
├───────────────────┴─────────────────────────────────────────────────────────┤
│ 29 rows                                                                     │
└─────────────────────────────────────────────────────────────────────────────┘

To explore the data with QGIS run:

qgis data/contributions/latest/

Changesets (PostgreSQL)

Import an OSM changesets .bz2 file into PostgreSQL.

First, create an empty PostgreSQL database with the PostGIS extension or provide a connection to an existing database. For instance, you can set it up like this:

export OHSOME_PLANET_DB_USER=ohsomedb
export OHSOME_PLANET_DB_PASSWORD=mysecretpassword

docker run -d \
    --name ohsome_planet_changeset_db \
    -e POSTGRES_PASSWORD=$OHSOME_PLANET_DB_PASSWORD \
    -e POSTGRES_USER=$OHSOME_PLANET_DB_USER \
    -p 5432:5432 \
    postgis/postgis

Second, download the full changeset file from the OSM Planet server. If you want to clip the extent to a smaller region, you can use the changeset-filter command of the osmium command-line tool. This might take a few minutes. Currently, there is no provider for pre-processed or regional changeset file extracts.

osmium changeset-filter \
    --bbox=8.319,48.962,8.475,49.037 \
    --output=changesets-latest-karlsruhe-regbez.osm.bz2 \
    changesets-latest.osm.bz2

Then, process the OSM changesets .bz2 file like in the following example.

java -jar ohsome-planet-cli/target/ohsome-planet.jar \
    changesets \
    --bz2 changesets-latest-karlsruhe-regbez.osm.bz2 \
    --changeset-db "jdbc:postgresql://localhost:5432/postgres?user=$OHSOME_PLANET_DB_USER&password=$OHSOME_PLANET_DB_PASSWORD" \
    --create-tables \
    --overwrite

The parameters --create-tables and --overwrite are optional. Find more detailed information on usage here: docs/CLI.md. To see all available parameters, call the tool with the --help parameter.
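
The --changeset-db value is a JDBC URL in which user and password appear as query parameters, so a password containing characters such as & or = would split the URL in the wrong place. The following is a minimal sketch of building the URL with percent-encoded credentials. The helper name build_jdbc_url is hypothetical, the defaults mirror the Docker example above, and it assumes the JDBC driver decodes percent-encoded parameter values.

```python
import os
from urllib.parse import quote


def build_jdbc_url(user, password, host="localhost", port=5432, database="postgres"):
    """Build a PostgreSQL JDBC URL with percent-encoded user and password.

    Hypothetical helper; host, port and database default to the values
    used in the Docker example of this tutorial.
    """
    return (
        f"jdbc:postgresql://{host}:{port}/{database}"
        f"?user={quote(user, safe='')}&password={quote(password, safe='')}"
    )


if __name__ == "__main__":
    # Reads the same environment variables that are exported in the tutorial.
    print(
        build_jdbc_url(
            os.environ["OHSOME_PLANET_DB_USER"],
            os.environ["OHSOME_PLANET_DB_PASSWORD"],
        )
    )
```

The printed URL can then be passed, in double quotes, as the value of --changeset-db.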

Replications (Parquet / PostgreSQL)

Transform OSM replication .osc files into Parquet format.

Keep changeset PostgreSQL database up-to-date.

The ohsome-planet tool can also be used to generate updates from the replication files provided by the OSM Planet server. Geofabrik also provides updates for regional extracts.

If you want to update both datasets your command should look like this:

java -jar ohsome-planet-cli/target/ohsome-planet.jar replications \
    --data path/to/data \
    --changeset-db "jdbc:postgresql://localhost:5432/postgres?user=your_user&password=your_password" \
    --parallel 8 \
    --country-file data/world.csv \
    --parquet-data path/to/parquet/output/ \
    --continue

As with the contributions command, the --parallel, --country-file and --parquet-data arguments are optional. The optional --continue flag runs the update process as a continuous service that waits for and fetches new changes from the OSM Planet server. To update only changesets, use the --just-changesets flag. To update only contributions, use --just-contributions.

Find more detailed information on usage here: docs/CLI.md. To see all available parameters, call the tool with the --help parameter.

Contributions will be written as Parquet files matching those found in the replication source. This mimics the directory structure of the OSM Planet server. You can use the top-level state files (state.txt or state.csv) to find the most recent sequence number.

/data/ohsome-planet/berlin
└── updates
    ├── 006
    │   ├── 942
    │   │   ├── 650.opc.parquet
    │   │   ├── 650.state.txt
    │   │   ├── ...
    │   │   ├── 001.opc.parquet
    │   │   └── 001.state.txt
    │   ├── 941
    │   ├── ...
    │   └── 001
    ├── state.csv
    └── state.txt
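
The directory layout above can be mapped to and from a sequence number. The sketch below assumes the convention used by OSM replication servers, where the sequence number is zero-padded to nine digits and split into three groups of three. The function names and the default suffix are illustrative and are taken from the example tree only.

```python
def sequence_to_path(sequence, suffix=".opc.parquet"):
    """Map a replication sequence number to a relative path such as 006/942/650.opc.parquet."""
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence number must fit into nine digits")
    padded = f"{sequence:09d}"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}{suffix}"


def path_to_sequence(path):
    """Inverse of sequence_to_path: recover the sequence number from a relative path."""
    parts = path.split("/")
    # The last three path components carry the nine digits; the file name
    # may have further extensions after the first dot.
    top, middle, filename = parts[-3], parts[-2], parts[-1]
    return int(top + middle + filename.split(".")[0])


if __name__ == "__main__":
    print(sequence_to_path(6942650))
    print(path_to_sequence("006/942/650.state.txt"))
```

Under this assumption, sequence number 6942650 corresponds to the file 006/942/650.opc.parquet shown in the tree.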

Inspect Results

You can inspect your results easily using DuckDB. Take a look at our collection of useful queries to find many analysis examples.

-- list all columns
DESCRIBE FROM read_parquet('contributions/*/*.parquet');

-- result
┌───────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name    │                                                                                  column_type                                                                                   │  null   │   key   │ default │  extra  │
│      varchar      │                                                                                    varchar                                                                                     │ varchar │ varchar │ varchar │ varchar │
├───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ status            │ VARCHAR                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ valid_from        │ TIMESTAMP WITH TIME ZONE                                                                                                                                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ valid_to          │ TIMESTAMP WITH TIME ZONE                                                                                                                                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_type          │ VARCHAR                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_id            │ BIGINT                                                                                                                                                                         │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_version       │ INTEGER                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_minor_version │ INTEGER                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_edits         │ INTEGER                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_last_edit     │ TIMESTAMP WITH TIME ZONE                                                                                                                                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ user              │ STRUCT(id INTEGER, "name" VARCHAR)                                                                                                                                             │ YES     │ NULL    │ NULL    │ NULL    │
│ tags              │ MAP(VARCHAR, VARCHAR)                                                                                                                                                          │ YES     │ NULL    │ NULL    │ NULL    │
│ tags_before       │ MAP(VARCHAR, VARCHAR)                                                                                                                                                          │ YES     │ NULL    │ NULL    │ NULL    │
│ changeset         │ STRUCT(id BIGINT, created_at TIMESTAMP WITH TIME ZONE, closed_at TIMESTAMP WITH TIME ZONE, tags MAP(VARCHAR, VARCHAR), hashtags VARCHAR[], editor VARCHAR, numChanges INTEGER) │ YES     │ NULL    │ NULL    │ NULL    │
│ bbox              │ STRUCT(xmin DOUBLE, ymin DOUBLE, xmax DOUBLE, ymax DOUBLE)                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ centroid          │ STRUCT(x DOUBLE, y DOUBLE)                                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ xzcode            │ STRUCT("level" INTEGER, code BIGINT)                                                                                                                                           │ YES     │ NULL    │ NULL    │ NULL    │
│ geometry_type     │ VARCHAR                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ geometry          │ GEOMETRY                                                                                                                                                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ area              │ DOUBLE                                                                                                                                                                         │ YES     │ NULL    │ NULL    │ NULL    │
│ area_delta        │ DOUBLE                                                                                                                                                                         │ YES     │ NULL    │ NULL    │ NULL    │
│ length            │ DOUBLE                                                                                                                                                                         │ YES     │ NULL    │ NULL    │ NULL    │
│ length_delta      │ DOUBLE                                                                                                                                                                         │ YES     │ NULL    │ NULL    │ NULL    │
│ contrib_type      │ VARCHAR                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ refs_count        │ INTEGER                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ refs              │ BIGINT[]                                                                                                                                                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ members_count     │ INTEGER                                                                                                                                                                        │ YES     │ NULL    │ NULL    │ NULL    │
│ members           │ STRUCT("type" VARCHAR, id BIGINT, "timestamp" TIMESTAMP WITH TIME ZONE, "role" VARCHAR, geometry_type VARCHAR, geometry BLOB)[]                                                │ YES     │ NULL    │ NULL    │ NULL    │
│ countries         │ VARCHAR[]                                                                                                                                                                      │ YES     │ NULL    │ NULL    │ NULL    │
│ build_time        │ BIGINT                                                                                                                                                                         │ YES     │ NULL    │ NULL    │ NULL    │
├───────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 29 rows                                                                                                                                                                                                                          6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Getting Started as Developer

This is a list of resources that you might want to take a look at to get a better understanding of the core concepts used for this project. In general, you should gain some understanding of the raw OSM (history) data format and know how to build geometries from nodes, ways and relations. Furthermore, knowledge about (Geo)Parquet files is useful as well.

What is the OSM PBF File Format?

What is parquet?

What is RocksDB?

  • RocksDB is a storage engine with key/value interface, where keys and values are arbitrary byte streams. It is a C++ library. It was developed at Facebook based on LevelDB and provides backwards-compatible support for LevelDB APIs.
  • https://github.com/facebook/rocksdb/wiki

How to build OSM geometries (for multipolygons)?

Further Notes

  • For relations that consist of more than 500 members we skip MultiPolygon geometry building and fall back to GeometryCollection. Check MEMBERS_THRESHOLD in ohsome-contributions/src/main/java/org/heigit/ohsome/contributions/contrib/ContributionGeometry.java.
  • For contributions with status deleted we use the geometry of the previous version. This allows you to apply spatial filters, e.g. by bounding box, to deleted elements as well. Strictly speaking, deleted elements do not have any geometry in OSM.
