Dec 23rd, 2021: [en] Enriching documents with the geo_match policy

:speaking_head: This article is also available in Spanish

As you may know, on September 19th of 2021, the Cumbre Vieja volcano eruption started in La Palma, Canary Islands, Spain. The island's government almost immediately began to publish drone surveys of the lava flow footprints in an Open Data Portal (example). I decided to upload them to the Elastic Stack to visualize and understand the evolution of the flows.

Apart from the footprints, there is also a dataset in the Open Data portal with the island buildings. I saw an excellent opportunity to conflate the lava flows and the buildings to see how the lava affects dwellings, factories, etc.

Presenting the geo_match policy for the enrich processor

The enrich processor can be used in an ingest pipeline to add data from a reference index to an ingested document, matching either on a target field or on a geospatial relationship. With a geo_match policy, given a reference index with a polygon field, we can transfer data from those polygons into the ingested document based on a defined spatial relationship. Typically we want to use the intersection relationship. So, for example, given a reference index with world countries, we can transfer the country name to an ingested document with a geometry. You can learn more about this policy in this example from the Elasticsearch documentation, and in a longer tutorial in this blog post on how to use the policy for reverse geocoding.

Our reference index will be the lava footprints, and the buildings will be the enriched documents.

Setting up the lava footprints

The first thing to address was uploading the footprints in a form that works with the geo_match policy. As the eruption advances, the area affected by the lava flows only grows. Thus, any given location can fall under several successive footprints, but we only want to transfer the first surveyed footprint that covers it.

The data upload needed to be automated to adapt the format and fix some issues. Inside this process, I added a step to compute the difference between a footprint and its predecessor, adding this new geometry (like a ribbon) to the document uploaded as a new geo_shape field named diff_geometry. This way, any location can only intersect with one document of the index.

The script chains the difference and simplify methods from the shapely library to produce this geometry difference.
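A minimal sketch of that step, assuming each footprint is already loaded as a shapely polygon. The function name footprint_diff, the sample squares, and the default tolerance are illustrative assumptions, not the author's actual code.

```python
from shapely.geometry import Polygon, mapping


def footprint_diff(current, previous, tolerance=1e-5):
    """Return the part of `current` that `previous` does not cover.

    `tolerance` (in the units of the coordinates) is a hypothetical
    default used to drop near-collinear vertices from the result.
    """
    if previous is None:
        # The first survey has no predecessor, so it is kept whole.
        return current
    ribbon = current.difference(previous)
    return ribbon.simplify(tolerance, preserve_topology=True)


# Two made-up surveys: the second one fully contains the first.
first = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
second = Polygon([(0, 0), (3, 0), (3, 3), (0, 3)])

ribbon = footprint_diff(second, first)
# GeoJSON-like dict that could be stored in the diff_geometry field.
diff_geometry = mapping(ribbon)
```

Because each ribbon excludes everything its predecessor already covered, a building can intersect at most one diff_geometry, apart from touching a shared boundary.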

Uploading the buildings

The data processing script uploads the buildings if they are not present in the cluster. After downloading the dataset and ensuring the geometries are valid, the bulk helper from the Elasticsearch Python client pushes the data to the cluster.
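As a rough sketch, the upload could be split into a generator that builds bulk actions and a thin wrapper around helpers.bulk. The GeoJSON-like feature layout (id, geometry, properties) and both function names are assumptions for illustration; the real dataset fields may differ.

```python
def building_actions(features, index="buildings"):
    """Yield one bulk action per building feature.

    Assumes each feature is a GeoJSON-like dict with `id`, `geometry`
    and an optional `properties` dict, already validated upstream.
    """
    for feature in features:
        yield {
            "_index": index,
            "_id": feature["id"],
            "_source": {
                "geometry": feature["geometry"],
                **feature.get("properties", {}),
            },
        }


def upload_buildings(client, features):
    """Push the buildings with the bulk helper of the Python client."""
    # Imported here so building_actions can be used without the client.
    from elasticsearch import helpers

    return helpers.bulk(client, building_actions(features))
```

Keeping the action generator separate makes it easy to inspect a few documents before sending anything to the cluster.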

Defining the enrich policy and ingest pipeline

We store the footprints in an index called lapalma, and we want to transfer the id and timestamp fields to the buildings, so the policy to define is:

PUT /_enrich/policy/lapalma_lookup
{
  "geo_match": {
    "indices": "lapalma",
    "match_field": "diff_geometry",
    "enrich_fields": ["id", "timestamp"]
  }
}

The next step will be to execute the policy:

POST /_enrich/policy/lapalma_lookup/_execute

:warning: Important! :warning:: We need to execute the policy every time we update the index with new footprints.
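From a Python script, the same two requests might look like the sketch below. It assumes the 8.x elasticsearch Python client, where enrich.put_policy and enrich.execute_policy take keyword arguments; the function and constant names are illustrative.

```python
POLICY_NAME = "lapalma_lookup"

# Same body as the geo_match policy defined for the lapalma index.
GEO_MATCH_POLICY = {
    "indices": "lapalma",
    "match_field": "diff_geometry",
    "enrich_fields": ["id", "timestamp"],
}


def create_policy(client):
    """Create the enrich policy (needed only once)."""
    return client.enrich.put_policy(name=POLICY_NAME, geo_match=GEO_MATCH_POLICY)


def execute_policy(client):
    """Rebuild the enrich index; call after every footprint upload."""
    return client.enrich.execute_policy(name=POLICY_NAME)
```

Keeping execution in its own function makes it natural to call it at the end of each footprint upload, which is what the warning above requires.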

With the policy ready, we can add it to an ingest pipeline that will expect a geometry field to intersect with the geometries from the policy. The enrich processor will create a new object field called footprints to store the identifier and the timestamp from the reference index. A second processor removes the matched geometry, which is also transferred to the enriched document by default.

PUT _ingest/pipeline/buildings_footprints
{
  "description": "Enrich buildings with Cumbre Vieja footprints.",
  "processors": [
    {
      "enrich": {
        "field": "geometry",
        "policy_name": "lapalma_lookup",
        "target_field": "footprints",
        "shape_relation": "INTERSECTS",
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "footprints.diff_geometry",
        "ignore_missing": true,
        "ignore_failure": true,
        "description": "Remove the shape field"
      }
    }
  ]
}

Running the ingest pipeline

Now that we have our ingest pipeline, we are going to update the buildings in place with an _update_by_query request. We won't run it against the full index; instead, we will search for the documents that satisfy two conditions and pass them through the pipeline:

  1. They are inside a bounding box of the affected area
  2. They have no footprint assigned yet

POST buildings/_update_by_query?pipeline=buildings_footprints
{
  "query": {
    "bool": {
      "must_not": [ { "exists": { "field": "footprints.id" } } ],
      "filter": {
        "geo_bounding_box": {
          "geometry": {
            "top_left": { "lat": 28.647, "lon": -17.95 },
            "bottom_right": { "lat": 28.58, "lon": -17.83 }
          }
        }
      }
    }
  }
}
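The same request could be issued from Python roughly as follows. This is a sketch assuming the 8.x elasticsearch Python client, whose update_by_query accepts index, pipeline and query keyword arguments; the helper names and the bbox tuple order (top, left, bottom, right) are illustrative choices.

```python
def pending_buildings_query(top, left, bottom, right):
    """Buildings inside the bounding box that have no footprint yet."""
    return {
        "bool": {
            "must_not": [{"exists": {"field": "footprints.id"}}],
            "filter": {
                "geo_bounding_box": {
                    "geometry": {
                        "top_left": {"lat": top, "lon": left},
                        "bottom_right": {"lat": bottom, "lon": right},
                    }
                }
            },
        }
    }


def enrich_pending_buildings(client, bbox=(28.647, -17.95, 28.58, -17.83)):
    """Run the matching buildings through the ingest pipeline."""
    return client.update_by_query(
        index="buildings",
        pipeline="buildings_footprints",
        query=pending_buildings_query(*bbox),
    )
```

Filtering out buildings that already have footprints.id keeps repeated runs cheap and preserves the first footprint assigned to each building.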

Putting it all together

I used Kibana DevTools syntax in this post for readability, but all these requests run from a Python script in a GitHub Action on every push that changes the footprint identifiers script. Unfortunately, there is no way to automatically detect the publishing of new datasets in the Open Data portal, so these identifiers need to be added manually to trigger the data update (example commit).

With the datasets in the cluster, we can explore them with Elastic Maps Time Slider and with a Kibana Dashboard.

https://ela.st/cumbre-vieja-eruption-map


https://ela.st/cumbre-vieja-eruption

Have fun! :wave:
