Dec 19th, 2019 [EN][Elasticsearch] Simplifying Ingest Pipelines with the new Enrich Processor

When ingesting data through a regular Elasticsearch ingest pipeline (e.g. with dissect, rename, or remove processors), we can now add an Enrich Processor.

This allows us to do lookups against other Elasticsearch indices and enrich the incoming document before it is stored in its own index.

Enrich Policies are what power this new functionality in Elasticsearch version 7.5. At this point, we can enrich data based on geolocation or by matching exact values using a term query.

Let’s use an example to demonstrate this new capability.

I’m an avid reader at my local library. They have ingested their catalog into Elasticsearch, and each book is searchable in the index book-catalog by title, author and ISBN (the International Standard Book Number). Here is a sample of the catalog:

POST book-catalog/_doc/
{
    "isbn-10": "0571276547",
    "author": "Paul Auster",
    "title": "The Brooklyn Follies"
}

POST book-catalog/_doc/
{
    "isbn-10": "0571283209",
    "author": "Paul Auster",
    "title": "Winter Journal"
}

POST book-catalog/_doc/
{
    "isbn-10": "2298068968",
    "author": "Marc Levy",
    "title": "Un sentiment plus fort que la peur"
}
    
POST book-catalog/_doc/
{ 
    "isbn-10": "086068511",
    "author": "Maya Angelou",
    "title": "I Know Why The Caged Bird Sings"
}

POST book-catalog/_doc/
{
    "isbn-10": "163286696",
    "author": "James Rhodes",
    "title": "Instrumental"
}

Each time someone borrows a book, the library creates a document in another index, let’s call it book-lending, with the user id (which we will assume is the user’s e-mail) and the book’s ISBN. For example:

POST book-lending/_doc
{
    "isbn-10": "163286696",
    "user": "mim@mail.net"
}

Later on, they would like to explore who each user’s favorite authors are. Or, whenever a new book by a given author is acquired, inform the users who have borrowed books by that author in the past.

To get that information in a single query, my librarians can enrich the book lending data at ingest time, so that each book-lending document holds not only the book’s ISBN but also its title and author.

This can be achieved as follows.

  1. Create an Enrich Policy that matches the isbn-10 field from the incoming document to the book-catalog index.

    PUT /_enrich/policy/book-catalog
    {
      "match": {
        "indices": "book-catalog",
        "match_field": "isbn-10",
        "enrich_fields": ["author", "title"]
      }
    }
    
  2. Execute the previous policy.

    PUT /_enrich/policy/book-catalog/_execute
    

    Policies must be executed before any enrich processor can make use of them. Elasticsearch will create a system index, the enrich index, that the processor will use to enrich incoming documents.

    Let’s check it out:

    GET _cat/indices/.enrich-*?v&s=index&h=index,docs.count,store.size
    
    index                              docs.count store.size
    .enrich-book-catalog-1575834789863          5      6.8kb
    

    Enrich indices are read-only and force merged for fast retrieval. If new data is ingested into the source index (book-catalog in this example), we will have to execute the policy again, which creates a new enrich index.

  3. Create an ingest pipeline that makes use of that enrich policy. In the example, the enrich processor uses our match policy to add a field called book-details to the incoming book lending documents. That field will hold the book’s author and title, looked up by its isbn-10 code.

    PUT _ingest/pipeline/enrich_book_lending
    {
      "processors": [
        {
          "enrich": {
            "policy_name": "book-catalog",
            "field": "isbn-10",
            "target_field": "book-details"
          }
        }
      ]
    }
    
  4. Ingest the documents into the book-lending index using the pipeline created in the step above.

    POST book-lending/_doc?pipeline=enrich_book_lending
    {
      "isbn-10": "163286696",
      "user": "mim@mail.net"
    }
    
    POST book-lending/_doc?pipeline=enrich_book_lending
    {
      "isbn-10": "0571276547",
      "user": "mim@mail.net"
    }
    
    POST book-lending/_doc?pipeline=enrich_book_lending
    {
      "isbn-10": "2298068968",
      "user": "mim@mail.net"
    }
    
    POST book-lending/_doc?pipeline=enrich_book_lending
    {
      "isbn-10": "086068511",
      "user": "mim@mail.net"
    }
    
    POST book-lending/_doc?pipeline=enrich_book_lending
    {
      "isbn-10": "0571283209",
      "user": "mim@mail.net"
    }
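
To make the four steps more concrete, here is a minimal plain-Python sketch of what a match policy and the enrich processor do conceptually. It does not talk to Elasticsearch. The helper names `execute_policy` and `enrich` are made up for this illustration, and the last catalog entry uses a placeholder ISBN, author and title that are not part of the library's data. It also assumes that a document without a match is passed through unchanged.

```python
# Conceptual model only: plain Python, no Elasticsearch involved.
# execute_policy() and enrich() are hypothetical helper names.

catalog = [
    {"isbn-10": "0571276547", "author": "Paul Auster", "title": "The Brooklyn Follies"},
    {"isbn-10": "163286696", "author": "James Rhodes", "title": "Instrumental"},
]

# Mirrors the "match" policy from step 1.
policy = {"match_field": "isbn-10", "enrich_fields": ["author", "title"]}


def execute_policy(policy, source_docs):
    """Build a snapshot lookup table, similar in spirit to the enrich index of step 2."""
    keep = [policy["match_field"], *policy["enrich_fields"]]
    return {
        doc[policy["match_field"]]: {key: doc[key] for key in keep if key in doc}
        for doc in source_docs
    }


def enrich(doc, enrich_index, field, target_field):
    """Copy the matching catalog entry into target_field, like the processor of step 3."""
    match = enrich_index.get(doc.get(field))
    if match is None:
        return doc  # assumption: no match means the document is left as it is
    return {**doc, target_field: dict(match)}


enrich_index = execute_policy(policy, catalog)
lending = {"isbn-10": "163286696", "user": "mim@mail.net"}
enriched = enrich(lending, enrich_index, "isbn-10", "book-details")
print(enriched["book-details"])

# A book added to the source later is not visible until the policy runs again.
# Placeholder values, not real catalog data:
catalog.append({"isbn-10": "0000000000", "author": "Hypothetical Author", "title": "Hypothetical Book"})
new_loan = {"isbn-10": "0000000000", "user": "mim@mail.net"}

stale = enrich(new_loan, enrich_index, "isbn-10", "book-details")
enrich_index = execute_policy(policy, catalog)  # "re-execute" the policy
fresh = enrich(new_loan, enrich_index, "isbn-10", "book-details")

print("book-details" in stale, "book-details" in fresh)
```

The second half of the sketch is the reason the policy has to be executed again whenever the catalog changes: the lookup table is a snapshot, not a live view of the source index.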
    

If we now have a look at the documents ingested, we will see the book information was added under the field book-details:

GET book-lending/_search?filter_path=hits.hits._source

{
    "hits" : {
        "hits" : [
            {
                "_source" : {
                    "isbn-10" : "163286696",
                    "book-details" : {
                        "isbn-10" : "163286696",
                        "author" : "James Rhodes",
                        "title" : "Instrumental"
                    },
                    "user" : "mim@mail.net"
                }
            },
…

With the enriched documents, my librarians can find out which users like to read Paul Auster and send them an e-mail when a new book comes in:

GET book-lending/_search?filter_path=hits.hits._source.user
{
    "query": {
        "bool": {
            "filter": [
                { "term": { "book-details.author.keyword": "Paul Auster" } }
            ]
        }
    },
    "collapse": {
        "field": "user.keyword"
    }
}
__________________
{
    "hits" : {
        "hits" : [
            {
                "_source" : {
                    "user" : "mim@mail.net"
                }
            }
        ]
    }
}
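
For readers less familiar with the query DSL, the request above boils down to "keep the loans whose enriched author matches, then return each user only once". Here is a small plain-Python sketch of that logic, assuming enriched documents shaped like the ones shown earlier. The function name `readers_of` is made up for this illustration and nothing here calls Elasticsearch.

```python
# Plain-Python model of the term filter plus collapse; readers_of() is a hypothetical name.

loans = [
    {"user": "mim@mail.net", "book-details": {"author": "James Rhodes", "title": "Instrumental"}},
    {"user": "mim@mail.net", "book-details": {"author": "Paul Auster", "title": "The Brooklyn Follies"}},
    {"user": "mim@mail.net", "book-details": {"author": "Paul Auster", "title": "Winter Journal"}},
]


def readers_of(author, docs):
    """Return each user who borrowed a book by the given author, once per user."""
    users = []
    for doc in docs:
        is_match = doc["book-details"]["author"] == author  # the term filter
        if is_match and doc["user"] not in users:           # the collapse on user
            users.append(doc["user"])
    return users


print(readers_of("Paul Auster", loans))  # ['mim@mail.net']
```

Without the collapse step, mim@mail.net would show up twice, once per Paul Auster book borrowed.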

Or what a given user’s favorite author is:

GET book-lending/_search?filter_path=aggregations.top_author.buckets
{
    "size": 0,
    "query": {
        "bool": {
            "filter": { "term": { "user.keyword": "mim@mail.net" } }
        }
    },
    "aggs": {
        "top_author": {
            "terms": {
                "field": "book-details.author.keyword",
                "size": 1
            }
        }
    }
}
____________________________

{
    "aggregations" : {
        "top_author" : {
            "buckets" : [
                {
                    "key" : "Paul Auster",
                    "doc_count" : 2
                }
            ]
        }
    }
}
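
The terms aggregation with `"size": 1` is essentially a "count per author, keep the biggest bucket" operation. As a rough plain-Python equivalent, using the five loans ingested earlier and a made-up helper name `top_author`, it could be sketched like this:

```python
from collections import Counter

# (user, author) pairs for the five enriched loans ingested above.
loans = [
    ("mim@mail.net", "James Rhodes"),
    ("mim@mail.net", "Paul Auster"),
    ("mim@mail.net", "Marc Levy"),
    ("mim@mail.net", "Maya Angelou"),
    ("mim@mail.net", "Paul Auster"),
]


def top_author(user, pairs):
    """Count loans per author for one user and keep the largest bucket."""
    counts = Counter(author for borrower, author in pairs if borrower == user)
    return counts.most_common(1)


print(top_author("mim@mail.net", loans))  # [('Paul Auster', 2)]
```

The advantage of doing this in Elasticsearch rather than in application code is that the counting happens where the data lives, which only works because the author was copied into each lending document at ingest time.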

Enjoy reading books and take the enrich processor for a spin!
