Best practices on indexing/querying nested document in elasticsearch

Team,

I am reaching out to you regarding best practices on indexing/querying elasticsearch in order to minimize latencies we are seeing with our cluster.

We had a use-case where we needed to ingest data which is nested in nature and is continuously changing at source.

Considering nested queries expensive as per elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html)

we decided to covert nested document into flat document for indexing. Example below:

Nested document:

    {

    restaurantId: 1,

    name: example,

    shift: [

    {

    start: date(),

    end: date(),

    person: A

    },

    {

    start: date(),

    end: date(),

    person: B

    }

    ]

    }

Flat documents:

    {

    restaurantId: 1,

    name: example,

    shift_start: “2020-05-01T08:00:00”,

    shift_end: “2020-05-01T10:00:00”,

    person: A

    }

    {

    restaurantId: 1,

    name: example,

    shift_start: “2020-05-01T10:00:00”,

    shift_end: “2020-05-01T12:00:00”,

    person: B

    }

This presents 2 challenges, first at ingestion layer and second at querying layer.

Ingestion layer: Deleting and reindexing document rather than simply updating document based on _id field

Since data is continuously changing at source, we are updating our index every 30 seconds based on changes at document level. Rather than using PUT to update a document we have to delete and re-insert document because we don’t know unique _id before the document is ingested in elasticsearch. For example: If restaurant name is changed to example2 for restaurantId 1 we are first deleting all the documents related to restuarantId 1 using delete by query and then post new document with updated information.

Querying layer: Uniquely identifying document if nested attributes are only used for filtering purposes and not exposed to client

We will be using nested attributes (shift_start, shift_end, person) only for filtering purposes and they won’t be returned as part of response to the client queries.

For example if I perform search on above documents with query : give me all records where shift_start greater than 2020-05-01T00:00:00 then elasticsearch will return both the documents.

While to the client we will return 2 records with same attributes value combination (restaurantId and name). To deal with this situation we decided to use composite aggregation to aggregate results based restaurantId and name.

Cluster metrics:

Index size: 6 GiB

Number of documents: 18,743,175

Shards (automatically calculated by elasticsearch): 22 shards (11 primary, 11 replicas)

Number of data nodes: 3

Number of dedicated nodes: 3

Instance type: r5.large.elasticsearch

Data nodes storage type: EBS

EBS volume type: General Purpose (SSD)

EBS volume size: 50 GiB

With above setup, ingestion of around 1,00,000 documents is taking around 17 seconds and querying the data using composite aggregation is taking more than 3 seconds if query has more filters and more than 20 seconds for * query.

What you have done to optimize above setup?

Considering above scenario and data requirements any suggestions on how we can improve our query latencies. We are expecting query latencies to be under 2 seconds.

Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.