Team,
I am reaching out to you regarding best practices for indexing and querying Elasticsearch, in order to reduce the latencies we are seeing on our cluster.
We have a use case where we need to ingest data that is nested in nature and continuously changing at the source.
Since nested queries are expensive, as per the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html),
we decided to convert each nested document into flat documents for indexing. Example below:
Nested document:
{
  "restaurantId": 1,
  "name": "example",
  "shift": [
    {
      "start": "2020-05-01T08:00:00",
      "end": "2020-05-01T10:00:00",
      "person": "A"
    },
    {
      "start": "2020-05-01T10:00:00",
      "end": "2020-05-01T12:00:00",
      "person": "B"
    }
  ]
}
Flat documents:
{
  "restaurantId": 1,
  "name": "example",
  "shift_start": "2020-05-01T08:00:00",
  "shift_end": "2020-05-01T10:00:00",
  "person": "A"
}
{
  "restaurantId": 1,
  "name": "example",
  "shift_start": "2020-05-01T10:00:00",
  "shift_end": "2020-05-01T12:00:00",
  "person": "B"
}
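For illustration, a minimal mapping for the flat documents could look like the sketch below; the index name restaurants and the exact field types are assumptions for the example, not our exact settings.

PUT restaurants
{
  "mappings": {
    "properties": {
      "restaurantId": { "type": "integer" },
      "name":         { "type": "keyword" },
      "shift_start":  { "type": "date" },
      "shift_end":    { "type": "date" },
      "person":       { "type": "keyword" }
    }
  }
}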
This flattening presents two challenges: the first at the ingestion layer and the second at the querying layer.
Ingestion layer: deleting and reindexing documents rather than simply updating a document based on the _id field
Since the data is continuously changing at the source, we update our index every 30 seconds based on changes at the document level. Rather than using PUT to update a document, we have to delete and re-insert documents, because we don't know the unique _id before a document is ingested into Elasticsearch. For example, if the restaurant name changes to example2 for restaurantId 1, we first delete all documents related to restaurantId 1 using delete by query and then post new documents with the updated information, roughly as sketched below.
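To be concrete, the flow for that example looks roughly like the following two calls (the index name restaurants is illustrative):

POST restaurants/_delete_by_query
{
  "query": {
    "term": { "restaurantId": 1 }
  }
}

POST restaurants/_doc
{
  "restaurantId": 1,
  "name": "example2",
  "shift_start": "2020-05-01T08:00:00",
  "shift_end": "2020-05-01T10:00:00",
  "person": "A"
}

(with one POST per flattened shift document).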
Querying layer: uniquely identifying a restaurant when the flattened attributes are only used for filtering and are not exposed to the client
We use the flattened attributes (shift_start, shift_end, person) only for filtering; they are not returned as part of the response to client queries.
For example, if I search the documents above with the query "give me all records where shift_start is greater than 2020-05-01T00:00:00", Elasticsearch will return both documents.
To the client, however, that would mean returning two records with the same attribute combination (restaurantId and name). To deal with this we decided to use a composite aggregation to group results by restaurantId and name, roughly as in the sketch below.
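A minimal sketch of such a query, assuming the illustrative index name restaurants and the field names above:

GET restaurants/_search
{
  "size": 0,
  "query": {
    "range": { "shift_start": { "gt": "2020-05-01T00:00:00" } }
  },
  "aggs": {
    "unique_restaurants": {
      "composite": {
        "size": 1000,
        "sources": [
          { "restaurantId": { "terms": { "field": "restaurantId" } } },
          { "name": { "terms": { "field": "name" } } }
        ]
      }
    }
  }
}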
Cluster metrics:
Index size: 6 GiB
Number of documents: 18,743,175
Shards (automatically calculated by elasticsearch): 22 shards (11 primary, 11 replicas)
Number of data nodes: 3
Number of dedicated master nodes: 3
Instance type: r5.large.elasticsearch
Data nodes storage type: EBS
EBS volume type: General Purpose (SSD)
EBS volume size: 50 GiB
With the above setup, ingestion of around 100,000 documents takes around 17 seconds, and querying the data using the composite aggregation takes more than 3 seconds when the query has more filters and more than 20 seconds for a match-all (*) query.
What we have done so far to optimize the above setup:
- As per the AWS documentation (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html#aes-bp-sharding), the recommended shard size is between 10 GB and 50 GB. Following this guideline, I reindexed the data into a single primary shard, since our data size is around 6 GiB, but that didn't help either (see the reindex sketch after this list).
- One more thing I noticed: disabling the index update job resulted in faster response times (except for the initial queries) on both indexes (2 shards and 22 shards).
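For reference, the single-shard experiment above was done roughly as follows; the index names restaurants and restaurants_1shard are illustrative.

PUT restaurants_1shard
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

POST _reindex
{
  "source": { "index": "restaurants" },
  "dest":   { "index": "restaurants_1shard" }
}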
Considering the above scenario and data requirements, do you have any suggestions on how we can improve our query latencies? We are targeting query latencies under 2 seconds.
Thanks