Team,
I am reaching out to you regarding best practices for indexing and querying Elasticsearch, in order to reduce the latencies we are seeing on our cluster.
We have a use case where we need to ingest data that is nested in nature and continuously changing at the source.
Since nested queries are expensive, as per the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html),
we decided to convert each nested document into flat documents for indexing. Example below:
Nested document:
{
  "restaurantId": 1,
  "name": "example",
  "shift": [
    {
      "start": "2020-05-01T08:00:00",
      "end": "2020-05-01T10:00:00",
      "person": "A"
    },
    {
      "start": "2020-05-01T10:00:00",
      "end": "2020-05-01T12:00:00",
      "person": "B"
    }
  ]
}
Flat documents:
{
  "restaurantId": 1,
  "name": "example",
  "shift_start": "2020-05-01T08:00:00",
  "shift_end": "2020-05-01T10:00:00",
  "person": "A"
}
{
  "restaurantId": 1,
  "name": "example",
  "shift_start": "2020-05-01T10:00:00",
  "shift_end": "2020-05-01T12:00:00",
  "person": "B"
}
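For illustration, a minimal mapping for the flat documents could look like the sketch below; the index name restaurants and the exact field types are assumptions for the example, not our exact settings.

PUT restaurants
{
  "mappings": {
    "properties": {
      "restaurantId": { "type": "integer" },
      "name":         { "type": "keyword" },
      "shift_start":  { "type": "date" },
      "shift_end":    { "type": "date" },
      "person":       { "type": "keyword" }
    }
  }
}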
This flattening presents two challenges: the first at the ingestion layer and the second at the querying layer.
Ingestion layer: deleting and reindexing documents rather than simply updating a document based on the _id field
Since the data is continuously changing at the source, we update our index every 30 seconds based on changes at the document level. Rather than using PUT to update a document, we have to delete and re-insert documents, because we don't know the unique _id before a document is ingested into Elasticsearch. For example, if the restaurant name changes to example2 for restaurantId 1, we first delete all documents related to restaurantId 1 using delete by query and then post new documents with the updated information, roughly as sketched below.
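To be concrete, the flow for that example looks roughly like the following two calls (the index name restaurants is illustrative):

POST restaurants/_delete_by_query
{
  "query": {
    "term": { "restaurantId": 1 }
  }
}

POST restaurants/_doc
{
  "restaurantId": 1,
  "name": "example2",
  "shift_start": "2020-05-01T08:00:00",
  "shift_end": "2020-05-01T10:00:00",
  "person": "A"
}

(with one POST per flattened shift document).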
Querying layer: uniquely identifying a restaurant when the flattened attributes are only used for filtering and are not exposed to the client
We use the flattened attributes (shift_start, shift_end, person) only for filtering; they are not returned as part of the response to client queries.
For example, if I search the documents above with the query "give me all records where shift_start is greater than 2020-05-01T00:00:00", Elasticsearch will return both documents.
To the client, however, that would mean returning two records with the same attribute combination (restaurantId and name). To deal with this we decided to use a composite aggregation to group results by restaurantId and name, roughly as in the sketch below.
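A minimal sketch of such a query, assuming the illustrative index name restaurants and the field names above:

GET restaurants/_search
{
  "size": 0,
  "query": {
    "range": { "shift_start": { "gt": "2020-05-01T00:00:00" } }
  },
  "aggs": {
    "unique_restaurants": {
      "composite": {
        "size": 1000,
        "sources": [
          { "restaurantId": { "terms": { "field": "restaurantId" } } },
          { "name": { "terms": { "field": "name" } } }
        ]
      }
    }
  }
}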
Cluster metrics:
Index size: 6 GiB
Number of documents: 18,743,175
Shards (automatically calculated by elasticsearch): 22 shards (11 primary, 11 replicas)
Number of data nodes: 3
Number of dedicated master nodes: 3
Instance type: r5.large.elasticsearch
Data nodes storage type: EBS
EBS volume type: General Purpose (SSD)
EBS volume size: 50 GiB
With the above setup, ingestion of around 100,000 documents takes around 17 seconds, and querying the data using the composite aggregation takes more than 3 seconds when the query has more filters and more than 20 seconds for a match-all (*) query.
What we have done so far to optimize the above setup:
- As per the AWS documentation (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html#aes-bp-sharding), the recommended shard size is between 10 GB and 50 GB. Following this guideline, I reindexed the data into a single primary shard, since our data size is around 6 GiB, but that didn't help either (see the reindex sketch after this list).
- One more thing I noticed: disabling the index update job resulted in faster response times (except for the initial queries) on both indexes (2 shards and 22 shards).
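For reference, the single-shard experiment above was done roughly as follows; the index names restaurants and restaurants_1shard are illustrative.

PUT restaurants_1shard
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

POST _reindex
{
  "source": { "index": "restaurants" },
  "dest":   { "index": "restaurants_1shard" }
}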
Considering the above scenario and data requirements, do you have any suggestions on how we can improve our query latencies? We are targeting query latencies under 2 seconds.
Thanks