Bulk Indexing performance on AWS ES service

Hi

I perform a massive bulk indexing job daily, and it seems that no matter what I do, my cluster nodes quickly hit 100% CPU for hours, which causes my searches to perform badly.

My configuration is as follows:

  1. AWS ES cluster with 5 m4.2xlarge.elasticsearch instances, 500GB SSD storage, and 3000 provisioned IOPS.
  2. My index has 5 shards and is approximately 200GB.
  3. My document type contains, in addition to basic fields, an array of nested documents.
  4. Each day I need to index ~11,000,000 nested documents into ~700,000 parent documents.
  5. Indexing is done daily using bulk update requests, each with 3000 items, executed over the next 20 hours (a request every ~6 minutes).
  6. Each update request contains a script that adds new items to an existing array (a sketch of one such update action is shown after this list).
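
For concreteness, each bulk item looks roughly like this. This is a minimal sketch using the elasticsearch-py client; the endpoint, the names `myindex`, `mydoc`, `events`, and the payload are assumptions for illustration, not my actual schema:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # hypothetical endpoint

def make_update_action(doc_id, new_items):
    # One scripted update per parent document: the Painless script appends
    # the new nested objects to the existing 'events' array (assumes the
    # array already exists on the document).
    return {
        "_op_type": "update",
        "_index": "myindex",   # assumed index name
        "_type": "mydoc",      # assumed mapping type (ES 5.x still has types)
        "_id": doc_id,
        "script": {
            "lang": "painless",
            "inline": "ctx._source.events.addAll(params.new_items)",  # ES 5.x uses 'inline'
            "params": {"new_items": new_items},
        },
    }

# A batch of ~3000 such actions goes out in a single bulk request.
batch = {"doc-1": [{"tag": "abc", "ts": 1496306400000}]}  # hypothetical payload
actions = [make_update_action(doc_id, items) for doc_id, items in batch.items()]
helpers.bulk(es, actions)
```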

Is there anything that can be done to reduce the CPU load?
Is my indexing process correct?

Thanks,
Michael

It seems like you are updating the same documents repeatedly during this bulk indexing run. Each update of a document with nested objects results in multiple Lucene documents being reindexed behind the scenes, which causes a lot of merging activity. I suspect you would get better performance if you could aggregate the updates per document prior to indexing, so that you end up performing a single larger update per document instead of multiple small ones; see the grouping sketch below.
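
In code, that pre-aggregation could be as simple as grouping the incoming items by document ID before building the bulk actions. A sketch, where `incoming` and the item fields are placeholders:

```python
from collections import defaultdict

# 'incoming' is a placeholder: an iterable of (doc_id, nested_item) pairs
# from the daily feed.
def group_updates(incoming):
    per_doc = defaultdict(list)
    for doc_id, item in incoming:
        per_doc[doc_id].append(item)
    return per_doc  # one entry, and hence one scripted update, per document

# Example: three feed items collapse into two update actions.
incoming = [("doc-1", {"ts": 1, "tag": "a"}),
            ("doc-1", {"ts": 2, "tag": "b"}),
            ("doc-2", {"ts": 3, "tag": "c"})]
print({k: len(v) for k, v in group_updates(incoming).items()})
# {'doc-1': 2, 'doc-2': 1}
```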

Thanks for your reply.

This is what I do. Each of my update actions refers to a single, unique document. In total I have ~700,000 actions, each containing a script that adds new nested documents to an existing array.

How large are the documents? How many nested objects/levels? How complex are the scripts performing the update? Which version of Elasticsearch are you using?

  1. 1 nesting level.
  2. The nested documents are fairly small: each contains a string field of up to 24 bytes and a long field (roughly the mapping sketched after this list).
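
For reference, a mapping along these lines would match that description; the index, type, and field names here are assumptions, not my real schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # hypothetical endpoint

es.indices.create(index="myindex", body={
    "mappings": {
        "mydoc": {
            "properties": {
                "name": {"type": "keyword"},        # example basic field
                "events": {
                    "type": "nested",               # the single nested level
                    "properties": {
                        "tag": {"type": "keyword"}, # the <=24-byte string field
                        "ts":  {"type": "long"},    # the long field
                    },
                },
            },
        },
    },
})
```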

Which version of Elasticsearch are you using? How many nested documents do you have on average per main document?

  1. Version 5.3 (as provided by AWS).
  2. Since I have ~11,000,000 nested items to add, on average I add ~16 items to each document daily. The total length of the nested document array can grow quite large for each document.

What is the output from the cat indices API for this index?

```
health status index   uuid                   pri rep docs.count doc.deleted store.size pri.store.size
green  open   myindex cVraQ7tpS86GthbXIbzqJQ 5   1   1739504457 1886305592  410.5gb    206.4gb
```

It looks like each document has an average of over 2.4k nested objects (1,739,504,457 Lucene docs across ~700,000 top-level documents is roughly 2,485 per document). Each update of a document will therefore result in that many documents being reindexed behind the scenes. Have you considered switching to a denormalized, flat data model, or perhaps a parent-child relationship? Either would turn these very expensive updates into cheap inserts; a sketch of the parent-child variant is below.
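
In ES 5.x the parent-child variant would look roughly like the sketch below; the type and field names are assumptions carried over from the earlier sketches. Each daily item becomes an insert of a small child document instead of an update that reindexes a huge parent:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # hypothetical endpoint

# ES 5.x parent-child: the child type declares its parent in the mapping,
# and this must be set up at index creation time.
es.indices.create(index="myindex_pc", body={
    "mappings": {
        "mydoc": {},                        # parent type
        "event": {
            "_parent": {"type": "mydoc"},   # join to the parent type
            "properties": {
                "tag": {"type": "keyword"},
                "ts":  {"type": "long"},
            },
        },
    },
})

# Each daily item is now a cheap insert routed to its parent document,
# rather than a scripted update that reindexes every nested object.
es.index(index="myindex_pc", doc_type="event", parent="doc-1",
         body={"tag": "abc", "ts": 1496306400000})
```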

I was not aware that adding new nested objects results in all the existing nested objects being reindexed as well.
The nested documents are time-based, and I need to perform aggregation queries on the main documents given a specific time frame and the values of the nested documents.
I will consider moving to a parent-child relationship if I can achieve the same.

This behaviour of nested documents is described in Elasticsearch: The Definitive Guide. Its section on data modelling is very useful, and even though it still references ES 2.x, most of it is, as far as I know, still valid.
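
As for achieving the same aggregations: with parent-child you can select parents by their child (time-based) documents with a has_child query and then hop back to the children with a children aggregation. A sketch against the hypothetical mapping above; the field names and time window are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # hypothetical endpoint

res = es.search(index="myindex_pc", doc_type="mydoc", body={
    "size": 0,
    "query": {
        # Keep only parents that have at least one event in the window.
        "has_child": {
            "type": "event",
            "query": {"range": {"ts": {"gte": 1496217600000,
                                       "lt": 1496304000000}}},
        },
    },
    "aggs": {
        "events": {
            "children": {"type": "event"},  # hop from parents to their children
            "aggs": {
                # The children agg sees ALL children of matching parents,
                # so re-apply the time window before aggregating.
                "in_window": {
                    "filter": {"range": {"ts": {"gte": 1496217600000,
                                                "lt": 1496304000000}}},
                    "aggs": {"by_tag": {"terms": {"field": "tag"}}},
                },
            },
        },
    },
})
```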

I did not see that; parent-child is definitely more suitable for my use case.
Thanks for your support.
