I am performing a massive bulk indexing job daily, and it seems that whatever I do, my cluster nodes quickly hit 100% CPU for hours, which causes my searches to perform badly.
My configuration is as follows:
AWS ES cluster with 5 m4.2xlarge.elasticsearch instances, 500GB SSD, 3000 provisioned IOPS.
My index has 5 shards and is approximately 200GB.
My document type contains, in addition to basic fields, an array of nested documents.
Each day I need to index ~11,000,000 nested documents into ~700,000 documents.
Indexing is done daily using bulk update requests, each with 3,000 items, spread over the next 20 hours (one request every ~6 minutes).
Each update contains a script for adding the new items to an existing array.
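For illustration, a bulk update of that shape looks roughly like this through the Python client; the index, type, field names and the script itself are simplified placeholders, and the exact script syntax depends on the Elasticsearch version:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Hypothetical example of one day's payload: (document id, new nested items) pairs.
batch = [
    ("doc-1", [{"ts": "2017-06-01T00:00:00", "value": 42}]),
]

# Build one bulk request of script-based updates: an action metadata line
# followed by a body line whose script appends the new items to the array.
# "myindex", "mydoc" and the "events" field are placeholders.
actions = []
for doc_id, new_items in batch:
    actions.append({"update": {"_index": "myindex", "_type": "mydoc", "_id": doc_id}})
    actions.append({
        "script": {
            "inline": "ctx._source.events.addAll(params.items)",  # "inline" becomes "source" on newer versions
            "lang": "painless",
            "params": {"items": new_items},
        },
    })

es.bulk(body=actions)
```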
Is there anything that can be done to reduce CPU load?
Is my indexing process correct?
It seems like you are updating the same documents repeatedly during this bulk indexing run. Each update of a document with nested objects results in multiple documents being reindexed behind the scenes, which causes a lot of merging activity. I suspect you would get better performance if you could 'aggregate' the updates per document prior to indexing, so that you end up performing a single larger update per document instead of multiple small ones.
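As a rough sketch of what I mean by aggregating (names invented), you could group the day's new items by document id first, so that each document is touched by exactly one scripted update:

```python
from collections import defaultdict

# Hypothetical stream of the day's new items as (document id, nested item) pairs.
incoming_items = [
    ("doc-1", {"ts": "2017-06-01T00:00:00", "value": 42}),
    ("doc-1", {"ts": "2017-06-01T01:00:00", "value": 7}),
]

# Group the items by parent document id so that each document receives a single
# scripted update per day instead of many small ones.
items_per_doc = defaultdict(list)
for doc_id, item in incoming_items:
    items_per_doc[doc_id].append(item)

# items_per_doc.items() can then drive the bulk requests, one update action per document.
```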
This is what I do. Each of my update requests refers to a single unique document. In total I have ~700,000 updates, where each contains a script to add new nested documents to an existing array.
How large are the documents? How many nested objects/levels? How complex are the scripts performing the update? Which version of Elasticsearch are you using?
Since I have ~11,000,000 nested items to add, on average I add ~16 items to each document daily. The total length of the nested document array can grow quite large for each document.
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open myindex cVraQ7tpS86GthbXIbzqJQ 5 1 1739504457 1886305592 410.5gb 206.4gb
It looks like each document has an average of over 2.4k nested objects (the docs.count of ~1.74 billion Lucene documents spread across ~700,000 top-level documents works out to roughly 2,480 nested objects per document). Each update of a document will therefore result in that many documents being reindexed behind the scenes. Have you considered switching to a denormalized, flat data model, or perhaps a parent-child relationship, as these would result in inserts rather than very expensive updates?
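For what it's worth, a parent-child version might look roughly like this with pre-6.x syntax (index, type and field names are invented; on 6.x+ this would be a join field instead, and on 2.x the field types differ slightly):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Hypothetical pre-6.x parent/child mapping: the time-based items become child
# documents of the main document, so adding items is a plain insert instead of
# reindexing the whole nested block. Index, type and field names are made up.
es.indices.create(index="myindex_pc", body={
    "mappings": {
        "mydoc": {
            "properties": {"name": {"type": "keyword"}},
        },
        "event": {
            "_parent": {"type": "mydoc"},
            "properties": {
                "ts": {"type": "date"},
                "value": {"type": "long"},
            },
        },
    },
})

# Adding a new item is now an insert of a child document rather than an update.
es.index(index="myindex_pc", doc_type="event", parent="doc-1",
         body={"ts": "2017-06-01T00:00:00", "value": 42})
```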
I was not aware that adding new nested objects results in an expensive update of all old nested objects.
The nested documents are time-based, and I need to perform aggregation queries on the main document given a specific time frame and values of the nested documents.
I will consider moving to a parent-child relationship if I can achieve the same.
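To make the requirement concrete, the queries I need look roughly like this against the nested mapping (field names simplified):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Aggregate over main documents whose nested items fall in a given time frame
# and match given values. "events", "ts", "value" and "category" are
# simplified placeholder field names.
res = es.search(index="myindex", body={
    "size": 0,
    "query": {
        "nested": {
            "path": "events",
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"events.ts": {"gte": "2017-06-01", "lt": "2017-07-01"}}},
                        {"term": {"events.value": 42}},
                    ],
                },
            },
        },
    },
    "aggs": {
        "by_category": {"terms": {"field": "category"}},
    },
})
```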
This behaviour of nested documents is described in Elasticsearch: The Definitive Guide. The section on data modelling is very useful, and even though it still references ES 2.x, most of it is, as far as I know, still valid.