I am performing a massive bulk indexing job daily, and it seems that whatever I do, my cluster nodes quickly hit 100% CPU for hours, which causes my searches to perform badly.
My configuration is as follows:
AWS ES cluster with 5 m4.2xlarge.elasticsearch instances, 500GB SSD, 3000 provisioned IOPS.
My index has 5 shards and is approximately 200GB.
My document type contains, in addition to basic fields, an array of nested documents.
Each day I need to index ~11,000,000 nested documents into ~700,000 documents.
Indexing is done daily using bulk update requests, each with 3,000 items, spread over the next 20 hours (one request every ~6 minutes).
Each update contains a script for adding the new items to an existing array.
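For illustration, a bulk update of that shape looks roughly like this through the Python client; the index, type, field names and the script itself are simplified placeholders, and the exact script syntax depends on the Elasticsearch version:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Hypothetical example of one day's payload: (document id, new nested items) pairs.
batch = [
    ("doc-1", [{"ts": "2017-06-01T00:00:00", "value": 42}]),
]

# Build one bulk request of script-based updates: an action metadata line
# followed by a body line whose script appends the new items to the array.
# "myindex", "mydoc" and the "events" field are placeholders.
actions = []
for doc_id, new_items in batch:
    actions.append({"update": {"_index": "myindex", "_type": "mydoc", "_id": doc_id}})
    actions.append({
        "script": {
            "inline": "ctx._source.events.addAll(params.items)",  # "inline" becomes "source" on newer versions
            "lang": "painless",
            "params": {"items": new_items},
        },
    })

es.bulk(body=actions)
```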
Is there anything that can be done to reduce CPU load?
Is my indexing process correct?
It seems like you are updating the same documents repeatedly during this bulk indexing run. Each update of a document with nested objects results in multiple documents being reindexed behind the scenes, which causes a lot of merging activity. I suspect you would get better performance if you could 'aggregate' the updates per document prior to indexing, so that you end up performing a single larger update per document instead of multiple small ones.
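As a rough sketch of what I mean by aggregating (names invented), you could group the day's new items by document id first, so that each document is touched by exactly one scripted update:

```python
from collections import defaultdict

# Hypothetical stream of the day's new items as (document id, nested item) pairs.
incoming_items = [
    ("doc-1", {"ts": "2017-06-01T00:00:00", "value": 42}),
    ("doc-1", {"ts": "2017-06-01T01:00:00", "value": 7}),
]

# Group the items by parent document id so that each document receives a single
# scripted update per day instead of many small ones.
items_per_doc = defaultdict(list)
for doc_id, item in incoming_items:
    items_per_doc[doc_id].append(item)

# items_per_doc.items() can then drive the bulk requests, one update action per document.
```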
This is what I do. Each of my update requests refers to a single unique document. In total I have ~700,000 updates, where each contains a script to add new nested documents to an existing array.
How large are the documents? How many nested objects/levels? How complex are the scripts performing the update? Which version of Elasticsearch are you using?
Since I have ~11,000,000 nested items to add, on average I add ~16 items to each document daily. The total length of the nested document array can grow quite large for each document.
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open myindex cVraQ7tpS86GthbXIbzqJQ 5 1 1739504457 1886305592 410.5gb 206.4gb
It looks like each document has an average of over 2.4k nested objects (the docs.count of ~1.74 billion Lucene documents spread across ~700,000 top-level documents works out to roughly 2,480 nested objects per document). Each update of a document will therefore result in that many documents being reindexed behind the scenes. Have you considered switching to a denormalized, flat data model, or perhaps a parent-child relationship, as these would result in inserts rather than very expensive updates?
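For what it's worth, a parent-child version might look roughly like this with pre-6.x syntax (index, type and field names are invented; on 6.x+ this would be a join field instead, and on 2.x the field types differ slightly):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Hypothetical pre-6.x parent/child mapping: the time-based items become child
# documents of the main document, so adding items is a plain insert instead of
# reindexing the whole nested block. Index, type and field names are made up.
es.indices.create(index="myindex_pc", body={
    "mappings": {
        "mydoc": {
            "properties": {"name": {"type": "keyword"}},
        },
        "event": {
            "_parent": {"type": "mydoc"},
            "properties": {
                "ts": {"type": "date"},
                "value": {"type": "long"},
            },
        },
    },
})

# Adding a new item is now an insert of a child document rather than an update.
es.index(index="myindex_pc", doc_type="event", parent="doc-1",
         body={"ts": "2017-06-01T00:00:00", "value": 42})
```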
I was not aware that adding new nested objects results in an expensive update of all old nested objects.
The nested documents are time-based, and I need to perform aggregation queries on the main document given a specific time frame and values of the nested documents.
I will consider moving to a parent-child relationship if I can achieve the same.
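To make the requirement concrete, the queries I need look roughly like this against the nested mapping (field names simplified):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Aggregate over main documents whose nested items fall in a given time frame
# and match given values. "events", "ts", "value" and "category" are
# simplified placeholder field names.
res = es.search(index="myindex", body={
    "size": 0,
    "query": {
        "nested": {
            "path": "events",
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"events.ts": {"gte": "2017-06-01", "lt": "2017-07-01"}}},
                        {"term": {"events.value": 42}},
                    ],
                },
            },
        },
    },
    "aggs": {
        "by_category": {"terms": {"field": "category"}},
    },
})
```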
This behaviour of nested documents is described in Elasticsearch: The Definitive Guide. The section on data modelling is very useful, and even though it still references ES 2.x, most of it is, as far as I know, still valid.