For 50,000 records it takes 20 minutes. Why is it so slow? How do I increase the performance?
By the way, I tried using slices and various batch/scroll sizes, but neither had much impact.
The actual requirement is to update a field on 4 million records, which at this rate looks like it will take hours.
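For reference, this is roughly how I varied the slices and scroll size (a minimal sketch with the Python client; the index name, query, and script are placeholders, not my real ones):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run the update asynchronously so the request does not time out,
# splitting the work into parallel slices and using a larger scroll batch.
resp = es.update_by_query(
    index="my-index",                                    # placeholder index name
    body={
        "query": {"term": {"status": "pending"}},        # placeholder filter
        "script": {
            "source": "ctx._source.status = params.v",   # placeholder field update
            "params": {"v": "done"},
        },
    },
    slices="auto",             # one slice per shard
    scroll_size=5000,          # batch size per scroll request (default is 1000)
    conflicts="proceed",       # skip version conflicts instead of aborting
    wait_for_completion=False, # returns a task id instead of blocking
)
print(resp["task"])
```

Since the custom routing sends the query to a single shard, the extra slices did not seem to add much parallelism in practice.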
Elasticsearch version 7.17.
16 GB RAM and 300 GB storage. The index is defined with 3 shards and holds 100 GB of data, and because custom routing is used, the query is processed on only one shard.
There is one level of nested fields, but the query used is not a nested query, and the field to be updated is not a nested field either.
That does not really matter. Nested documents are stored as a collection of documents behind the scenes and all of these are reindexed on any update, which adds overhead. The reason for this is that Lucene uses immutable segments, and all related nested documents need to reside in the same segment.
Updating a lot of documents using update-by-query can therefore result in a lot of disk I/O. If you have slow storage this can become a bottleneck. What type of storage are you using? Local SSD? Have you monitored await and disk utilisation while you run the update-by-query?
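One way to keep an eye on progress while you watch `iostat -x` (await, %util) on the data node is to run the update-by-query with wait_for_completion=false and poll the tasks API. A minimal sketch with the Python client, where the task id is just a placeholder for whatever your own request returned:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Placeholder: use the task id returned by your own _update_by_query call
# submitted with wait_for_completion=false.
task_id = "node-id:12345"

# Poll the task while watching disk statistics on the data node.
while True:
    task = es.tasks.get(task_id=task_id)
    status = task["task"]["status"]
    print(f"updated {status['updated']} / {status['total']} docs "
          f"in {status['batches']} batches")
    if task["completed"]:
        break
    time.sleep(30)
```

If await climbs and disk utilisation sits near 100% during the run, the storage is the bottleneck rather than Elasticsearch itself.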
Given that the query filters only parent documents, not nested documents, and only 50,000 records are being updated, will it rewrite the complete documents, including the nested documents? If yes, is there a way to skip rewriting the complete documents? Is there a way to update just that one field?
No, there is no way to skip that. All related nested documents need to be in the same segment, and as segments are immutable, all related documents must be written to the new segment where the update is written.
No. As Elasticsearch and Lucene rely on immutable segments, in-place updates of a single field are not possible.