I'm indexing ~140 GB of data via the bulk API on a managed AWS instance. I'm noticing the Indexing rate is almost at 30% of what it started at, while the indexing latency is staying the same.
The larger shards get, the larger the segments that need to be merged become, which leads to more disk I/O. If you are using gp2 EBS, you have a fixed baseline level of IOPS that you will drop to once you have depleted your IOPS burst bucket. I am therefore not surprised you are seeing the indexing rate drop over time, unless you are using EBS volumes with sufficient provisioned IOPS (PIOPS).
If you started indexing into an empty index, the merging activity will also change over time, so I am not sure that is necessarily the cause. Do you have any monitoring of disk I/O and/or iowait over time? You should also be able to see from the index stats whether merging is being throttled.
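As a minimal sketch of that last check, the snippet below pulls the merge section of the index stats with the official Python client and prints the throttled-merge time. The endpoint URL and the index name "my-index" are placeholders for your actual setup, not something from this thread.

```python
# Check whether merging is being throttled via the index stats API.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # hypothetical endpoint

stats = es.indices.stats(index="my-index", metric="merges")
merges = stats["indices"]["my-index"]["primaries"]["merges"]

print("current merges:            ", merges["current"])
print("throttled time (ms):       ", merges["total_throttled_time_in_millis"])
print("total merged size (bytes): ", merges["total_size_in_bytes"])
```

A steadily growing `total_throttled_time_in_millis` while the indexing rate falls is a good hint that merging (and the disk behind it) is the bottleneck.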
If you are specifying your own document IDs the indexing rate is also likely to drop as the index grows.
We are in fact specifying our own document IDs, but we need to do this to avoid potential duplication of our data in Elasticsearch. If there is a different way to deal with duplicate data, I'd be more than happy to switch to using ES-generated doc IDs.
If you specify your own IDs, which as far as I know is generally required for deduplication, each insert is in fact a potential update, and Elasticsearch must check whether the ID already exists. This slows down indexing over time as the number of documents to check against grows. If you can sort your documents lexically by ID before indexing, you may see some improvement (or at least a slower deterioration of performance; see this blog post for further details). Faster storage will also help.
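As a rough sketch of that suggestion, the snippet below sorts documents by their custom ID before sending them through the bulk helper. The field name "my_id", the index name and the endpoint are assumptions for illustration only.

```python
# Sort documents lexically by their custom ID before bulk indexing.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # hypothetical endpoint

def actions(docs):
    # Sorting by the externally supplied ID keeps ID lookups during
    # indexing in a more predictable, cache-friendly order.
    for doc in sorted(docs, key=lambda d: d["my_id"]):
        yield {
            "_index": "my-index",
            "_id": doc["my_id"],  # custom ID used for deduplication
            "_source": doc,
        }

docs = [{"my_id": "b-002", "value": 2}, {"my_id": "a-001", "value": 1}]
success, errors = helpers.bulk(es, actions(docs), raise_on_error=False)
print(f"indexed {success} docs, {len(errors)} errors")
```

This only helps within each bulk batch, of course; the underlying cost of the ID existence check still grows with the index.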