ES: Indexing Rate dropping significantly while Indexing Latency staying about the same

Hi,

I'm indexing ~140 GB of data via the bulk API on a managed AWS instance. I'm noticing that the indexing rate has dropped to roughly 30% of what it started at, while the indexing latency has stayed about the same.

I'm using EBS SSD volumes as the backing store, on 2 nodes with 64 GB of memory each. CPU and memory usage seem to be at pretty normal levels as well:

Any ideas on what could be throttling this? Does AWS maybe throttle inbound traffic or writes to EBS?

The larger your shards get, the larger the segments that need to be merged, which leads to more disk I/O. If you are using gp2 EBS volumes, these have a fixed baseline level of IOPS that you will fall back to once you have depleted your IOPS burst bucket. I am therefore not surprised that you are seeing the indexing rate drop over time, unless you are using EBS volumes with sufficient provisioned IOPS.
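
To make the burst-bucket mechanics concrete, here is a rough back-of-the-envelope calculation using the published gp2 numbers (3 IOPS per GiB baseline with a 100 IOPS floor, a 3,000 IOPS burst ceiling, and a 5.4 million I/O credit bucket); the 500 GiB volume size is just an illustrative assumption:

```python
# Rough gp2 burst math for an illustrative 500 GiB volume
# (constants taken from the AWS gp2 documentation).
volume_gib = 500
baseline_iops = max(100, 3 * volume_gib)   # 3 IOPS per GiB, 100 IOPS floor -> 1500
burst_iops = 3000                          # burst ceiling for volumes under 1 TiB
credit_bucket = 5_400_000                  # initial/maximum I/O credit balance

# While bursting, credits drain at (burst - baseline) per second.
drain_rate = burst_iops - baseline_iops    # 1500 credits/s
seconds_at_full_burst = credit_bucket / drain_rate
print(f"~{seconds_at_full_burst / 3600:.1f} h of sustained full burst")  # ~1.0 h
```

Once the bucket is empty, the volume is limited to its baseline IOPS until credits accumulate again.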

Gotcha, that makes sense. I found this little bit of info in the AWS docs

An important thing to note is that for any gp2 volume larger than 1 TiB, the baseline performance is greater than the burst performance.

I'm currently using 500 GB gp2 volumes, so I'm bumping them up to 1 TB to see if there is a significant improvement.

Also, for a large batch job like this where the data isn't needed in real time, increasing index.refresh_interval should help as well, right?

Yes, those should both help, as allocated IOPS is proportional to volume size. Have a look at the docs for further suggestions.
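
In case it's useful, here is a minimal sketch of applying those settings around the bulk load with the official Python client (8.x-style API); the index name and endpoint are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Before the bulk load: no periodic refresh, no replicas to keep in sync.
es.indices.put_settings(
    index="my-index",
    settings={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

# ... run the bulk ingest here ...

# After the load: restore the defaults and force a refresh so the data is searchable.
es.indices.put_settings(
    index="my-index",
    settings={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
es.indices.refresh(index="my-index")
```

Remember to restore refresh_interval and replicas afterwards; with refresh disabled, newly indexed data won't show up in searches.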

Thanks for the quick reply! Something I just thought of: if the issue is with the burst IOPS, shouldn't the indexing rate show a sharper drop rather than a gradual one? This occurred to me while looking at this AWS documentation: https://aws.amazon.com/blogs/database/understanding-burst-vs-baseline-performance-with-amazon-rds-and-gp2/

If you started indexing into an empty index, the merging activity will also change over time, so I am not sure that is the case. Do you have any monitoring of disk I/O and/or iowait over time? You should also be able to see from the index stats whether merging is being throttled or not.
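
For reference, pulling the merge throttling numbers with the Python client might look like this (index name and endpoint are placeholders; the relevant field is total_throttled_time_in_millis in the merge stats):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

stats = es.indices.stats(index="my-index", metric="merge")
merges = stats["indices"]["my-index"]["primaries"]["merges"]
print("merges in progress:  ", merges["current"])
print("throttled time (ms): ", merges["total_throttled_time_in_millis"])
```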

If you are specifying your own document IDs, the indexing rate is also likely to drop as the index grows.

Yes, I started by indexing into an empty index. Here are some more metrics from the beginning of the ingest:

We are in fact specifying our own document IDs, but we need to do this to avoid potential duplication of our data in Elasticsearch. If there is a different way to deal with duplicate data, I'd be more than happy to switch to using Elasticsearch-generated doc IDs.

If you specify your own IDs, which as far as I know is generally required for deduplication, each insert is in fact a potential update, and Elasticsearch must check whether the ID already exists. This slows down indexing over time as the number of documents to check against grows. If you can sort your documents lexically by ID before indexing, you may see some boost (or at least a slower deterioration of performance - see this blog post for further details). Faster storage will also help.
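
As a rough illustration of the sorting idea, assuming the documents are available in memory as (id, source) pairs and using the bulk helper from the Python client (index name, IDs, and fields are all placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

docs = [
    ("user-42:2019-03-01", {"field": "value"}),
    ("user-07:2019-03-01", {"field": "value"}),
]

# Lexical sort on the ID so each bulk request hits the index in ID order.
docs.sort(key=lambda pair: pair[0])

actions = (
    {"_op_type": "index", "_index": "my-index", "_id": doc_id, "_source": source}
    for doc_id, source in docs
)
helpers.bulk(es, actions)
```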

Just as an update, I retried the batch ingest with a 1 TB volume (so burst limits shouldn't be a factor) and also with the following settings:

"number_of_replicas" : 0,
"refresh_interval": -1

and we're seeing the same pattern of decrease in index rate.

Another update: using auto-generated IDs fixed the issue. Thanks for your help!
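
For anyone landing here later, the change presumably amounts to omitting _id from the bulk actions so Elasticsearch generates one itself; a minimal sketch with placeholder names:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

docs = [{"field": "value-1"}, {"field": "value-2"}]  # placeholder documents

# No "_id" in the action: Elasticsearch auto-generates one, so each bulk item
# is a pure append and no "does this ID already exist?" lookup is needed.
actions = ({"_op_type": "index", "_index": "my-index", "_source": d} for d in docs)
helpers.bulk(es, actions)
```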
