Huge performance degradation during bulk indexing

Details about our usage:

  • We use ElasticSearch purely as an aggregation engine. Aggregations are almost always done across a limited time range.
  • We use a single index with about 200 million time-based documents totaling 377 gigabytes of primary storage (~2kb average document size). These are spread across 120 shards using default routing with a replication factor of 2
  • The above sits on 12 r4.xlarge data nodes with 3 r4.large master nodes
  • Every night, we recreate our index from our source of truth data stores
  • On an hourly basis, we run a bulk indexing job against the cluster which indexes ~400k additional documents (some may be updates, some may be new documents). All of these documents fall within a small window of time--the past two weeks. While we are indexing, we turn off index refreshes by setting refresh_interval to -1 (see the example request after this list)
  • Those are the only two ways that data is indexed into the cluster
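
For reference, this is roughly the settings call we make around the hourly job ("our-index" is a placeholder for our real index name). Refreshes are disabled before the bulk indexing starts:

# "our-index" is a placeholder for our real index name
PUT /our-index/_settings
{
  "index": { "refresh_interval": "-1" }
}

and turned back on afterwards, e.g. to the Elasticsearch default of 1s:

PUT /our-index/_settings
{
  "index": { "refresh_interval": "1s" }
}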

The problem:

  • During and after our hourly job we see a huge spike in search latency, and requests to the cluster that should normally take a second or less start taking 10 seconds or more to return a response

I've attached a few graphs that illustrate the behavior we see--big spikes in CPU, search latency, and response times. Happy to provide any more metrics as needed.

We've been struggling with this issue for some time, and tried:

  • reducing the number of documents we index (down to 400k from 2 million)
  • throttling the number of concurrent indexing processes (down to 1 from 32)
  • trying to smooth out the indexing rate using a queue (latency spikes were even worse*)

*we used a queue to smooth out the indexing rate while leaving the refresh interval at its default. The results were not promising at all, so we did not try the same approach with refreshes disabled

One data point that we have is that on our much smaller staging cluster (only ~5 million documents, 3 data nodes of the same size as above, 20 shards instead of 120), it seems like we can reindex all of the documents at once and barely see any spike in latency at all. This has made me believe that splitting up the index into multiple indices and/or using custom routing may be a way to address this. The three main ideas we've discussed:

  • Hot/cold indices - one index containing data from the past month (the only index we would write to during the hourly indexing job), and a second index for everything older
  • Monthly indices - have one index per month
  • Using month as a routing key - we would likely also have to set index.routing_partition_size to at least 3 or 4 in this case, so that each routing value is spread over several shards and data stays relatively uniformly distributed across the cluster (a sketch of this is below)
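
A minimal sketch of what the routing-partition option might look like at index-creation time (the index name and values are illustrative, not something we have settled on):

# "documents-routed" is a hypothetical index name; 4 is an example partition size
PUT /documents-routed
{
  "settings": {
    "index.number_of_shards": 120,
    "index.routing_partition_size": 4
  }
}

With a routing partition in place, the mapping also has to mark the _routing field as required; every indexing request then supplies ?routing=<month>, and searches can pass the same routing value to hit only the relevant shards.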

The hypothesis here is that if we minimize the quantity of underlying data in the shards we "touch", the performance issues will be solved.

The issue is that without a good understanding of what is happening inside Elasticsearch to cause this, we don't know which of these approaches would be best--or whether any of them would work at all. If anyone has any insight, it would be greatly appreciated. Happy to provide any more data as needed.

If you are using gp2 EBS volumes I would recommend looking at disk I/O and iowait, as heavy indexing can result in I/O-intensive merging that depletes your I/O burst credits.

Thanks for the reply! I have a couple of follow-up questions if you don't mind: what sort of things affect disk I/O usage? The number of shards you're indexing into, the total size of those shards, the number of documents being indexed? Intuitively I would guess that each of these has some impact, but is one of them typically the driver of high I/O?

There are definitely spikes in the I/O throughput (I can't immediately figure out how to look at iowait), but they don't seem to explain the spikes in search latency. For example, during our nightly job when we are reindexing the entire dataset in the background, the read and write throughput are both much higher and yet we don't experience the same latency spikes. That's not to say I/O isn't the problem, just that it isn't immediately obvious, or it may only be a contributing factor.

Of course it's the amount of data you're reading and writing that drives I/O, but the relationship is pretty complicated. I don't know that the number of shards or their total size would matter that much, but broadly speaking if you write twice as many documents then that's twice as much data to get onto the disk. Searches also result in disk reads if they miss the page cache.

Since you are seeing latency spikes continue even after indexing has finished, it's possible that there's still ongoing I/O due to merges, which happen in the background after indexing:

GET /_nodes/stats?filter_path=nodes.*.indices.merges

Updates in particular might be hurting here. An update involves deleting the document in its original segment and then indexing the new version into a new segment, and then eventually the original segment accumulates enough deletes and is merged (i.e. rewritten). Any segments in a single shard can be merged, and since you only have a single index your segments will contain documents with a wide variety of dates, only some of which will be subject to updates.
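
If you want to see how many deleted documents are accumulating in the segments of that index before merges rewrite them, something like the cat segments API should show it:

# "your-index" is a placeholder for the real index name
GET /_cat/segments/your-index?v&h=index,shard,prirep,segment,docs.count,docs.deleted,size

A high docs.deleted relative to docs.count in large segments would be consistent with updates driving a lot of merge I/O.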

I think it would be worth considering moving to time-based indices so that you're not mixing the static documents with the updated ones so much, which would avoid a lot of the rewriting of old documents during merges. I can't promise this will help, but it seems plausible. It might also mean that you could avoid recreating the whole dataset on a nightly basis, and it could speed up searches by skipping any shards that don't contain data within the target time range.
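
As an illustration of the search side (the index naming and timestamp field are hypothetical, just to show the idea), a query against monthly indices could target only the relevant months with a wildcard and a range filter:

# "docs-2019.*" and "timestamp" are placeholder names
GET /docs-2019.*/_search
{
  "query": {
    "range": { "timestamp": { "gte": "now-2w" } }
  }
}

Shards belonging to indices that fall outside the pattern are never touched at all.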

I also think your shards sound quite small. 377GB across 120 shards is an average of ~3GB per shard, whereas the usual recommendation is closer to 20-40GB per shard. Have you benchmarked your system to determine that 120 shards performs better for your case than the recommended sizing?
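
If you haven't already, the cat shards API is a quick way to check the actual on-disk size of each shard:

# "your-index" is a placeholder for the real index name
GET /_cat/shards/your-index?v&h=index,shard,prirep,store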

Addendum: although it doesn't actually fix the problem so much as hide it, adaptive replica selection may be useful for you.
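
For completeness, adaptive replica selection is a dynamic cluster setting, so on a cluster where you control the settings it can be switched on with something like:

# enables adaptive replica selection until the next full cluster restart
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}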

That's great information, thank you! For the time being, unfortunately, switching to purely time-based indices is not very straightforward because our denormalized documents are themselves composed of multiple mutable objects, so there are almost no documents that we can index and forget about. We would like to build a solution that does not require us to rely on batch processing to populate our index, but for the medium term we'd just like to get what we have working well if possible.

That said, since we reindex everything nightly, we can change how our index is structured or split it into several indices quite easily. I have tested a solution using two indices--one for "hot" documents that might be updated and another for "cold" ones that will not be (sketched below). This seemed to improve performance, but I could not come up with a good explanation as to why. Could it be that this simply reduces the total number of merges that need to be done?
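
A minimal sketch of one way to wire this up (the index and alias names here are placeholders, not necessarily what we used): searches go through a single alias that spans both indices, while the hourly job writes only to the hot index.

# "docs-hot", "docs-cold" and "docs-search" are placeholder names
POST /_aliases
{
  "actions": [
    { "add": { "index": "docs-hot",  "alias": "docs-search" } },
    { "add": { "index": "docs-cold", "alias": "docs-search" } }
  ]
}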

Even so, I haven't really been able to pin down what the actual bottleneck is. Based on the documentation I can find, we should have 106 MB/s of throughput and a limit of 6,000 IOPS; at the peaks our usage seems to be closer to ~20 MB/s and under 2,000 IOPS. At a higher level, neither the IOPS nor the throughput graphs have the plateau-like shape I would expect to see if there were a clear bottleneck there.

Also, interestingly, the latency spikes do not always seem to correspond with merges. In many cases they do, but, for example, in the graph below the worst spike occurs separately from any merges:

As for the number of shards, we haven't done a lot to optimize on that front. I am aware that our shards are a bit smaller than recommended, but I tried adjusting this value in our staging environment and it didn't seem to make a big difference with regard to this particular issue. Is there some particular benefit you can think of to larger shards for this issue? From my understanding it may improve our search performance and is probably worth doing anyway, but I'm just curious.

Regarding ARS - I had actually just found out about that and was excited to try it, but then learned it is not yet supported by the AWS Elasticsearch Service :frowning_face:

Thanks so much for your patience and help!!

Yes, that is indeed interesting. I still suspect I/O, although it may not be your I/O. Still, it is also interesting that these spikes seem to be correlated across nodes.

Not particularly, I just wanted to check that you'd done some measurements here. It seems that you have.

Ah ok, you're using AWS Elasticsearch, hence the difficulty in getting iowait stats. r4.xlarge instances only have EBS volumes - can you see any metrics about wait times/bandwidth/burst credits for these volumes that correlate with your latency spikes? Can you try using local SSDs (e.g. upgrade to r5d.xlarge) to rule out EBS as a source of these issues?

For anyone else who comes across this thread: switching to two indices and only writing to the smaller one on an hourly basis more or less solved this problem for us. It may well have been the I/O that was slowing us down, though it's not entirely clear--once we made the change, IOPS usage and merges both went down significantly, so that's still plausible. We haven't been able to positively confirm one specific explanation, but the general lesson that splitting a large index into smaller ones can be an effective way to combat this kind of issue was interesting in itself.

Thanks @DavidTurner and @Christian_Dahlqvist for your help!
