What would cause refresh=wait_for to regularly take 2-5 seconds?

Hi,

I am facing a bit of trouble with something we're trying to do in our system and was hoping someone could give me some advice!

We've been using Elasticsearch quite heavily in our architecture, both for text search and for aggregations for financial reporting. It works really well: the querying is super fast and flexible. Our basic architecture is that someone saves a document to our main data store, we put it on a stream (Kinesis), and then a consumer picks it up and indexes it in Elasticsearch.

A new feature we implemented required us to pre-calculate an aggregation so that we could sort by it. We have two related entities (invoices and customers), and we want to be able to sort customers by the balance of their outstanding invoices. To do this, we have to pre-calculate the outstanding balance and store it with the customer record in Elasticsearch. So every time we read from our stream, we update the index for the invoice and then re-run the aggregation to see if the outstanding balance has changed.
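To make the flow concrete, here's a minimal sketch of that recalculation step, assuming the Python Elasticsearch client and made-up index, type and field names (our real mapping differs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def recalculate_outstanding_balance(customer_id):
    # Sum the amounts of all outstanding invoices for this customer.
    resp = es.search(index="invoices", body={
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"customer_id": customer_id}},
            {"term": {"status": "outstanding"}},
        ]}},
        "aggs": {"outstanding": {"sum": {"field": "amount"}}},
    })
    balance = resp["aggregations"]["outstanding"]["value"]

    # Store the pre-calculated balance on the customer record so we can sort by it.
    es.update(index="customers", doc_type="customer", id=customer_id,
              body={"doc": {"outstanding_balance": balance}})
```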

However, due to the refresh interval (we leave it at the default, so it's 1000ms), we have to wait for a refresh to happen before we can aggregate. Luckily there's an easy way to do this in Elasticsearch: refresh=wait_for.
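In other words, the indexing request that precedes the aggregation passes the refresh parameter, roughly like this (reusing the hypothetical client and names from the sketch above):

```python
invoice = {"customer_id": "c-1", "status": "outstanding", "amount": 100.0}  # example doc

# Index the invoice and block until the change is visible to search,
# so that the aggregation run immediately afterwards sees it.
es.index(index="invoices", doc_type="invoice", id="inv-1",
         body=invoice, refresh="wait_for")
```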

We expected that the request that indexes the invoice with refresh=wait_for would wait 500ms on average (sometimes we index just before the refresh, sometimes just after, and everything in between) and never much more than 1000ms.

This was indeed the case in our staging environment; however, in production it is not. Production has a much larger dataset than staging: we have 172 active shards, 13 nodes, and ~1.2TB of data (including ~1TB in the invoices index that we are waiting on).

Here are some graphs to illustrate what I mean. First off, here's the number of times we update the invoices index, in 30-second increments over a 20-minute period. We've optimised it so we don't always use refresh=wait_for here, so not all of these updates are waiting for a refresh.

Over the same time period, here is the maximum query time in milliseconds for updating the document index.

Here is the average for the queries where we do a wait_for.

On average it's 2 seconds, and it goes up to 4-6 seconds, which is quite a lot for us as we're trying to get it close-ish to real time.

I saw this post: `refresh=wait_for` taking unexpectedly long, which seems like a similar issue, and there were many interesting discussions there; however, I couldn't find anything that would help us identify the root cause. I looked at my threads and there didn't seem to be a lot going on. I'm also a bit nervous about setting refresh=true on my queries like @_markus did, because the docs suggest we shouldn't, although it seems like it worked out fine in that case.

Interestingly, we are also using a bulk request to update, which in some instances updates a sub-object (there doesn't seem to be a correlation between that and the longer queries).
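For reference, the bulk update looks roughly like the sketch below (Python client again; the sub-object and ids are made up, and we only pass refresh="wait_for" on the calls that need to aggregate afterwards):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

actions = [
    {
        "_op_type": "update",      # partial update rather than a full re-index
        "_index": "invoices",
        "_type": "invoice",
        "_id": "inv-123",
        "doc": {"payment_details": {"status": "paid"}},  # sub-object update
    },
    # ... more invoice updates from the same batch of stream records
]

# Extra kwargs such as refresh are passed through to the underlying _bulk request.
helpers.bulk(es, actions, refresh="wait_for")
```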

So my question is - what kind of factors would cause this time to go up? I'm interested in understanding the potential technical reasons under the covers. Let me know if there's anything else that can provide more context, or if there's a better place to raise this!

Are you running Monitoring (from X-Pack Basic) on this cluster at all?

Are you using nested documents or parent-child relationships in your indices? If so, this thread may be applicable.

Hi @warkolm - we're using AWS Elasticsearch so unfortunately we can't use X-Pack. We are exporting a lot of metrics into Prometheus though; is there anything particular we should look at?

Hi @Christian_Dahlqvist - we are using parent>child relationships. I was confused that we sometimes get slow refreshes even when the document being updated has no children, but thinking it through, perhaps other documents in that refresh cycle have updates to children. This is a great lead, thank you! We're in the process of migrating away from parent>child relationships as we've had other problems with them in the past; perhaps if we finish that migration we will get a speedup here.

Out of interest, is there any metric or something I could look at to verify that this is the problem?

Hey @Rod_Howarth,

Thanks for tagging me initially! I hadn't responded yet because I wanted to see first whether any new findings came up. Since then, two things happened:

  • The performance problem I mentioned in my post months ago came up again and was deemed problematic enough to be investigated.
  • New information (at least to me) in connection with long refreshes surfaced: the parent-child relation.

I brought up parent-child in my thread but only late in the game and there were no further responses then.

Fast forward to the present: thanks to this thread and to reading more in the docs, especially about the _parent field and global ordinals, it's now clear that exactly this is the problem (these links reference ES 5.3 because that's the version I'm running). I was already able to prove this to myself with benchmarks; see below for details of what I did.

I'm not sure yet if I can get rid of parent-child. We introduced it ~9 months ago because the old system was problematic. Some parents have ~100k children, which is not feasible to always index together, especially when many children are added constantly (e.g. social feeds).

I learned that with ES 5.x the default is for global ordinals to be eagerly loaded at refresh time. For the _parent field it's possible to change that to lazy loading by specifying eager_global_ordinals=false.
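For anyone else landing here, in ES 5.x mapping terms that option sits on the child type's _parent field; a minimal sketch with placeholder index and type names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Create an index where the child type's _parent field builds global
# ordinals lazily at search time instead of eagerly on every refresh.
es.indices.create(index="my_index", body={
    "mappings": {
        "my_parent": {},
        "my_child": {
            "_parent": {
                "type": "my_parent",
                "eager_global_ordinals": False,
            }
        },
    }
})
```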

My (simplified) understanding of what happens:
This effectively moves the problem from index/refresh time to search time: the first time a query runs, the global ordinals are built. This has problems like the "unlucky first user", and when concurrent search requests hit, global ordinals may be rebuilt unnecessarily multiple times. I could not yet figure out whether that affects all queries negatively or only those with has_child queries.

What happens under the hood is: every time a refresh happens and parent documents in a shard have been updated, the global ordinals for that shard need to be rebuilt. In our case that's ~17 million parents across 5 shards, and the rebuild can currently take >1.4s, which is unbearable for our users.

Our use case is that some user actions require an immediate update and refresh of the index to re-display the data correctly. In hindsight, relying that much on ES from our frontend might have been a bad architecture decision, but it is what it is for now.

Also, we constantly add/update documents in the background (synchronized from our primary DB system). It's not a lot, an average of 50 docs/s with peaks at 100 docs/s, and these operations do not perform refreshes, but it still means that the global ordinals are invalidated very often.

My understanding of my options so far:

  • get rid of parent-child (duh!)
  • use more shards with fewer parent documents each (see the sketch after this list)
  • use a faster system (local SSD disks; we're currently running this instance virtualized…)
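For the shard option, the primary shard count is fixed at index creation, so it would mean reindexing into a new index, something like this sketch (the number is just an example, not a recommendation):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# More primary shards means fewer parent documents per shard, so each
# per-shard global-ordinals rebuild has less work to do after a refresh.
es.indices.create(index="my_index_v2", body={
    "settings": {
        "number_of_shards": 20,    # example value only
        "number_of_replicas": 1,
    }
})
```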

Sorry I can't provide genuinely new information; I've summarized my new learnings here for my own sake too.

cheers,

  • Markus

Oh, I forgot: even back when we introduced parent-child I was eager to hop on refresh=wait_for, but it turned out to have even greater latency than refresh=true (i.e. forcing it), so we actually went with the latter…

Thanks so much for your detailed reply @_markus! It sounds like you've tracked it down, and you've saved me from spending lots of time digging into things. It seems like something that would be good to have called out in the docs - maybe even just for the refresh=wait_for setting, to let you know that it can take a long time.

For our implementation we have parent-child relationships but are in the process of migrating away from them; we have already stopped querying them and are about to remove them from our index completely. I'll let you know how it goes!

I was about to write a more detailed benchmark here whilst tuning our ES system.

However, due to the uncertainties and our small team, we actually decided to remove the need for refresh=true (or wait_for) and to move away from using ES so intensively for showing real-time information, relying more on our primary store instead.

As such, all the preliminary benchmarks I had were only from a local dev system and they weren't exhaustive, but I could already measure that the smaller the shards are, the faster refresh=true is.

In hindsight it's easy to see that our production system isn't well tuned: relying on the default of 5 shards, which are now ~80GB each, and not having the fastest IO system is the source of our latency. On a test system, for a simple comparison, I tried 50 shards and the refreshes were much, much faster. But the search times also went up, as it had to wait for more shards and the test system was nowhere near properly tuned (single node, 6 CPUs, etc.), but it shows where to go.

We will revisit ES tuning at a later stage but we're first changing our architecture here.

  • Markus

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.