Hi,
I am facing a bit of trouble with something we're trying to do in our system and was hoping someone could give me some advice!
We've been using elastic search quite heavily in our architecture, for text search but also quite heavily for aggregations for financial reporting. It works really well - the querying is super fast and flexible. Our basic architecture is that someone saves a document to our main data store, we put it on a stream (kinesis) and then a consumer picks it up and indexes it in elastic search.
A new feature we implemented required us to pre-calculate an aggregation so that we could sort by it. We have 2 entities that are related (invoices and customers), and we want to be able to sort customers by the balance of the outstanding invoices. To do this, we have to pre-calculate the outstanding balance and store it with the customer record in elastic search. So every time we read from our stream, we update the index for the invoice and then re-run the aggregation to see if the outstanding balance has changed.
However due to the refresh interval (we leave it to default, so its 1000ms) we have to wait for the refresh to happen before we can aggregate. Luckily there's an easy way to do this in elastic search - refresh=wait_for.
We thought that by doing refresh=wait_for on the query that indexes the invoice would wait on average 500ms (sometimes we will index just before the refresh interval, sometimes just after, and everything in between) and never much more than 1000ms.
This was indeed the case in our staging environment, however when we run in production this is not the case. Production has a much larger dataset and size to our staging - we have 172 active shards, 13 nodes, and ~1.2tb of data, including 1tb in the invoices index that we are waiting for).
Here are some graphs to illustrate what I mean. First off, here's 30 second increments over a 20 minute period of the # of times we update the invoices index. We've optimised it so we don't always do refresh=wait_for here, so not all of them are waiting for the refresh.
Over the same time period, here is the maximum query time in milliseconds for updating the document index
Here is the average for the queries where we do a wait_for
On average its 2 seconds, and it goes up to 4-6 seconds, which is quite a lot for us as we're trying to get it close-ish to real time.
I saw this post: `refresh=wait_for` taking unexpectedly long which seems like a similar issue, and there were many interesting discussions there, however I couldn't find anything that would help us identify the root cause. I looked at my threads and there didn't seem like there was a lot going on. I'm also a bit nervous about setting refresh=true on my query like @_markus did because the docs suggest we shouldn't, however it seems like it worked out fine for that issue.
Interestingly, we are also using a bulk query to update, which updates a sub-object in some instances (there doesn't seem to be a correlation between that and the longer queries).
So my question is - what kind of factors would cause this time to go up? I'm interested in understanding the potential technical reasons under the covers. Let me know if there's anything else that can provide more context, or if there's a better place to raise this!