Continuing after Java heap space runs out

Hello, I am running many large bulk index requests in a short time period for testing purposes, and I have run into Elasticsearch crashing because the Java heap fills up. I see that there are ways to reduce the chances of this happening, but I would like to know what my options are for recovering the index and determining what has changed since the bulk call. It seems like the common advice is to delete the index and rebuild it, but that would be too costly, and I would rather attempt to fix the index if possible.

The only way would be to restart ES and then resend the bulk request you didn't get a response for, along with any others.

Thanks for the reply.

One problem I am encountering with this solution (albeit in testing, where I'm using an intentionally small heap of 300MB) is that the heap fills up while trying to recover the index. I notice that this problem is reduced when I am running multiple nodes in my ES cluster, since another node's shards get promoted to primary, but there is still the case where a node does need to recover and gets stuck in a crash cycle. The only solution I have for this is to increase the heap size (at 600MB it seems to always recover). I realize this may not be a problem with normal heap sizes (say, the default of 2GB), but we are trying to plan how to avoid this sort of error and, if possible, recover gracefully from it.

I also suspect there may be more to this issue, depending on what sort of call I am making. For example, say I make an update by query request. Then, depending on the query and script, it may affect documents differently when run a second time. I suspect there are recommendations against any call whose result isn't consistent, but are there any known solutions for recovering if that is the case?

I have read through the Circuit Breaker page (https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html) and plan on tweaking some of the values to see if this may be a solution.
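For example, something along these lines is what I had in mind for tightening the limits via the cluster settings API (a rough sketch with the 8.x-style Python client; the percentages are placeholders I would experiment with, not recommendations):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Lower the parent and per-request circuit breaker limits so oversized
# requests are rejected with an exception instead of exhausting the heap.
# The percentages below are placeholders to experiment with, not advice.
es.cluster.put_settings(
    persistent={
        "indices.breaker.total.limit": "60%",
        "indices.breaker.request.limit": "40%",
    }
)
```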

Sorry for the long-winded response, hopefully you or someone else could be of some more help!

Providing a reasonable heap size is likely the key here.
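For reference, the heap is configured in config/jvm.options (or through the ES_JAVA_OPTS environment variable); a fixed 2GB heap, with min and max set equal, looks roughly like this:

```
# config/jvm.options -- min and max heap set to the same value
-Xms2g
-Xmx2g
```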

The call is consistent; the outcome may not be.
Why do you think this is a problem for recovery?

I think that tweaking circuit breakers on a few hundred meg of JVM heap is a bit of a time waste, relative to simply giving ES GBs instead.

I'm worried about calls in which the result may not be consistent because of the following scenario:

  • Make a call to increment a value by 1 in matching queried documents.
  • Heap fills up and ES crashes before all documents are updated.
  • Restart ES.

In this case I can't simply rerun the update by query script because then some documents will have been incremented by 2.
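To make that concrete, the kind of call I mean looks roughly like this (a sketch with the 8.x-style Python client; the index name, query, and counter field are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Increment a counter on every document matching the query. If ES crashes
# partway through and this exact call is re-run after restart, documents
# that were already updated get incremented a second time.
es.update_by_query(
    index="my-index",                       # placeholder index name
    query={"term": {"status": "active"}},   # placeholder query
    script={
        "source": "ctx._source.counter += params.step",
        "lang": "painless",
        "params": {"step": 1},
    },
    conflicts="proceed",
)
```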

I realize that tweaking the circuit breakers on such low memory is a waste, but we have seen ES crash in larger-scale tests with 2GB of heap, so I'm thinking about changing them there. The 2GB testing was an extended test, sending data at a high rate for a long period of time, and ES crashed around the 1-hour mark. The test case is a likely scenario we have to face.

Thanks for the extended help.

Maybe try splitting the bulk into smaller chunks?

But again, you're optimising for the wrong thing here.

That does in fact give the cluster a much, much longer lifetime. Sending smaller bulk requests at roughly the same throughput gives about 5x the longevity before ES crashes. With requests of 10k documents, each roughly 1KB in size, the cluster lasts about an hour; splitting the same 10k documents into 100 requests of 100 documents each, it lasts about 5 hours. This comes at a small performance hit, but it will be nice to keep in mind for our use case, thank you!
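For anyone else who hits this, the chunked version is essentially just the following (a sketch using the Python client's streaming_bulk helper; the index name and document generator are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(docs, index="my-index"):        # placeholder index name
    for doc in docs:
        yield {"_index": index, "_source": doc}

# streaming_bulk sends the same documents in batches of 100 instead of one
# 10k-document request; each (ok, item) pair reports a single document.
def index_in_chunks(docs):
    for ok, item in helpers.streaming_bulk(es, actions(docs), chunk_size=100):
        if not ok:
            print("failed:", item)
```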

My theory is that if we can construct a test on a very small scale where ES doesn't break, then ES won't break when we scale up, assuming we follow the guidelines we learned from testing. Perhaps I'm wrong in assuming that if the problem goes away at a small scale, we won't see it when the cluster and payload are scaled up as well.
