Merging of segments results in java.lang.OutOfMemoryError: Java heap space

On Sun, Aug 2, 2015 at 6:48 AM, Srinath_C (http://discuss.elastic.co/users/srinath_c) wrote:

@mikemccand thanks for taking a look at this. In these tests, the documents are
pretty much identical. Only a few fields, like timestamp, clientPort and clientIP,
might vary depending upon the client generating the data.

In these documents, all string fields have "index": "not_analyzed", while
the numeric fields have "norms": { "enabled": false }. For these tests, most
of the string fields remain the same; numeric fields might vary.

OK then I'm baffled. Are you using all default settings?

How large is Java's heap vs the total RAM on the machine?

Are there any APIs I could run to get us some insight?

The segments API tells you heap used by Lucene's segment readers, but from
looking at yours above, they look low.

Can you get a heap dump at OOME and then post a screenshot of the top live
objects by heap usage?
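
A common way to capture that automatically, assuming a HotSpot JVM (the dump path below is just an example), is to start Elasticsearch with these options, e.g. via ES_JAVA_OPTS:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdumps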

Mike McCandless

Java heap is configured to 50% of RAM.
I'll try to capture heap dumps and report back.

Here are a few logs and my configurations.
I'd like to add that each of the documents has an inner/nested document with around 10 fields of type long. Are there any special considerations or tunings w.r.t. that?

In addition to your existing segment merge tuning, I would suggest reducing max_merged_segment from 5g to something like 1g, given your maximum heap size of 4g.

The durations of around 15+ hours you reported suggest to me that your system works smoothly up to a certain point, but then segment merging passes a critical threshold and falls over. One reason may be that the segment sizes no longer fit the memory resources allocated for them, or that the merge CPU capacity cannot keep up with the ongoing indexing.

One strategy would be to add more nodes or more memory; another would be to re-adjust the tuning and streamline segment merging so it operates smoothly under the given limited memory resources. For more aggressive segment merging, you could increase the max_thread_count of the merge scheduler from only 4 to a higher value depending on your CPU core count, maybe something like 16. This should take some pressure off the heap because it speeds up the concurrent merge processes.

On my systems with 32 CPU cores, I use

index:
  merge:
    scheduler:
      type: concurrent
      max_thread_count: 16
    policy:
      max_merged_segment: 1gb
      segments_per_tier: 4
      max_merge_at_once: 4
      max_merge_at_once_explicit: 4

Of course, you should monitor the segment counts and sizes while indexing is ongoing to confirm they match your requirements.
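
For that kind of monitoring, the segments APIs give a quick per-shard view of segment counts, sizes and memory, for example:

GET /_cat/segments?v
GET /_segments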

I've tried with max_merged_segment = 2g as well as with heap sizes of 4G (c3.xlarge) to 7G (m3.xlarge). All ended up with the same result.

From what I can see in Marvel, even one long GC run destabilizes the entire indexing activity after that, and it doesn't recover for a long time. Maybe increasing max_thread_count works for your hardware spec, but I'm running with only 4 cores and I'm not sure that would be a good idea.

Here are some screen captures from Marvel. They were captured from a system that was no longer indexing documents after about 8 hours.

Hi @Srinath_C

You appear to be using bulk indexing. How many documents do you send at once, and roughly how big is each document? Also, are you using delete by query? Or mapping transform with a script? Any scripts anywhere else?

Have you configured any thread pools or queues? If so, what are they? Are you running multiple instances of ES on one server, or on a VM (if so which VM?).

Is swap disabled, or is mlockall enabled (look for mlockall: true in GET /_nodes)?

</random questions>

Hi @Clinton_Gormley

I have been sending around 10k documents at once, each having around 81 fields. For the tests, the size of each document was around 1.5k, but in a real scenario it could be much larger.

There are no deletes by query, but we do have a TTL of 24 hrs, so deletes do happen in the background. Scripts are used for queries, but during these tests we did not run any queries. No mapping transforms.

You can take a look at the entire configuration here. I've only configured threadpool.bulk.queue_size: 100. Each node runs on a dedicated EC2 instance of type c3.xlarge. I have also tried m3.xlarge with the same results.

And mlockall is enabled while swap is disabled.

Hi @Srinath_C

You're not using SSDs, correct? Nor do you have provisioned IOPS? I think you're overwhelming your disks, and Elasticsearch is paying the price because you're trying to bulk things up in memory. Your index_buffer_size is very high, leaving little space for anything else. Then you're queueing up lots of big bulk requests which also take more memory.

Try:

  • send smaller bulk requests, e.g. 1,000 docs at a time
  • I think you're using your own doc IDs; if you can use auto-generated IDs, that's better
  • set indices.memory.index_buffer_size to 20% (see the snippet after this list)
  • set your refresh_interval higher if you can (e.g. 30s or 60s)
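
For reference, the buffer and refresh suggestions above would look something like this in settings form (30s is just an example value):

indices.memory.index_buffer_size: 20%
index.refresh_interval: 30s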

Remove the following settings:

index.codec.bloom.load: false
index.compound_format: false
index.compound_on_flush: false
index.merge.policy.max_merge_at_once: 4
index.merge.policy.max_merge_at_once_explicit: 4
index.merge.policy.max_merged_segment: 5gb
index.merge.policy.segments_per_tier: 4
index.merge.policy.type: tiered
index.merge.scheduler.max_thread_count: 4
index.merge.scheduler.type: concurrent
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 2gb
index.translog.interval: 20s
indices.store.throttle.type: none
threadpool.bulk.queue_size: 100

Why are you using TTL? Wouldn't it be better to have an index per day, and just delete the old indices when you're done? It saves a lot of I/O.
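
For example (the index names here are made up), with an index per day the application writes to something like logs-2015.08.02, and expiring old data becomes a single cheap call:

DELETE /logs-2015.08.01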

@Clinton_Gormley we do use instance store SSDs - they come with xlarge instances on AWS.
We did reduce the bulk size - we brought it down to a max of 5Mb as per suggestions on this list and also found it contributed to the overall stability, especially in reducing the number of bulk request rejections. Our IDs are auto-generated.

I guess in case of SSDs it makes sense to retain:
index.merge.policy.max_merge_at_once: 4
index.merge.policy.max_merge_at_once_explicit: 4
index.merge.policy.max_merged_segment: 2gb
index.merge.policy.segments_per_tier: 4
index.merge.policy.type: tiered
indices.store.throttle.type: none

We've reduced index.merge.policy.max_merged_segment to 2gb.

We haven't actually experimented using an index for every day, but we can try that as well. Thanks for all your suggestions.

Hi @Srinath_C

We did reduce the bulk size - we brought it down to a max of 5Mb as per suggestions on this list and also found it contributed to the overall stability, especially in reducing the number of bulk request rejections.

OK - I'd perhaps reduce each bulk request size even more. Just for some background: you send a bulk request to the coordinating node, which splits it into a bulk request for each involved shard. Each shard-level bulk is executed serially.

If instead you have more, smaller bulks, you will end up with more shard-level bulk requests, each one doing its job serially, but running in parallel with the others. Each bulk request also takes up less memory (because it's smaller), both for the request and for the response.

If you have a high bulk queue size, you're just using up lots of memory to queue up bulks, when you should really handle that on the application side. If you're getting bulk rejections, it means that ES isn't keeping up, and you should back off in your application and retry later.
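
A minimal sketch of that back-off-and-retry idea, assuming the 1.x Java client (the helper class, retry count and sleep times are made up, and a real implementation would resend only the rejected items rather than the whole bulk):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;

// Hypothetical helper: retry a bulk request a few times with exponential
// backoff when the response reports failures (e.g. rejections from a full
// bulk queue). Naive on purpose - it resends the whole bulk each time.
public class BulkWithBackoff {
    public static BulkResponse indexWithBackoff(BulkRequestBuilder bulk)
            throws InterruptedException {
        BulkResponse response = bulk.execute().actionGet();
        for (int attempt = 1; response.hasFailures() && attempt <= 5; attempt++) {
            Thread.sleep(1000L << attempt); // 2s, 4s, 8s, 16s, 32s
            response = bulk.execute().actionGet();
        }
        return response;
    }
}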

Reducing the index buffer size from 50% to 20% also frees up 30% of your heap space, which means that ES has more room to handle memory spikes from merging/querying/whatever.

I guess in case of SSDs it makes sense to retain:

Definitely set indices.store.throttle.type: none.

I would, however, delete all the merge settings. These are expert settings and are not easy to reason about. I also don't think they're the source of your problem. Setting index.merge.policy.max_merged_segment to 2gb is just going to result in more segments, which will slow down search. If merges are not keeping up, then Elasticsearch will throttle indexing down to one thread, which should cause your application to back off and allow things to catch up.

Honestly, just sorting out your memory issues will cause a massive improvement because the JVM won't be spending lots of time trying to find a few extra bytes.

You may also want to look at changing the linux scheduler from CFQ to noop. While CFQ is supposed to do the right thing, SSDs are not always detected correctly by the OS and quite often you can get better SSD throughput with the noop scheduler.
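
For reference, one way to check and switch the scheduler at runtime (the device name here is just an example, and the change does not persist across reboots):

cat /sys/block/xvdb/queue/scheduler
echo noop | sudo tee /sys/block/xvdb/queue/scheduler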

We haven't actually experimented using an index for every day,

Definitely worth doing. Expiring documents means marking them as deleted, then doing a merge to remove deleted documents. Probably not as bad as it could be, because you're probably dropping whole segments, but an index per day will definitely be more efficient.

Hi @Clinton_Gormley, thanks so much for the detailed reply. It really helps to get an insight.
I'll configure as you suggested and run the tests.

Just to clarify, the bulk import is being done using a NodeClient. I guess the split of a bulk request into shard-level bulk requests would then happen on the client side and be dispatched to the nodes holding the primary shards. Is that correct?

Hi @Clinton_Gormley,

Today, I ran tests on two setups:
setup-a: on one setup I removed all the index.merge.policy.* settings
setup-b: on another setup I retained them as is (see links in previous replies)

The observation was that performance is better in setup-b than in setup-a. In setup-a, the CPU utilization went as high as 60-70%, and after some time, when merging kicked in, I started to see bulk request rejections. In setup-b, however, the CPU utilization always remained < 10% from the start.

The difference is amusing. I hope you had a chance to see the conversation I had a year back with Mike - he had suggested these settings then. Could there be anything I'm missing? These results suggest that these settings really do make a huge difference.

Hi @Srinath_C

I tend to recommend resetting everything to default because people have usually cargo-culted settings and removing them clears a lot of noise. Like I said, they're expert settings, but Mike is pretty much as expert as they get :slight_smile:

Also, +1 for actually testing the difference, instead of just blindly changing things. Be aware that the implementation of these things change from time to time, so I would revisit any settings that you have with later versions.

Did my other suggestions resolve the OOMs you were having?

Hi @Clinton_Gormley, the tests are still in progress. Will update back with my findings.

Actually, I haven't done a lot of testing that combines sustained indexing activity with query activity. Maybe your suggestions will show more benefit once there is a lot of data and we start querying over it. I'm also trying the noop I/O scheduler and will report back on that.

The OOMs didn't happen again - I'm attributing that to the reduced bulk sizes and concurrency levels. I hope they don't show up when I run the tests with queries.

Hi @mikemccand, @Clinton_Gormley,

On a new test setup, I'm unable to get consistent performance from my cluster.
This cluster was set up with all the learnings from the previous runs - 3 master nodes (c3.large instances), 5 data nodes (m3.xlarge instances). After ingesting documents at a rate of 4k per second for about 7 hours, we started to see long GC pauses again. After that, we were never able to index records at that rate consistently.

Finally, the nodes ended up with heap dumps. I have captured suspected leaks using the memory analyzer tool. Please take a look at the reports here. Output of _nodes API is available here.

Just to summarize: the cluster is version 1.7.1, with 3 dedicated master nodes, 5 data nodes, and 3 external NodeClient processes bulk inserting with a concurrency of 1 using BulkProcessor. Each bulk is configured to flush at 5Mb, 5000 documents, or 10 seconds, whichever comes first.
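
For reference, that BulkProcessor setup would look roughly like this with the 1.x Java API (the class name and empty listener bodies are just placeholders):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkSetup {
    public static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            public void beforeBulk(long id, BulkRequest request) { }
            public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                // a good place to log response.hasFailures()
            }
            public void afterBulk(long id, BulkRequest request, Throwable failure) {
                // a good place to log the failure
            }
        })
        .setBulkActions(5000)                                // flush at 5000 documents
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))  // ... or 5Mb
        .setFlushInterval(TimeValue.timeValueSeconds(10))    // ... or every 10 seconds
        .setConcurrentRequests(1)                            // one bulk in flight at a time
        .build();
    }
}

Setting setConcurrentRequests(0) would make the flushes fully synchronous, which is effectively what the later experiment with synchronous bulk inserts does.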

I'm not sure what is missing in the new test. All configurations (elasticsearch.yml) seem to match the earlier setup where we had no issues.

Hi @Srinath_C

Could you upload a diagnostics dump somewhere? https://github.com/elastic/elasticsearch-support-diagnostics

thanks

Hi @Clinton_Gormley,

I hadn't had a chance to run these tests for quite some time. After looking into it further, it seems that over time the bulk queue was growing - probably due to the asynchronous bulk insertions. CPU utilisation, though, was <50% on each node. The requests sitting in the bulk queues are probably causing the frequent and long GC pauses.
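
One way to watch that queue growing over time (assuming the cat thread_pool API is available on this version) is something like:

GET /_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected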

Would that be a convincing assessment? I could upload the diagnostics output, but it'll need some scrubbing as it captures AWS keys/secrets and IP addresses.

I've tried synchronous bulk inserts, but I'm seeing high and variable response times - I'll probably bring that up in a different thread.

Thanks.

Very nice thread - glad I read through it. @Clinton_Gormley's comments on bulk indexing explain some things I had seen in the past. Coming from Solr, I initially started with very large batches, which worked well in that environment. With ES, however, I found that while bulking definitely improved things, there were measurable diminishing returns from large batch sizes, and I settled on a relatively small batch size of 10 documents. This has continued to work well, and the explanation above shows why.

One thought that I don't see mentioned here regards your EC2 instance size. I would recommend that you at least try a cluster of m3.2xlarge data nodes at half the size of your current cluster - either 2 or 3 nodes, maybe both in separate tests. The reason is that the OS has some fixed memory overhead that does not grow as you add memory. When sizing, I generally assume this overhead to be around 2GB with modern Linux; that may be high, but it has worked well for me. That 2GB is about 13% of the 15GB on an m3.xlarge, but only about 6% of the 30GB on an m3.2xlarge. In other words, with m3.xlarge you sacrifice 13% of your memory on each node, whereas with m3.2xlarge you sacrifice only 6%, leaving more memory available for caching and heap. With the smaller instance sizes (8GB or less), I don't think you should allocate 50% to heap but rather closer to (8GB - 2GB)/2.
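
As a rough worked example of that sizing arithmetic, using the figures above:

m3.xlarge  (15GB): 2GB OS overhead ≈ 13% of RAM; 50% heap ≈ 7.5GB
m3.2xlarge (30GB): 2GB OS overhead ≈ 6% of RAM; 50% heap ≈ 15GB
8GB node:          (8GB - 2GB) / 2 = 3GB heap rather than 4GB at 50%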

I recently made a similar shift myself, moving from c3.xlarge to m3.xlarge, and the results more than justified the additional expense. My cluster acts as if it were twice the size; I look forward to being able to justify the move to m3.2xlarge.

Thanks for the inputs @aaronmefford, I'll see if I can quickly run through those tests.

Couple of questions:
a, Can you provide more details on your bulk sizes - size, frequency, synchronous/asynchronous, etc.? I'm facing some challenges with the response times from bulk indexing.
b, m3.2xlarge has 8 cores whereas m3.xlarge has 4. Do you think the number of cores made the difference, or is the extra heap space the truly differentiating factor?
c, Were you able to figure out what goes into the 2GB of space? It would appear that instance types like c3.large (4G memory) aren't even a good choice for Elasticsearch.

Couple of questions:
a, Can you provide more details on your bulk sizes - size, frequency, synchronous/asynchronous, etc.? I'm facing some challenges with the response times from bulk indexing (https://discuss.elastic.co/t/bulk-import-response-times/27918).

Sorry, I don't currently have solid numbers on response times in this area. My indexing is the reduce phase of a Map/Reduce job, so it is parallel, but I limit the number of reducers to 1 or 2 per node in the cluster. I was looking at the code today and it is currently configured at 100 messages per batch. It was a few years ago that I did substantial testing of batch sizes and found the diminishing returns with larger batches.

b, m3.2xlarge has 8 cores whereas m3.xlarge has 4. Do you think the number of cores made the difference, or is the extra heap space the truly differentiating factor?

In my case I moved from c3.xlarge to m3.xlarge, so memory was the real difference; the cluster size did not change. In your case, I think you would achieve a similar result by reducing the number of nodes when you step up to m3.2xlarge or m4.2xlarge. This keeps your costs and core count similar but doubles your RAM per node, improving the efficiency of that RAM.

c, Were you able to figure out what goes into the 2GB of space? It would appear that instance types like c3.large (4G memory) aren't even a good choice for Elasticsearch.

The 2GB is just a rule of thumb that I use as a buffer. It is loosely based on other best practices I have encountered over the years, but more so on the reality that the OS needs some resources too, and if you don't leave enough for the OS you will have issues. Also, remember that anything the OS doesn't use directly it will use for file caching, which significantly benefits Elasticsearch as well, so you aren't really losing anything by leaving this buffer. I have run Elasticsearch on smaller instances; it works for small projects with small requirements. But yes, I would strongly recommend against anything under 8GB, especially if you have more than 2 nodes - i.e. scale up a bit before scaling out. A 16-node cluster of m3.mediums is nonsense. In my consulting practice I strongly recommend targeting 64GB per node, and scaling up instead of out until you reach that size. I really wish that AWS wasn't so stingy with memory, but that is a common challenge in all virtualized environments.