Merging of segments results in java.lang.OutOfMemoryError: Java heap space

Hi,

We found an OOM error when elasticsearch was merging segments.

Our cluster:
4 c3.xlarge instances on AWS running Elasticsearch 1.7.0.
There is continuous indexing of small documents (~5 KB) at a rate of 4000 per second.
Documents are being written to ES from a Storm cluster of 3 nodes.

Here are the JVM args used:
-server -Djava.net.preferIPv4Stack=true
-Xms4479m -Xmx4479m -Xss256k -XX:NewRatio=1
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError

After some period of time we found that the cluster was in a yellow state.
Looking into the logs, we found OOM errors on one of the nodes; tracing back to the first occurrence of this error, we found the following exception:

[2015-07-25 12:42:54,907][WARN ][transport.netty          ] [metrics-datastore-4-es-tune] Message not fully read (request) for requestId [129230], action [indices:data/write/bulk[s]], readerIndex [64362] vs expected [202995]; resetting
[2015-07-25 12:42:55,258][ERROR][marvel.agent.exporter    ] [metrics-datastore-4-es-tune] create failure (index:[.marvel-2015.07.25] type: [node_stats]): EsRejectedExecutionException[rejected execution (queue capacity 100) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1@1ec774e7]
[2015-07-25 12:43:01,545][INFO ][monitor.jvm              ] [metrics-datastore-4-es-tune] [gc][old][28074][1011] duration [6.1s], collections [1]/[6.5s], total [6.1s]/[46m], memory [4gb]->[3.8gb]/[4.1gb], all_pools {[young] [1.7gb]->[1.6gb]/[1.7gb]}{[survivor] [122.3mb]->[0b]/[224mb]}{[old] [2.1gb]->[2.1gb]/[2.1gb]}
[2015-07-25 12:43:02,508][WARN ][index.merge.scheduler    ] [metrics-datastore-4-es-tune] [asdf3003444][0] failed to merge
org.apache.lucene.store.AlreadyClosedException: refusing to delete any files: this IndexWriter hit an unrecoverable exception
        at org.apache.lucene.index.IndexFileDeleter.ensureOpen(IndexFileDeleter.java:354)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:719)
        at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:451)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3826)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
        at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.packed.PackedInts.getReaderIteratorNoHeader(PackedInts.java:865)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$ChunkIterator.next(CompressingStoredFieldsReader.java:471)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:368)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4223)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
        ... 3 more

Questions:
1. This seems to suggest that merging resulted in the OOM error. Does merging load the segments into memory while merging?
2. What is a safe value for index.merge.policy.max_merged_segment? In this setup it was set to 5g, with index.merge.policy.max_merge_at_once: 4. Are we expected to configure these settings based on the JVM heap size?
3. Looking into other settings, what is a good value for index.translog.flush_threshold_size?

Merging does not load entire segments into memory, but it does need some memory. The stack trace points to a place where some memory is allocated, but that allocation is so small that I don't think it is the actual root cause.

The settings you mention are really low-level and I don't think they are related to the failure you are seeing. Changing them might help avoid the issue but it would also probably introduce undesirable side effects.

Do you have some understanding of how Elasticsearch uses your memory already? If not, the segments API would help you see how much memory is used by the Lucene index, and the stats API could help you understand how much memory fielddata and the filter cache use. They might already be using enough memory for merging to fail even though its own memory requirements are small. Another common source of out-of-memory errors when merging is having many fields with norms; is that your case? The reason norms are an issue is that in 1.x they are entirely memory-resident, and merging needs to load them even if you never search on those fields.
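
For reference, a minimal sketch of those two checks (assuming a node listening on the default HTTP port):

curl 'localhost:9200/_segments?pretty'            # per-shard Lucene segments; look at the memory_in_bytes values
curl 'localhost:9200/_nodes/stats/indices?pretty' # per-node stats, including the fielddata and filter_cache sections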

Thanks @jpountz for looking into this. Most of these settings were established through discussion on the Elasticsearch Google group - see thread. They worked fine for the amount of data I had then, but the document size has since grown to more than 1.5 times what it was, and we have ended up with stability issues again.

In our mappings, norms are disabled for all fields of type boolean/long, and all string fields are "not_analyzed", so norms are disabled for them as well. Norms may not be an issue in my case.

The fielddata cache is limited to 10% of the heap in my setup.

OK, so norms are fine and the fielddata cache should be limited enough. Can you still check the actual memory usage as returned by the segments API, and by the stats API for fielddata and the filter cache?

@jpountz thanks for the inputs, here are the segments and fielddata info.
Let me know what you infer. Thanks.

By the way, one of the nodes hit a heap space error, and this information was captured after restarting the cluster and bringing it back to a green state.

@jpountz another round of tests failed with heap space errors.

Have captured another set of segments info. Hope that helps.

Hey guys,

I have run a few more rounds of tests, and they end with one node or another going out of memory after a certain duration. Here are my inferences from the tests:

  • One or more nodes go OOM after a certain period of consistent indexing traffic (~3k documents per second)
  • Increasing the traffic leads to the OOM situation sooner
  • Increasing the heap only delays the OOM to a later point in time
  • Once a node goes OOM, indexing activity is severely impacted across the cluster
  • Manual intervention is necessary to restart the affected node

Here are some test summaries:

  1. Test 1

AWS Oregon region, 6 nodes of m3.xlarge, 3 indices
Total test duration: 19 hours
- total load of 3000 tps, split equally across the 3 indices
- cluster membership issues seen after 19 hrs of smooth running
- reasons for instability: long GC pauses on nodes, cluster instability after 19 hrs

  2. Test 2

AWS N. California region, 6 nodes of m3.xlarge, 3 accounts, _source disabled
Total test duration: 6 hours
- total load of 5000 tps, split equally across the 3 indices
- initial state: CPU at 10%
- final state: memory 30G/57G across the cluster, CPU at ~50%, data 31G
- cluster membership issues seen after 6 hrs of smooth running
- reasons for instability: long GC pauses on nodes, cluster instability after 6 hrs

  3. Test 3

AWS Oregon region, 1 dedicated master node (m3.xlarge), 5 data nodes of m3.xlarge, 3 accounts
Total test duration: 18 hours, total load of 3000 tps
- initial state: CPU at ~10%, memory 17G/57G across the cluster
- final state: CPU at 40%, memory 27G/57G across the cluster, data 87G
- cluster membership issues seen after 18 hrs of smooth running

Has anyone been able to achieve better results with a similar cluster capacity? I seem to have hit a limit.

Indexing at any rate should never lead to heap exhaustion, though some search-time data structures will consume heap in proportion to the number of documents or the number of unique values in a field.

But: what kind of documents are you indexing? Are you e.g. using random field names or something? Lucene has a fixed, non-trivial heap usage for each unique field ... maybe you're hitting that.

Mike McCandless

@mikemccand thanks for taking a look at this. In these tests, the documents are pretty much identical. Only a few fields, like timestamp, clientPort and clientIP, might vary depending upon the client generating the data.

In these documents, all string fields have "index": "not_analyzed", while the numeric fields have "norms": { "enabled": false }. For these tests, most of the string field values remain the same; numeric field values might vary.
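
For illustration, a minimal sketch of a mapping with those options (the index, type and field names are placeholders that mirror the fields mentioned above):

curl -XPUT 'localhost:9200/metrics-test' -d '{
  "mappings": {
    "event": {
      "properties": {
        "clientIP":   { "type": "string", "index": "not_analyzed" },
        "clientPort": { "type": "long",   "norms": { "enabled": false } },
        "timestamp":  { "type": "date" }
      }
    }
  }
}'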

Are there any APIs that I could run to get more insight?

Regards,
Srinath.

OK then I'm baffled. Are you using all default settings?

How large is Java's heap vs the total RAM on the machine?

Are there any APIs that I could run to get more insight?

The segments API tells you the heap used by Lucene's segment readers, but from looking at yours above, those numbers look low.

Can you get a heap dump at OOME and then post a screenshot of the top live objects by heap usage?
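
For example, one way to do this (paths and the PID are placeholders; the -XX:+HeapDumpOnOutOfMemoryError flag is already in your JVM args):

# tell the JVM where to write the automatic dump when an OOME occurs
-XX:HeapDumpPath=/path/to/dumps

# or take a quick class histogram from a running node
jmap -histo:live <es-pid> | head -n 30

The resulting .hprof dump can then be opened in a tool such as Eclipse MAT or VisualVM to inspect the largest live objects.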

Mike McCandless

Java heap is configured to 50% of RAM.
I'll try to capture heap dumps and report back.

Here are a few logs and my configurations.
I'd like to add that each of the documents has an inner/nested document with around 10 fields of type long. Are there any special considerations or tunings with respect to that?

In addition to your existing segment merge tuning efforts, I would suggest reducing max_merged_segment from 5g to something like 1g, given your maximum heap size of around 4g.

The durations of more than 15 hours that you reported suggest to me that your system works smoothly up to a certain point, but then segment merging passes a critical threshold and falls over. One reason may be that the segment sizes no longer fit the memory resources allocated for them, or that the CPU available for merging cannot keep up with the ongoing indexing.

One strategy would be to add more nodes or more memory; another would be to re-adjust the tuning and streamline segment merging so that it operates smoothly under the given, limited memory resources. For more aggressive segment merging, you could increase the max_thread_count of the merge scheduler from 4 to a higher value depending on your CPU core count, maybe something like 16. This should take some pressure off the heap because it speeds up the concurrent merge processes.

On my systems with 32 CPU cores, I use

index:
  merge:
    scheduler:
      type: concurrent
      max_thread_count: 16
    policy:
      max_merged_segment: 1gb
      segments_per_tier: 4
      max_merge_at_once: 4
      max_merge_at_once_explicit: 4

Of course, you should monitor the segment counts and sizes while indexing is ongoing, to check whether they match your requirements.

I've tried with max_merged_segment = 2g, as well as with heap sizes from 4G (c3.xlarge) to 7G (m3.xlarge). All ended up with the same result.

From what I can see in Marvel, even a single long GC run destabilizes the entire indexing activity, and it doesn't recover for a long time. Increasing max_thread_count may work for your hardware spec, but I'm running with only 4 cores and I'm not sure it would be a good idea here.

Here are some screen captures from Marvel. They were captured on a system that was no longer indexing documents after about 8 hours.

Hi @Srinath_C

You appear to be using bulk indexing. How many documents do you send at once, and roughly how big is each document? Also, are you using delete-by-query? Or a mapping transform with a script? Any scripts anywhere else?

Have you configured any thread pools or queues? If so, what are they? Are you running multiple instances of ES on one server, or on a VM (if so, which VM)?

Is swap disabled, or is mlockall enabled? (Look for mlockall: true in GET /_nodes.)
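
For example, something like this should show it (assuming the default HTTP port):

curl 'localhost:9200/_nodes/process?pretty'   # look for "mlockall": true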

</random questions>

Hi @Clinton_Gormley

I used to send around 10k documents per bulk request, each having around 81 fields. For these tests, the size of each document was around 1.5k, but in a real scenario it could be much larger.

There are no deletes by query, but we do have a TTL of 24 hrs, so deletes do happen in the background. Scripts are used for queries, but during these tests we did not issue any queries. No mapping transforms.

You can take a look at the entire configuration here. I've only configured threadpool.bulk.queue_size: 100. Each node runs on a dedicated EC2 instance of type c3.xlarge. I have also tried m3.xlarge with the same results.

And mlockall is enabled while swap is disabled.

Hi @Srinath_C

You're not using SSDs, correct? Nor do you have provisioned IOPS? I think you're overwhelming your disks, and Elasticsearch is paying the price because you're trying to buffer so much up in memory. Your index_buffer_size is very high, leaving little space for anything else. Then you're queueing up lots of big bulk requests, which also take more memory.

Try:

  • send smaller bulk requests, e.g. 1,000 docs at a time
  • I think you're using your own doc IDs; if you can use auto-generated IDs, even better
  • set indices.memory.index_buffer_size to 20% (sketched below)
  • set your refresh_interval higher if you can (e.g. 30s or 60s)
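
A minimal sketch of those two settings (indices.memory.index_buffer_size is a node-level setting in elasticsearch.yml; refresh_interval is a per-index setting):

indices.memory.index_buffer_size: 20%
index.refresh_interval: 30s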

Remove the following settings:

index.codec.bloom.load: false
index.compound_format: false
index.compound_on_flush: false
index.merge.policy.max_merge_at_once: 4
index.merge.policy.max_merge_at_once_explicit: 4
index.merge.policy.max_merged_segment: 5gb
index.merge.policy.segments_per_tier: 4
index.merge.policy.type: tiered
index.merge.scheduler.max_thread_count: 4
index.merge.scheduler.type: concurrent
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 2gb
index.translog.interval: 20s
indices.store.throttle.type: none
threadpool.bulk.queue_size: 100

Why are you using TTL? Wouldn't it be better to have an index per day, and just delete the old indices when you're done? It saves a lot of I/O.
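
A rough sketch of that pattern (the index names here are placeholders):

# write each day's documents to a date-stamped index, e.g. metrics-2015.07.25,
# then drop an entire day at once instead of relying on TTL deletes:
curl -XDELETE 'localhost:9200/metrics-2015.07.24'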

@Clinton_Gormley we do use instance-store SSDs - they come with the xlarge instances on AWS.
We did reduce the bulk size - we brought it down to a max of 5 MB as per suggestions on this list - and found it contributed to overall stability, especially in reducing the number of bulk request rejections. Our IDs are auto-generated.

I guess in the case of SSDs it makes sense to retain:
index.merge.policy.max_merge_at_once: 4
index.merge.policy.max_merge_at_once_explicit: 4
index.merge.policy.max_merged_segment: 2gb
index.merge.policy.segments_per_tier: 4
index.merge.policy.type: tiered
indices.store.throttle.type: none

We've reduced index.merge.policy.max_merged_segment to 2gb.

We haven't actually experimented with using an index for every day, but we can try that as well. Thanks for all your suggestions.

Hi @Srinath_C

We did reduce the bulk size - we brought it down to a max of 5 MB as per suggestions on this list - and found it contributed to overall stability, especially in reducing the number of bulk request rejections.

OK - I'd perhaps reduce each bulk request size even more. Just for some background: you send a bulk request to the coordinating node. It splits it into a bulk request for each involved shard. Each shard-level bulk is executed serially.

If instead you have more, smaller bulks, you will end up with more shard-level bulk requests, each one doing its job serially, but they run in parallel. Each bulk request also takes up less memory (because it's smaller), both for the request and for the response.

If you have a high bulk queue size, you're just using up lots of memory to queue up bulks, when you should really handle that on the application side. If you're getting bulk rejections, it means that ES isn't keeping up, and you should back off in your application and retry later.

Reducing the index buffer size from 50% to 20% also frees up 30% of your heap space, which means that ES has more room to handle memory spikes from merging/querying/whatever.

I guess in case of SSDs it makes sense to retain:

Definitely set indices.store.throttle.type: none.

I would, however, delete all the merge settings. These are expert settings and are not easy to reason about. I also don't think they're the source of your problem. Setting index.merge.policy.max_merged_segment to 2gb is just going to result in more segments, which will slow down search. If merges are not keeping up, then Elasticsearch will throttle indexing down to one thread, which should cause your application to back off and allow things to catch up.

Honestly, just sorting out your memory issues will cause a massive improvement because the JVM won't be spending lots of time trying to find a few extra bytes.

You may also want to look at changing the linux scheduler from CFQ to noop. While CFQ is supposed to do the right thing, SSDs are not always detected correctly by the OS and quite often you can get better SSD throughput with the noop scheduler.
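
For example, something along these lines (the device name is a placeholder - check which block device backs your instance-store SSD):

cat /sys/block/xvdb/queue/scheduler                    # shows available schedulers, current one in brackets
echo noop | sudo tee /sys/block/xvdb/queue/scheduler   # switch to noop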

We haven't actually experimented with using an index for every day,

Definitely worth doing. Expiring documents means marking them as deleted and then doing a merge to remove the deleted documents. It's probably not as bad as it could be, because you're probably dropping whole segments, but an index per day will definitely be more efficient.

Hi @Clinton_Gormley, thanks so much for the detailed reply. It really helps to get this level of insight.
I'll configure things as you suggested and run the tests.

Just to clarify, the bulk import is being done using a NodeClient. I guess the split of bulk requests into shard-level bulk requests would then happen on the client side and be dispatched to the nodes holding the primary shards. Is that correct?

Hi @Clinton_Gormley,

Today, I ran tests on two setups:
setup-a: on one setup I removed all the index.merge.policy.* settings
setup-b: on the other setup I retained them as-is (see links in previous replies)

The observation was that performance is better in setup-b than in setup-a. In setup-a, CPU utilization went as high as 60-70%, and after some time, when merging kicked in, I started to see bulk request rejections. In setup-b, however, CPU utilization always remained < 10% from the start.

The difference is amusing. I hope you had a chance to see the conversation I had with Mike a year back; he had suggested these settings then. Could there be anything I'm missing? These results suggest that these settings really make a huge difference.