ES crashing with OutOfMemory exception while reading Translog of 450 GB

We have a cluster with a large number of shards, each with 2 replicas, across multiple indices. We recently did a rolling upgrade from ES 5.6.10 to 6.2.4.
A couple of days back, the cluster went Yellow as node es-d39-rm was trying to relocate shard [39] to some other node. This operation failed with logs saying -

[es-d39-rm] [af9ea1c9-f42f-4a99-aa68-06194e9884b8_0][39] timed out waiting for relocation hand-off to complete

After some time, ES on the other 2 nodes that held the primary (es-d53-rm) and 1 replica (es-d58-rm) crashed with this exception -

[es-d53-rm] fatal error in thread [elasticsearch[es-d53-rm][generic][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:145) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:120) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Index.<init>(Translog.java:954) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Index.<init>(Translog.java:931) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Operation.readOperation(Translog.java:883) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1432) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.BaseTranslogReader.read(BaseTranslogReader.java:103) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:73) ~[elasticsearch-6.2.4.jar:6.2.4]
...
org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.2.4.jar:6.2.4]
... more

On investigation, we found -

After the relocation hand-off on es-d39-rm failed, es-d39-rm went to node es-d58-rm for recovery, and ES on es-d58-rm then crashed with the OutOfMemory exception.
Then es-d39-rm went to es-d53-rm for recovery and the same thing happened again. Now no copy of shard [39] is active and the cluster is Red. The logs on es-d39-rm say that recovery from es-d58-rm (or es-d53-rm) failed -

[es-d39-rm] [[af9ea1c9-f42f-4a99-aa68-06194e9884b8_0][39]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [af9ea1c9-f42f-4a99-aa68-06194e9884b8_0][39]: Recovery failed from {es-d53-rm}{J42mCdMgTEuGwxiL2R-Z0Q}{BkIcPa1fQcSK9NMVIXvyxQ}{192.168.0.203}{192.168.0.203:9300}{faultDomain=2, updateDomain=14} into {es-d39-rm}{aAghuvd7TIyJ9gMK5E2NeQ}{hbGSBowqQ3urleUA4l5kOw}{192.168.0.189}{192.168.0.189:9300}{faultDomain=0, updateDomain=0} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:286) [elasticsearch-6.2.4.jar:6.2.4]
... more
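
While the cluster is Red, the allocation status of shard [39] can be inspected with the allocation explain API (a minimal request for the shard above, using the standard 6.x request body; output omitted) -

GET _cluster/allocation/explain
{
  "index": "af9ea1c9-f42f-4a99-aa68-06194e9884b8_0",
  "shard": 39,
  "primary": true
}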

es-d53-rm took 12+ hours to bring shard [39] from Initializing to Started state. On further investigation -

  1. Translog size for shard [39] on the 3 nodes (a stats query to cross-check these figures is shown after the recovery output below) -
    es-d53-rm = 450 GB (primary)
    es-d58-rm = 360 GB
    es-d39-rm = 360 GB

  2. The OutOfMemory exception trace shows that ES crashed while reading the translog, and this happened while es-d39-rm was trying to recover shard [39].

  3. During recovery, high GC activity was also observed. We are using G1GC.
    [screenshot: GC activity during recovery]

  4. There was no active indexing happening on this index; we stopped indexing on it around 4 days back. However, indexing was in progress on other indices on these nodes.

  5. When the primary shard came back to Started, this is what I noticed in the recovery stats (pasting the result only for the primary shard) -

GET _recovery?filter_path=af9ea1c9-f42f-4a99-aa68-06194e9884b8_0

{
  "id": 39,
  "type": "EXISTING_STORE",
  "stage": "DONE",
  "primary": true,
  "start_time_in_millis": 1545681139643,
  "stop_time_in_millis": 1545714750079,
  "total_time_in_millis": 33609069,
  "source": {},
  "target": {
    "id": "J42mCdMgTEuGwxiL2R-Z0Q",
    "host": "192.168.0.203",
    "transport_address": "192.168.0.203:9300",
    "ip": "192.168.0.203",
    "name": "es-d53-rm"
  },
  "index": {
    "size": {
      "total_in_bytes": 45691573451,
      "reused_in_bytes": 45691573451,
      "recovered_in_bytes": 0,
      "percent": "100.0%"
    },
    "files": {
      "total": 169,
      "reused": 169,
      "recovered": 0,
      "percent": "100.0%"
    },
    "total_time_in_millis": 22,
    "source_throttle_time_in_millis": 0,
    "target_throttle_time_in_millis": 0
  },
  "translog": {
    "recovered": 8611799,
    "total": 10701365,
    "percent": "80.5%",
    "total_on_start": 10701365,
    "total_time_in_millis": 33608958
  },
  "verify_index": {
    "check_index_time_in_millis": 0,
    "total_time_in_millis": 0
  }
}
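
The per-shard translog sizes listed in [1] can be cross-checked with the index stats API; the translog metric and the level=shards parameter are available in 6.x, and the numbers of interest (operations and size_in_bytes) appear under indices.<index>.shards.<shard_id> in the response -

GET af9ea1c9-f42f-4a99-aa68-06194e9884b8_0/_stats/translog?level=shards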

After the cluster became Red, the primary shard on es-d53-rm went into Initializing. We waited for it to recover, and it was very slow. As it was going to take hours, our assumption was that the translog would be much smaller after recovery and that the replicas would then, probably, also be able to recover. Unfortunately, this is now happening in a loop and ES has crashed a 3rd time: the primary goes to Started, a replica tries to recover, and ES crashes.

Note that the shard size is ~42 GB, yet it has 450 GB of translog. Our machines have 16 cores and 56 GB of memory, and half of the memory is allocated to the heap.
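
For reference, the heap actually given to each data node can be confirmed with the cat nodes API (these are standard 6.x columns) -

GET _cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.max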

In an attempt to make the cluster Green, one idea was to set the replica count to 0 and let the primary become Started. Assuming the translog is kept by the primary to help replicas recover, it should no longer be retained once the replica count is 0. Once the translog gets cleared, we would set the replica count back to 2 (a sketch of these steps is included after the concerns below). But -

  1. There is a risk of the cluster going back to Red if a node goes down while the replica count is 0.
  2. Due to [5] above, we are not sure the translog would actually be cleared after the primary shard becomes Started.
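
For completeness, the sequence we are considering is sketched below. This is only a sketch: the index is the affected one from above, the retention values are illustrative, and we are assuming the dynamic index.translog.retention.* settings and the flush API behave as described in the 6.x docs (i.e. after a flush, translog generations beyond the retention policy can be trimmed).

Step 1 - drop replicas and tighten translog retention:

PUT af9ea1c9-f42f-4a99-aa68-06194e9884b8_0/_settings
{
  "index.number_of_replicas": 0,
  "index.translog.retention.size": "512mb",
  "index.translog.retention.age": "12h"
}

Step 2 - flush so that retained translog generations can be trimmed:

POST af9ea1c9-f42f-4a99-aa68-06194e9884b8_0/_flush

Step 3 - once the primary is Started and the translog size has dropped, restore replicas:

PUT af9ea1c9-f42f-4a99-aa68-06194e9884b8_0/_settings
{
  "index.number_of_replicas": 2
}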

Cluster Health while Primary shard is Initializing -

{
  "cluster_name": "****",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 38,
  "number_of_data_nodes": 27,
  "active_primary_shards": 978,
  "active_shards": 2933,
  "relocating_shards": 0,
  "initializing_shards": 1,
  "unassigned_shards": 2,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.89782016348773
}

Please help with how we can get the cluster back to Green, and how we can prevent the translog from growing so huge in the first place.
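
On the prevention side, these are the translog-related index settings we plan to review for this index (the defaults listed are what we understand them to be for 6.2, so please correct us if this is off) -

GET af9ea1c9-f42f-4a99-aa68-06194e9884b8_0/_settings

  index.translog.flush_threshold_size (default 512mb) - flush once the uncommitted translog reaches this size
  index.translog.retention.size (default 512mb) - amount of translog retained for operation-based peer recovery
  index.translog.retention.age (default 12h) - maximum age of retained translog
  index.translog.durability (default request) - whether the translog is fsynced on every request or asynchronously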

Hello,

In order to make your cluster green, use:

PUT index_name/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
