We are seeing considerable memory consumption on our master nodes.
ES Version: 6.5.4
The number of shards is close to 45k.
The number of data nodes is 9.
The memory on data nodes is 32GB, and the ES Heap size is 18GB.
I took a heap dump of the master node, and I am seeing a large number of SnapshotInfo objects, each holding SnapshotShardFailure objects. Approximately 50% of the heap is retained by these objects. The next 21% of the heap is retained by RepositoryData objects.
Class Name | Shallow Heap | Retained Heap | Percentage
--------------------------------------------------------------------------------------------------------------------
<Java Local> org.elasticsearch.repositories.RepositoryData @ 0x71f8b66c0| 40 | 54,57,86,072 | 21.97%
'- indexSnapshots java.util.Collections$UnmodifiableMap @ 0x72189fb10 | 0.00 MB | 514.42 MB | 21.72%
--------------------------------------------------------------------------------------------------------------------
The retained snapshot failure objects look like this:
java.lang.Thread @ 0x6df0fdfc0 elasticsearch[es-prod-analytics-master01][generic][T#7] Thread | 120 | 1,74,56,35,208 | 70.28%
  '- <Java Local> java.util.HashSet @ 0x71ee0b6c8 | 16 | 1,19,51,13,008 | 48.12%
    '- map java.util.HashMap @ 0x71ff423e8 | 48 | 1,19,51,12,992 | 48.12%
      '- table java.util.HashMap$Node[1024] @ 0x7387ac150 | 0.00 MB | 1,139.75 MB | 48.12%
        '- java.util.HashMap$Node @ 0x706da51b0 | 0.00 MB | 9.06 MB | 0.38%
          '- key org.elasticsearch.snapshots.SnapshotInfo @ 0x6ef29b4a0 | 0.00 MB | 9.06 MB | 0.38%
            '- shardFailures java.util.Collections$UnmodifiableRandomAccessList @ 0x6f1938a70 | 0.00 MB | 6.87 MB | 0.29%
              '- list,c java.util.ArrayList @ 0x6f1938a88 | 0.00 MB | 6.87 MB | 0.29%
                '- elementData java.lang.Object[2776] @ 0x6ff390928 | 0.01 MB | 6.87 MB | 0.29%
                  '- org.elasticsearch.snapshots.SnapshotShardFailure @ 0x6ec711bd8 | 0.00 MB | 0.00 MB | 0.00%
                    |- cause org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException @ 0x6eaf899f8 | 0.00 MB | 0.00 MB | 0.00%
                    | |- detailMessage java.lang.String @ 0x6ff019b58 IndexShardSnapshotFailedException[failed to list blobs]; nested: IOException[Exception when listing blobs by prefix [null]]; nested: SdkClientException[Unable to execute HTTP request: Connect to prod-backups.s3.us-west-2.amazonaws.com:443 [<COMPANY>-...|
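A note on reading the numbers: some retained-heap values in the dump are printed as raw byte counts with Indian-style digit grouping (for example 1,19,51,13,008), while others are already in MB. The following is a minimal sketch for converting the byte counts so the rows can be compared. The function name is my own, and it assumes "MB" in the dump means 1024 * 1024 bytes.

```python
def grouped_bytes_to_mb(text: str) -> float:
    """Convert a comma-grouped byte count to MB (assumed to be 1024 * 1024 bytes).

    The grouping style does not matter because all commas are stripped first.
    """
    return int(text.replace(",", "")) / (1024 * 1024)


# Values copied from the heap dump rows above.
samples = [
    ("RepositoryData", "54,57,86,072"),
    ("HashSet of SnapshotInfo", "1,19,51,13,008"),
]

for label, raw in samples:
    print(f"{label}: {grouped_bytes_to_mb(raw):.2f} MB")
```

The HashSet value comes out at about 1,139.75 MB, which agrees with the figure shown on the `table` row beneath it.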
The last snapshot that encountered failures was on 5 May 2022 (snapshot_220505080001), as shown below, so I don't understand how or why these objects are still alive in a heap dump taken in the last couple of days.
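id status start_epoch start_time end_epoch end_time duration indices successful_shards failed_shards total_shards
snapshot_220427040001 SUCCESS 1651032004 04:00:04 1651035732 05:02:12 1h 12190 14610 0 14610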
snapshot_220427160001 SUCCESS 1651075203 16:00:03 1651078640 16:57:20 57.2m 12216 14636 0 14636
snapshot_220428080001 SUCCESS 1651132804 08:00:04 1651135609 08:46:49 46.7m 12217 14637 0 14637
snapshot_220503080001 SUCCESS 1651564804 08:00:04 1651567738 08:48:58 48.9m 12261 14681 0 14681
snapshot_220504080001 SUCCESS 1651651204 08:00:04 1651654312 08:51:52 51.8m 12279 14699 0 14699
snapshot_220505080001 PARTIAL 1651737604 08:00:04 1651743791 09:43:11 1.7h 12301 13690 1031 14721
snapshot_220508080001 SUCCESS 1651996803 08:00:03 1652001056 09:10:56 1.1h 12330 14742 0 14742
snapshot_220509080001 SUCCESS 1652083203 08:00:03 1652086912 09:01:52 1h 12332 14752 0 14752
snapshot_220513080001 SUCCESS 1652428805 08:00:05 1652433041 09:10:41 1.1h 12484 14904 0 14904
snapshot_220516080001 SUCCESS 1652688004 08:00:04 1652692146 09:09:06 1.1h 12480 14900 0 14900
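To double-check which snapshots in this listing actually carry shard failures, here is a minimal sketch that filters the rows. It assumes the rows are `_cat/snapshots` output with 11 whitespace-separated columns, where the tenth is failed_shards and the eleventh is total_shards. The names `SnapshotRow`, `parse_rows` and `with_failures` are hypothetical helpers of mine, not part of any Elasticsearch client.

```python
from typing import List, NamedTuple


class SnapshotRow(NamedTuple):
    snapshot_id: str
    status: str
    failed_shards: int
    total_shards: int


def parse_rows(text: str) -> List[SnapshotRow]:
    """Parse whitespace-separated snapshot rows; skip headers and malformed lines."""
    rows = []
    for line in text.strip().splitlines():
        cols = line.split()
        # Assumed layout: id, status, start_epoch, start_time, end_epoch,
        # end_time, duration, indices, successful, failed, total.
        if len(cols) != 11 or not (cols[9].isdigit() and cols[10].isdigit()):
            continue
        rows.append(SnapshotRow(cols[0], cols[1], int(cols[9]), int(cols[10])))
    return rows


def with_failures(rows: List[SnapshotRow]) -> List[SnapshotRow]:
    """Keep only snapshots that report at least one failed shard."""
    return [row for row in rows if row.failed_shards > 0]


# Three rows copied from the listing above.
SAMPLE = """
snapshot_220504080001 SUCCESS 1651651204 08:00:04 1651654312 08:51:52 51.8m 12279 14699 0 14699
snapshot_220505080001 PARTIAL 1651737604 08:00:04 1651743791 09:43:11 1.7h 12301 13690 1031 14721
snapshot_220508080001 SUCCESS 1651996803 08:00:03 1652001056 09:10:56 1.1h 12330 14742 0 14742
"""

for row in with_failures(parse_rows(SAMPLE)):
    print(row.snapshot_id, row.status, f"{row.failed_shards}/{row.total_shards} shards failed")
```

On these rows only snapshot_220505080001 is reported, with 1031 failed shards out of 14721.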
How can I get rid of these SnapshotInfo and SnapshotShardFailure objects on the master's heap?