Huge memory usage because of SnapshotShardFailure error objects on master node

We are seeing considerable memory consumption on our master nodes.

ES Version: 6.5.4
The number of shards is close to 45k.
The number of data nodes is 9.

The memory on data nodes is 32GB, and the ES Heap size is 18GB.

I took a heap dump of the master node, and I see a large number of SnapshotInfo objects, each holding SnapshotShardFailure objects. Approximately 50% of the memory is retained by these objects. A further 21% of the memory is used by "RepositoryData" objects.


Class Name                                                              | Shallow Heap | Retained Heap | Percentage
--------------------------------------------------------------------------------------------------------------------
<Java Local> org.elasticsearch.repositories.RepositoryData @ 0x71f8b66c0|           40 |  54,57,86,072 |     21.97%
'- indexSnapshots java.util.Collections$UnmodifiableMap @ 0x72189fb10   |      0.00 MB |     514.42 MB |     21.72%
--------------------------------------------------------------------------------------------------------------------

The failure objects look like this in the heap dump.

java.lang.Thread @ 0x6df0fdfc0  elasticsearch[es-prod-analytics-master01][generic][T#7] Thread                        |          120 | 1,74,56,35,208 |     70.28%
'- <Java Local> java.util.HashSet @ 0x71ee0b6c8                                                                       |           16 | 1,19,51,13,008 |     48.12%
   '- map java.util.HashMap @ 0x71ff423e8                                                                             |           48 | 1,19,51,12,992 |     48.12%
      '- table java.util.HashMap$Node[1024] @ 0x7387ac150                                                             |      0.00 MB |    1,139.75 MB |     48.12%
         '- java.util.HashMap$Node @ 0x706da51b0                                                                      |      0.00 MB |        9.06 MB |      0.38%
            '- key org.elasticsearch.snapshots.SnapshotInfo @ 0x6ef29b4a0                                             |      0.00 MB |        9.06 MB |      0.38%
               '- shardFailures java.util.Collections$UnmodifiableRandomAccessList @ 0x6f1938a70                      |      0.00 MB |        6.87 MB |      0.29%
                  '- list,c java.util.ArrayList @ 0x6f1938a88                                                         |      0.00 MB |        6.87 MB |      0.29%
                     '- elementData java.lang.Object[2776] @ 0x6ff390928                                              |      0.01 MB |        6.87 MB |      0.29%
                        '- org.elasticsearch.snapshots.SnapshotShardFailure @ 0x6ec711bd8                             |      0.00 MB |        0.00 MB |      0.00%
                           |- cause org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException @ 0x6eaf899f8 |      0.00 MB |        0.00 MB |      0.00%
                           |  |- detailMessage java.lang.String @ 0x6ff019b58  IndexShardSnapshotFailedException[failed to list blobs]; nested: IOException[Exception when listing blobs by prefix [null]]; nested: SdkClientException[Unable to execute HTTP request: Connect to prod-backups.s3.us-west-2.amazonaws.com:443 [<COMPANY>-...|

The last snapshot that encountered failures was taken on 5 May 2022 (snapshot_220505080001), as shown below, so I don't understand why these objects are still alive in a heap dump taken in the last couple of days.

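id                    status  start_epoch start_time end_epoch  end_time duration indices successful_shards failed_shards total_shards
snapshot_220427040001 SUCCESS 1651032004  04:00:04   1651035732 05:02:12       1h   12190             14610             0        14610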
snapshot_220427160001 SUCCESS 1651075203  16:00:03   1651078640 16:57:20    57.2m   12216             14636             0        14636
snapshot_220428080001 SUCCESS 1651132804  08:00:04   1651135609 08:46:49    46.7m   12217             14637             0        14637
snapshot_220503080001 SUCCESS 1651564804  08:00:04   1651567738 08:48:58    48.9m   12261             14681             0        14681
snapshot_220504080001 SUCCESS 1651651204  08:00:04   1651654312 08:51:52    51.8m   12279             14699             0        14699
snapshot_220505080001 PARTIAL 1651737604  08:00:04   1651743791 09:43:11     1.7h   12301             13690          1031        14721
snapshot_220508080001 SUCCESS 1651996803  08:00:03   1652001056 09:10:56     1.1h   12330             14742             0        14742
snapshot_220509080001 SUCCESS 1652083203  08:00:03   1652086912 09:01:52       1h   12332             14752             0        14752
snapshot_220513080001 SUCCESS 1652428805  08:00:05   1652433041 09:10:41     1.1h   12484             14904             0        14904
snapshot_220516080001 SUCCESS 1652688004  08:00:04   1652692146 09:09:06     1.1h   12480             14900             0        14900
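
For reference, the snapshots that recorded shard failures can be picked out of a listing like this mechanically. The following is only a minimal sketch, assuming the rows follow the `_cat/snapshots` column order (snapshot id first, status second, failed shard count in the tenth column); the function name `snapshots_with_failures` and the `CAT_OUTPUT` sample are hypothetical and just reuse three rows from the listing.

```python
# Three rows copied from the listing above. The column order is an
# assumption: it is taken to match `_cat/snapshots` output.
CAT_OUTPUT = """\
snapshot_220504080001 SUCCESS 1651651204  08:00:04   1651654312 08:51:52    51.8m   12279             14699             0        14699
snapshot_220505080001 PARTIAL 1651737604  08:00:04   1651743791 09:43:11     1.7h   12301             13690          1031        14721
snapshot_220508080001 SUCCESS 1651996803  08:00:03   1652001056 09:10:56     1.1h   12330             14742             0        14742
"""


def snapshots_with_failures(cat_output):
    """Return (snapshot id, status, failed shard count) for rows with failures."""
    failing = []
    for line in cat_output.splitlines():
        fields = line.split()
        # Skip blank lines, header rows, and anything that is not an 11-column row.
        if len(fields) != 11 or not fields[9].isdigit():
            continue
        failed = int(fields[9])
        if failed > 0:
            failing.append((fields[0], fields[1], failed))
    return failing


if __name__ == "__main__":
    for snapshot_id, status, failed in snapshots_with_failures(CAT_OUTPUT):
        print(snapshot_id, status, failed)
```

This only reports which snapshots carry failures; it does not change anything in the repository or on the master node.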

How can I get rid of these?

What version are you using? Please always include this detail when asking questions.

Updated in the original question.
The ES version is 6.5.4.

OK, 6.5 is very, very old; you're missing out on literally years of new development. I believe this is no longer a problem in recent versions.

I understand. We will be upgrading in the next six months, but for now we need to put out some fires here.
Is there any insight that could help with this issue? I would appreciate it.

I don't have anything else to suggest, sorry. This version was released in 2018 and became unsupported over two years ago; I don't even have a development environment that can open it any more.
