High RAM Usage Across Cluster

Hello again, we are continuing to test out our new Elasticsearch cluster. I'm trying to understand the following:

  • Why is RAM usage always so high? It never decreases, even when nobody is searching or indexing. The only way to get RAM usage back down is to reboot the server the node is on (a service restart does not suffice).
  • We overwhelmed our cluster over the weekend and caused 4 data nodes to crash (the logs say the heap ran out). As a result, we have some unassigned shards. Is this something we need to fix, or does ES correct it on its own?
  • We are dividing our time-series data into monthly indices. If we don't provide a mapping, these indices are roughly 40–50 GB in size; with a mapping they can be as much as 40% smaller. What is an ideal index size, and would two shards per index be ideal?

I realize I'm asking for a lot of information. Any help is always appreciated.

Java heap size is about 40% of total RAM per node.
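For reference, the heap is configured in jvm.options; something along these lines (the sizes here are illustrative, not our exact values):

```
# /etc/elasticsearch/jvm.options (illustrative values)
# Heap at ~40% of total RAM; -Xms and -Xmx kept identical so the heap never resizes.
-Xms16g
-Xmx16g
```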

Here is the output of the _cat/nodes API.

ip              heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
xxx.xxx.xxx.xxx           67          99   0    0.00    0.02     0.05 r         -      odsts-coord2
xxx.xxx.xxx.xxx           46          99   0    0.01    0.03     0.05 dr        -      odsts-data2
xxx.xxx.xxx.xxx            6          44   0    0.00    0.02     0.05 dr        -      odsts-data10
xxx.xxx.xxx.xxx           27          99   0    0.00    0.01     0.05 dr        -      odsts-data1
xxx.xxx.xxx.xxx           60          99   0    0.07    0.03     0.05 ir        -      odsts-ingest1
xxx.xxx.xxx.xxx           12          30   0    0.00    0.02     0.05 mr        -      odsts-master3
xxx.xxx.xxx.xxx            8          61   0    0.00    0.01     0.05 ir        -      odsts-ingest4
xxx.xxx.xxx.xxx           11          30   0    0.00    0.01     0.05 mr        -      odsts-master2
xxx.xxx.xxx.xxx           49          32   0    0.00    0.01     0.05 r         -      odsts-coord1
xxx.xxx.xxx.xxx            7          61   0    0.00    0.01     0.05 ir        -      odsts-ingest3
xxx.xxx.xxx.xxx           31          99   0    0.00    0.01     0.05 dr        -      odsts-data3
xxx.xxx.xxx.xxx           35          99   0    0.00    0.01     0.05 dr        -      odsts-data4
xxx.xxx.xxx.xxx           38          46   0    0.00    0.01     0.05 ir        -      odsts-ingest2
xxx.xxx.xxx.xxx           24          99   0    0.01    0.03     0.05 dr        -      odsts-data9
xxx.xxx.xxx.xxx           17          30   0    0.02    0.02     0.05 mr        *      odsts-master1
xxx.xxx.xxx.xxx            7          44   0    0.01    0.03     0.05 dr        -      odsts-data6
xxx.xxx.xxx.xxx            7          44   0    0.00    0.01     0.05 dr        -      odsts-data8
xxx.xxx.xxx.xxx           43          99   0    0.02    0.02     0.05 dr        -      odsts-data5
xxx.xxx.xxx.xxx            4          99   0    0.00    0.01     0.05 dr        -      odsts-data7

Here is the output of _cluster/health:

{
  "cluster_name" : "odsts",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 19,
  "number_of_data_nodes" : 10,
  "active_primary_shards" : 59,
  "active_shards" : 123,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 3,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 97.61904761904762
}

Here is the output of _cat/indices. For the indices that start with ppYYYYMM we provided a mapping and saw a very large size reduction. The indices that start with haoYYYYMM use a dynamic mapping. The remaining indices can be ignored.

health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   hao201904                    4denZNJ6ToqvKDXQBjsrrw   2   1   85859781            0     32.4gb         21.4gb
green  open   hao201905                    jnitbogERT-KR96De-W0hw   2   1   76985816            0     40.8gb         20.5gb
green  open   hao201906                    _4cDT-RLR4uaXOTZdHTEig   2   1   80081591            0     36.7gb         18.3gb
green  open   hao201907                    sxks1YHFQVCD5GuPWiOOGg   2   1  104374341            0     44.2gb         21.9gb
green  open   hao201908                    y1vLJv9MRK-_GA7JK8-yTw   2   1   94266038            0     41.9gb         20.9gb
green  open   hao201909                    d1ZKXbBHQmKHvKBpcgzwyg   2   1   85904737            0     39.4gb         19.7gb
green  open   pp202006                     ht8QQGtZQmafZyZqVuCCYg   2   1  100279225            0     26.3gb         13.1gb
green  open   pp202005                     2wYMv3E6Rt27qIbztk0rKw   2   1  112902847            0     55.3gb         27.6gb
green  open   .opendistro_security         9czsdXZVTtqewMj9KadRGw   1   9          0            0        2kb           208b
green  open   hao202001                    gVhivL8wQdaZfiR-GQn9IA   2   1   96612513            0     51.3gb         25.6gb
yellow open   hao202002                    bUH2ei0bQQi_jIwDqv69IA   2   1   91103663            0     40.5gb         26.9gb
green  open   hao202003                    hkyoHpljQuObzwYLUTxrqw   2   1   99579191            0     58.8gb         29.4gb
green  open   hao202004                    xFoSm4SkSTSyvJ0LxOoNHA   2   1  113157011            0     63.9gb         31.9gb
green  open   hao202005                    ZYLkztXiQ0SKrsE08be02w   2   1  112902847            0     63.1gb         31.5gb
green  open   hao202006                    scIvWRBVTgORTChqtM1IJw   2   1  100279225            0     45.4gb         22.7gb
green  open   hao202007                    7BH8fFfiTMy9bs0IXhy0Lg   2   1  111608062            0     50.3gb         25.1gb
yellow open   hao201910                    CjCLQzqEQ3iwIwYRnhNVIQ   2   1   93440898            0     31.9gb         21.3gb
green  open   hao201911                    8Edeexc1Tamga6RaK1OYBA   2   1   89519843            0     41.9gb         20.9gb
green  open   hao201912                    o-UXOKmYSHi7r6_K4MqR7A   2   1   92361624            0       41gb         20.4gb
green  open   .kibana_1                    u32C8sxHRiKohRb1mA9DVg   1   1          0            0       416b           208b
green  open   pp202002                     _BWl0EmaSaOe22WqzCXVDA   2   1   91103663            0     30.2gb         15.1gb
green  open   proofpoint-201209            A2Yy2YwSRbmOP2UGSQyVaQ   2   1          0            0       832b           416b
green  open   proofpoint-201208            -F5VdXmtTz-NhY3vPDD8kg   2   1          0            0       832b           416b
green  open   proofpoint-201207            6ArJI3L5TveKkoMc0k73MA   2   1          0            0       832b           416b
green  open   proofpoint-201206            cvpXPVRYQGid_ZMJbUWqRw   2   1          0            0       832b           416b
green  open   proofpoint-201205            TGoFhd-qSmanvEcB2X68Jg   2   1          0            0       832b           416b
green  open   proofpoint-201204            w42iNOL3T0-zcegyPRJQXA   2   1     262183            0     68.4mb         34.2mb
green  open   hao201901                    rrLcsdidQ8S-emPUD5Za1g   2   1   84561811            0     43.7gb         21.8gb
green  open   hao201902                    PxPe14-aSrCQzBXsXFhsHA   2   1   75577153            0     39.2gb         19.6gb
green  open   security-auditlog-2020.08.21 ZRBtRRm7SH2LaMX1XcK_4A   1   1          0            0       416b           208b
green  open   hao201903                    a0bVZoOmQvm44M5MTw5KQQ   2   1   87647892            0     44.4gb         22.4gb

Looks like you are using Open Distro? If so, you might want to look into its monitoring metrics to identify whether it's query rates or something else.
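Even without Open Distro's monitoring plugin, the stock stats APIs give a rough picture of query and indexing rates; a sketch, assuming the cluster answers on localhost:9200:

```shell
# Per-node search and indexing counters (sample twice to derive a rate).
curl -s 'localhost:9200/_nodes/stats/indices/search,indexing?pretty'

# JVM heap breakdown per node, to see what is actually occupying memory.
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
```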

RAM usage includes memory used by the operating system page cache, so having this at or close to 100% is not a problem; it just means you have enough data in the cluster to fill the cache. If memory is needed by processes, the page cache will shrink and memory will be made available. Your heap usage and shard sizes look fine, so I see no problems with this cluster.
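You can confirm this on the node itself: most of the "used" RAM is reclaimable page cache. A quick check, assuming a Linux host:

```shell
# "available" is the number that matters, not "free";
# the buff/cache column is page cache the kernel reclaims on demand.
free -h

# Same figures straight from the kernel.
grep -E 'MemAvailable|^Cached' /proc/meminfo
```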

Thanks for the advice. Do you have any guidance on how to resolve unassigned shards? I noticed my cluster health is still yellow after some nodes crashed, which appears to be due to certain indices being yellow. How does a shard get assigned?

Why is the node crashing? What do the logs show?

We were running several bulk requests in parallel to see how much indexing the cluster could handle. I think it just ran out of heap space and gave up. I'm just trying to understand how to get the cluster back to green at this point.

java.lang.OutOfMemoryError: Java heap space
        at java.lang.Integer.valueOf(Integer.java:1065) ~[?:?]
        at sun.nio.ch.EPollSelectorImpl.processEvents(EPollSelectorImpl.java:194) ~[?:?]
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:137) ~[?:?]
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:129) ~[?:?]
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:146) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:803) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.49.Final.jar:4.1.49.Final]
        at java.lang.Thread.run(Thread.java:832) [?:?]
[2020-09-11T23:24:29,870][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [odsts-data7] fatal error in thread [elasticsearch[odsts-data7][write][T#14]], exiting
java.lang.OutOfMemoryError: Java heap space
[2020-09-11T23:24:39,939][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [odsts-data7] fatal error in thread [elasticsearch[odsts-data7][generic][T#28]], exiting
java.lang.OutOfMemoryError: Java heap space

You can check the _cat/recovery API to see what is happening to those last shards.
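Something like this, assuming the cluster is on localhost:9200 (the allocation explain API is also useful here, since it tells you directly why a shard is unassigned):

```shell
# Show recovery progress for shards that are still moving.
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'

# Ask the cluster why a shard is unassigned (picks an unassigned shard by default).
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
```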

What version are you running again?

@warkolm Open Distro with Elasticsearch 7.8. I didn't use _cat/recovery, but I was able to use the _cluster/reroute API to get the shards into an INITIALIZING state. Hopefully that's all that's needed.
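For anyone finding this later, the call was along these lines (assuming localhost:9200); retry_failed re-attempts shards that hit the allocation retry limit after repeated failures, such as the OOM crashes above:

```shell
# Retry allocation of shards that previously failed too many times.
curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'
```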