Elasticsearch (6.8) is causing server load

Hello everyone,

I'm having a strange issue with Elasticsearch. It's occasionally causing high server load on one of the three nodes in a cluster. Considering the number of CPUs on that node (2 physical, 16 logical) it's not much, but it certainly shouldn't be at 2 when the node is "idle"!

The cluster consists of the following nodes:
node1 (currently master) - VPS, 4 vCPUs, 4GB RAM (2GB allocated to heap), 100GB HDD
node2 - physical server, 2 physical CPUs (8 logical), 16GB RAM (4GB allocated to heap), data stored on a ZFS pool without compression (mounted on /var/lib/elasticsearch)
node3 - physical server, 2 physical CPUs (16 logical), 12GB RAM (4GB allocated to heap), data stored on a ZFS pool without compression (mounted on /var/lib/elasticsearch)

node3 is the one having the issue. As mentioned, it only happens occasionally, but the load there is always higher than on the other nodes. The setup is quite similar between the two physical machines, so I don't know why node3 is the one struggling, even though it has the better CPU (a Xeon E5620).
At this moment the load average is 1.49, 1.11, 1.34.
Again, it's not much considering the number of CPUs, but still...

The server isn't doing anything except data backups via rsync after midnight.
The last few lines of elasticsearch.log are:

[2019-05-23T00:12:05,452][INFO ][o.e.m.j.JvmGcMonitorService] [es-backup11] [gc][38642] overhead, spent [309ms] collecting in the last [1s]
[2019-05-23T00:22:07,460][INFO ][o.e.m.j.JvmGcMonitorService] [es-backup11] [gc][39242] overhead, spent [283ms] collecting in the last [1s]
[2019-05-23T01:05:52,126][WARN ][o.e.i.f.SyncedFlushService] [es-backup11] [postfix-2019.05.22][1] can't to issue sync id [KUSSD9WXRRObhNa7XOtNrw] for out of sync replica [[postfix-2019.05.22][1], node[xTQsIj8CQsajIjcWMVeY4A], [R], s[STARTED], a[id=2V2d3MBwSHqWQaXkyE8o4w]] with num docs [5222]; num docs on primary [5223]
[2019-05-23T02:47:55,109][WARN ][o.e.i.f.SyncedFlushService] [es-backup11] [postfix-2019.05.23][2] can't to issue sync id [0R-RCWcGTLGrk6V-6i-naw] for out of sync replica [[postfix-2019.05.23][2], node[LpcI-a41QSW2uOMgyz2hDA], [R], s[STARTED], a[id=rZWN-plaRKaTa3A8eHUfLA]] with num docs [33]; num docs on primary [35]
[2019-05-23T03:57:17,454][WARN ][o.e.i.f.SyncedFlushService] [es-backup11] [postfix-2019.05.23][2] can't to issue sync id [AniftNijTZG3q_bw7FlQLw] for out of sync replica [[postfix-2019.05.23][2], node[LpcI-a41QSW2uOMgyz2hDA], [R], s[STARTED], a[id=rZWN-plaRKaTa3A8eHUfLA]] with num docs [75]; num docs on primary [76]
[2019-05-23T06:55:52,416][WARN ][o.e.i.f.SyncedFlushService] [es-backup11] [postfix-2019.05.23][2] can't to issue sync id [FOzcrf94SriQqFnDAQkLhQ] for out of sync replica [[postfix-2019.05.23][2], node[LpcI-a41QSW2uOMgyz2hDA], [R], s[STARTED], a[id=rZWN-plaRKaTa3A8eHUfLA]] with num docs [276]; num docs on primary [278]

gc.log:
2019-05-23T13:41:49.899+0200: 87261.863: Total time for which application threads were stopped: 0.0135308 seconds, Stopping threads took: 0.0000932 seconds
2019-05-23T13:41:49.900+0200: 87261.864: Total time for which application threads were stopped: 0.0008304 seconds, Stopping threads took: 0.0001691 seconds
2019-05-23T13:42:03.292+0200: 87275.256: Total time for which application threads were stopped: 0.0010224 seconds, Stopping threads took: 0.0001263 seconds
2019-05-23T13:42:11.529+0200: 87283.493: Total time for which application threads were stopped: 0.0010310 seconds, Stopping threads took: 0.0001883 seconds
2019-05-23T13:42:16.549+0200: 87288.513: [GC (Allocation Failure) 2019-05-23T13:42:16.549+0200: 87288.513: [ParNew
Desired survivor size 56688640 bytes, new threshold 6 (max 6)
- age 1: 9271408 bytes, 9271408 total
- age 2: 1391440 bytes, 10662848 total
- age 3: 506176 bytes, 11169024 total
- age 4: 362848 bytes, 11531872 total
- age 5: 49032 bytes, 11580904 total
- age 6: 11864 bytes, 11592768 total
: 903442K->16266K(996800K), 0.0164781 secs] 2304439K->1417360K(4083584K), 0.0167310 secs] [Times: user=0.17 sys=0.01, real=0.02 secs]

There are 406 shards in the cluster, across 91 indices // 11GB of data.
Any tuning suggestions are welcome, and if there's any additional info I need to provide, let me know.
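
For reference, these totals can be checked with the standard cluster/cat endpoints:

GET _cluster/health                    # <-- cluster status and active shard counts
GET _cat/indices?v&s=store.size:desc   # <-- per-index shard count and size, largest first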

Thanks in advance.

Maybe you should take a look at how your shards are distributed across your indices and nodes?
Are your shards spread correctly?
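
For example, the cat APIs will show how shards and disk usage are balanced across the nodes:

GET _cat/allocation?v   # <-- shard count and disk used per node
GET _cat/shards?v       # <-- every shard and the node it is allocated to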

Well, I'm not even sure how to check that.
But node1 and node3 (the one having the load issue) have 135 shards each.
node2 has 136.
I've set 3 shards per index in the templates, and indices are created on a daily basis.
There are still some leftover indices that have 1 shard, but I should re-index that data soon, when I merge them into monthly indices.

How many shards should I have? One per node, or more? The indices are not big; they range from 3 to 100 MB per day, depending on what data they store.
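
If the answer is one primary shard per index, I assume the template change would look something like this (template name and index pattern are just placeholders based on my postfix indices):

PUT _template/postfix
{
  "index_patterns": ["postfix-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

That would only affect newly created daily indices, not the existing ones.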

Do you know how many shards you are actively writing to?
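
Since they are daily indices, one rough way to see that is to list the shards of today's indices (date pattern taken from the log above):

GET _cat/shards/*-2019.05.23?v   # <-- shards of today's indices, i.e. the ones currently receiving writes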

To find out what a node is busy doing, I would start by looking at the nodes hot threads API:

GET _nodes/hot_threads                # <-- just the top three threads on each node
GET _nodes/hot_threads?threads=999999 # <-- all of them
