Hello,
I am seeing a lot of worrying output from the hot threads API and would like help interpreting it.
Overview
Version: 5.5.1
Uptime: 13 days
Nodes: 60
Disk Available: 247TB / 413TB (59.74%)
JVM Heap: 68.38% (1TB / 2TB)
Indices: 2,857
Documents: 24,414,223,409
Disk Usage: 154TB
Primary Shards: 24,039
Replica Shards: 25,139
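As a back-of-the-envelope check on shard density (just arithmetic on the overview figures above, not output from any API):

```python
# Rough shard-density check using the cluster overview figures above.
primaries = 24_039
replicas = 25_139
nodes = 60

total_shards = primaries + replicas      # 49,178 shards overall
shards_per_node = total_shards / nodes   # ~820 shards per node on average
print(total_shards, round(shards_per_node))  # → 49178 820
```

That is a fairly high shard count per node, which may be related to the load I am seeing.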
The master node's log also shows many org.elasticsearch.xpack.monitoring.exporter.ExportException errors. My cluster seems busy, but I don't understand why. Here is the hot threads output:
::: {opbdf1019_data_02}{050boVO5RsaA6oVA4psUzA}{fjEK3eFjQ_yJ-CK4fAOUmA}{10.79.18.163}{10.79.18.163:9302}{rack_id=BB_Prod05}
Hot threads at 2020-10-14T16:09:35.356Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
4.4% (21.9ms out of 500ms) cpu usage by thread 'elasticsearch[opbdf1019_data_02][bulk][T#27]'
4/10 snapshots sharing following 28 elements
org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373)
org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93)
org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66)
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:277)
org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:529)
org.elasticsearch.index.shard.IndexShard.prepareIndexOnReplica(IndexShard.java:518)
org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
3/10 snapshots sharing following 33 elements
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447)
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403)
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1571)
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1316)
org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:663)
org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:607)
org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:505)
org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:556)
org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:545)
org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnReplica(Transpor
org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
3/10 snapshots sharing following 21 elements
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:376)
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:69)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467)
org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceivedDecorate(SearchGuardSSLRequestHandler.java:178)
com.floragunn.searchguard.transport.SearchGuardRequestHandler.messageReceivedDecorate(SearchGuardRequestHandler.java:192)
com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceived(SearchGuardSSLRequestHandler.java:140)
com.floragunn.searchguard.SearchGuardPlugin$3$1.messageReceived(SearchGuardPlugin.java:376)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
2.9% (14.6ms out of 500ms) cpu usage by thread 'elasticsearch[opbdf1019_data_02][[z_app_2ip_es_socle_cdr-20201014][13]: Lucene Merge Thread #875]'
10/10 snapshots sharing following 17 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:72)
org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:278)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:620)
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:200)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:89)
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931)
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
::: {opbdf1717_data_04}{Eh_JC8TfQgyfiXaxz0jQzg}{oBDFBUqeS3aAECc6Tr-7Ag}{10.79.20.17}{10.79.20.17:9304}{rack_id=BB_Prod09}
Hot threads at 2020-10-14T16:09:35.355Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {opbdf1219_data_03}{cuwbNmH6TYu_Drc3ZwgoDg}{3cg6CfTHR3q7JFpBrqhFrQ}{10.79.18.125}{10.79.18.125:9303}{rack_id=BB_Prod06}
Hot threads at 2020-10-14T16:09:35.362Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
96.8% (483.9ms out of 500ms) cpu usage by thread 'elasticsearch[opbdf1219_data_03][management][T#5]'
10/10 snapshots sharing following 17 elements
::: {opbdf1118_master-adm_90}{VYCNOoLwQBK4r-I6OKftaw}{DxVf4OezRauzYFwGq-zl8g}{10.79.18.241}{10.79.18.241:9390}{rack_id=BB_Prod06}
Hot threads at 2020-10-14T16:09:35.354Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {opbdf0416_data_04}{1XHD-l0FTsiRR94FMiWalQ}{5DLi5iXRTFySy0ZnWcUu_w}{10.79.18.106}{10.79.18.106:9304}{rack_id=BB_Prod02}
Hot threads at 2020-10-14T16:09:35.356Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {opbdf1620_master_90}{-Ye9TCK6QFm7PZz-NiVhEQ}{Bhfu3rLERmqdJjyrqlBUpw}{10.79.18.238}{10.79.18.238:9390}{rack_id=BB_Prod08}
Hot threads at 2020-10-14T16:09:35.355Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {opbdf1316_coord_00}{RgLcdr5KSZymIrs0WM_VCQ}{dnXCgAzRTNuKSJqQltNVWQ}{10.79.18.234}{10.79.18.234:9300}{rack_id=BB_Prod07}
Hot threads at 2020-10-14T16:09:35.358Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {opbdf1820_coord_00}{LWVbOumgT4quBdBbLmvaTA}{J03be4cKQXCZaS0pJNavQQ}{10.79.20.38}{10.79.20.38:9300}{rack_id=BB_Prod09}
Hot threads at 2020-10-14T16:09:35.356Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {opbdf1716_data_02}{UT1F0DLvQw68M2gwVYsXmw}{4Srl_Q25QLqM16wGss58Sg}{10.79.20.16}{10.79.20.16:9302}{rack_id=BB_Prod09}
Hot threads at 2020-10-14T16:09:35.356Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
1.1% (5.7ms out of 500ms) cpu usage by thread 'elasticsearch[opbdf1716_data_02][[z_app_2ip_es_socle_cdr-20201014][35]: Lucene Merge Thread #640]'
10/10 snapshots sharing following 7 elements
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:200)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:89)
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931)
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
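In case it helps, this is the small helper I used to rank the busiest threads from the pasted output above (a hypothetical script of my own, not part of Elasticsearch; it just regex-matches the "N% (...) cpu usage by thread '...'" header lines):

```python
import re

# Matches hot-threads header lines such as:
#   4.4% (21.9ms out of 500ms) cpu usage by thread 'elasticsearch[node][bulk][T#27]'
_HEADER = re.compile(r"(\d+(?:\.\d+)?)% \([^)]*\) cpu usage by thread '([^']+)'")

def busiest_threads(hot_threads_text):
    """Return (cpu_percent, thread_name) pairs, busiest first."""
    pairs = [(float(m.group(1)), m.group(2)) for m in _HEADER.finditer(hot_threads_text)]
    return sorted(pairs, reverse=True)

sample = """
4.4% (21.9ms out of 500ms) cpu usage by thread 'elasticsearch[opbdf1019_data_02][bulk][T#27]'
96.8% (483.9ms out of 500ms) cpu usage by thread 'elasticsearch[opbdf1219_data_03][management][T#5]'
"""
print(busiest_threads(sample)[0])  # → (96.8, "elasticsearch[opbdf1219_data_03][management][T#5]")
```

Ranked this way, the management thread on opbdf1219_data_03 at 96.8% stands out far above everything else.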