Hello,
Happy new year.
It seems our Spark Streaming job was indexing slowly into Elasticsearch. I checked the master and found these errors in the logs:
[...] only in the morning:
[2021-01-04T01:59:56,349][WARN ][o.e.x.m.e.l.LocalExporter] [opu309_master_9] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.common.ValidationException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [36000]/[36000] maximum shards open
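If I understand the default cluster.max_shards_per_node of 1000 correctly, our 36 data nodes would explain the 36000 cap. This is how I checked the effective value (localhost:9200 is just a placeholder for one of our coordinating nodes):
# placeholder host; shows the configured/default cluster.max_shards_per_node
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node"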
[...] many circuit breaker exceptions:
[2021-01-04T14:29:18,108][WARN ][o.e.c.r.a.AllocationService] [opu309_master_9] failing shard [failed shard, shard [.monitoring-es-7-2021.01.04][0], node[LhnkRvT_SHO2cTd-0PCaxw], [R], s[STARTED], a[id=2Pm8xcIqQgawHEgKdWRD2g], message [failed to perform indices:data/write/bulk[s] on replica [.monitoring-es-7-2021.01.04][0], node[LhnkRvT_SHO2cTd-0PCaxw], [R], s[STARTED], a[id=2Pm8xcIqQgawHEgKdWRD2g]], failure [RemoteTransportException[[opvuc3505_master_90][10.100.229.138:9390][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [15323042428/14.2gb], which is larger than the limit of [15300820992/14.2gb], real usage: [15323036912/14.2gb], new bytes reserved: [5516/5.3kb], usages [request=0/0b, fielddata=11911397/11.3mb, in_flight_requests=11468/11.1kb, accounting=487538592/464.9mb]]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [opvuc3505_master_90][10.100.229.138:9390][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [15323042428/14.2gb], which is larger than the limit of [15300820992/14.2gb], real usage: [15323036912/14.2gb], new bytes reserved: [5516/5.3kb], usages [request=0/0b, fielddata=11911397/11.3mb, in_flight_requests=11468/11.1kb, accounting=487538592/464.9mb]
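In case it is useful, this is the request I use to watch the parent breaker on each node (placeholder host again):
# placeholder host; parent breaker limit and estimated usage per node
curl -s "http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent"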
[...] some full GCs:
[2021-01-04T08:57:15,848][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][young][1193986][278302] duration [770ms], collections [1]/[1.6s], total [770ms]/[9.4h], memory [9.1gb]->[9.3gb]/[15gb], all_pools {[young] [216mb]->[0b]/[0b]}{[old] [8.9gb]->[9.1gb]/[15gb]}{[survivor] [28mb]->[160mb]/[0b]}
[...] many GC overhead messages:
[2021-01-04T16:10:18,408][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1219901] overhead, spent [257ms] collecting in the last [1s]
[2021-01-04T16:18:27,412][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1220389] overhead, spent [287ms] collecting in the last [1s]
[2021-01-04T16:19:26,620][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1220448] overhead, spent [267ms] collecting in the last [1s]
[2021-01-04T16:21:26,937][INFO ][o.e.m.j.JvmGcMonitorService] [opvu370_master_0] [gc][1220568] overhead, spent [292ms] collecting in the last [1s]
[2021-01-04T16:23:07,209][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1220668] overhead, spent [267ms] collecting in the last [1s]
[2021-01-04T16:26:18,337][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1220859] overhead, spent [260ms] collecting in the last [1s]
[2021-01-04T16:26:27,350][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1220868] overhead, spent [302ms] collecting in the last [1s]
[2021-01-04T16:26:57,398][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1220898] overhead, spent [289ms] collecting in the last [1s]
[2021-01-04T16:29:57,163][INFO ][o.e.m.j.JvmGcMonitorService] [opu309_master_9] [gc][1221077] overhead, spent [307ms] collecting in the last [1s]
[...]
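To follow heap pressure while these GC messages appear, I run something like this (placeholder host):
# placeholder host; heap usage and load per node
curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,load_1m"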
JVM config of the masters (3 instances):
-Xms15g
-Xmx15g
-server
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapWastePercent=15
-XX:ParallelGCThreads=5
-XX:ConcGCThreads=3
-XX:+AlwaysPreTouch
-XX:MaxDirectMemorySize=7g
JVM config of the data nodes (36 instances) and coordinating nodes (2 instances):
-Xms32g
-Xmx32g
-server
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapWastePercent=15
-XX:ParallelGCThreads=5
-XX:ConcGCThreads=3
-XX:+AlwaysPreTouch
-XX:MaxDirectMemorySize=16g
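Since the data nodes run with -Xmx32g, I also wanted to confirm the heap actually in use and whether compressed oops are still enabled (placeholder host; adjust to your cluster):
# placeholder host; reports heap_max_in_bytes and compressed-oops status per node
curl -s "http://localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_max_in_bytes,nodes.*.jvm.using_compressed_ordinary_object_pointers"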
elasticsearch.yml config (the "special" settings):
network.host: ["_eth1:ipv4_", _local_]
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true
indices.memory.index_buffer_size: 15%
processors: 10
thread_pool:
  search:
    size: 13
    queue_size: 1000
  write:
    size: 9
    queue_size: 1000
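To see whether these write/search pools are actually rejecting requests, I check (placeholder host):
# placeholder host; thread pool sizes, queue depth and rejections
curl -s "http://localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,size,queue,rejected"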
Shard counts:
"active_primary_shards" : 17913
"active_shards" : 36000
Do you think I have a big problem on my side?
Architecture: 10 physical machines (32 CPUs, 280 GB RAM each), each running multiple instances (four JVM instances per host).
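For reference, this is how I list the co-located instances and their roles (placeholder host):
# placeholder host; one line per JVM instance with its host IP, roles and heap
curl -s "http://localhost:9200/_cat/nodes?v&h=name,ip,node.role,master,heap.max"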
Regards,
Beuh