Hi Team,
Cluster Specification:
ES version: 6.2.2
Nodes: 3 (each node has master, data and ingest set to true)
Heap: 30 GB per node
RAM: 128 GB, 64 GB, 64 GB
Cores: 24
Disk available: approx. 200 GB
Indexes: 169
Replicas: 0
Size per index: approx. 50 GB (50 crore, i.e. 500 million, records)
Hot indexes: 40
Shards: 6
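To put the specification in perspective, here is a small sketch of the sizing arithmetic. It assumes that "Shards: 6" means 6 primary shards per index and that all 169 indexes are spread evenly over the 3 nodes; neither is stated explicitly in the post. Elastic's heap-sizing guidance is to give the JVM no more than 50% of physical RAM and to stay below the ~32 GB compressed-oops threshold, so the heap share per node is computed too.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    indices: int
    primaries_per_index: int  # assumption: "Shards: 6" means primaries per index
    replicas: int
    nodes: int
    index_size_gb: float


def shard_layout(spec: ClusterSpec):
    """Return (total shards, average shards per node, average shard size in GB)."""
    total = spec.indices * spec.primaries_per_index * (1 + spec.replicas)
    per_node = total / spec.nodes
    shard_size_gb = spec.index_size_gb / spec.primaries_per_index
    return total, per_node, shard_size_gb


def heap_fraction(heap_gb: float, ram_gb: float) -> float:
    """Share of physical RAM given to the JVM heap on one node."""
    return heap_gb / ram_gb


spec = ClusterSpec(indices=169, primaries_per_index=6, replicas=0,
                   nodes=3, index_size_gb=50.0)
total, per_node, shard_gb = shard_layout(spec)
print(f"{total} shards, ~{per_node:.0f} per node, ~{shard_gb:.1f} GB each")
for ram in (128, 64, 64):
    print(f"heap 30 GB on {ram} GB RAM -> {heap_fraction(30, ram):.0%} of RAM")
```

With the numbers from the post this prints 1014 shards, about 338 per node and about 8.3 GB per shard; the 30 GB heap is about 23% of RAM on the 128 GB node and about 47% on the two 64 GB nodes.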
Node settings:
indices.memory.index_buffer_size: 50%
thread_pool.index.size: 24
thread_pool.index.queue_size: 10000
thread_pool.bulk.size: 24
thread_pool.bulk.queue_size: 30000
thread_pool.search.size: 50
thread_pool.search.queue_size: 30000
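For comparison, a sketch that sets these values next to the thread pool defaults as documented for Elasticsearch 6.x (index and bulk: size = number of processors, queue 200; search: size = int((processors * 3) / 2) + 1, queue 1000). It assumes Elasticsearch sees 24 available processors, matching "Cores: 24"; the default figures should be double-checked against the 6.2 thread pool reference before drawing conclusions.

```python
def default_thread_pools(processors: int) -> dict:
    """Documented ES 6.x defaults for the pools configured in the post.

    Values are (size, queue_size). Verify against the 6.2 reference.
    """
    return {
        "index": (processors, 200),
        "bulk": (processors, 200),
        "search": (int((processors * 3) / 2) + 1, 1000),
    }


# The values from the "Node settings" section of the post.
configured = {
    "index": (24, 10000),
    "bulk": (24, 30000),
    "search": (50, 30000),
}

defaults = default_thread_pools(24)  # assumption: ES sees 24 processors
for pool, (size, queue) in configured.items():
    d_size, d_queue = defaults[pool]
    print(f"{pool:6} size {size} (default {d_size}), "
          f"queue {queue} (default {d_queue}, x{queue // d_queue})")
```

Seen this way, the search pool runs 50 threads instead of 37, and the queues are 30 to 150 times their defaults, so a busy node can hold tens of thousands of pending requests in heap rather than rejecting them.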
OS settings:
/etc/security/limits.conf
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
Requests:
Indexing: 5k-6k requests/sec (bulk requests with inserts, updates and partial updates, a few of them with script conditions; each batch is only a few KB; sent from 200 parallel processes)
Searching: 5k-6k read requests (search, count and aggregation; a few with script conditions)
Problem: One of the nodes in the ES cluster does not respond for a long time. Even a simple curl mynode1.com:9200 times out. After some time it starts responding again.
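A minimal sketch of the same check as the curl command, assuming Python 3 and only the standard library: it requests each node's HTTP port with an explicit timeout, so it records which node stopped answering and how long the attempt took. The helper names probe and probe_all, the 5-second default timeout and the decision to bypass HTTP proxies are illustrative choices, not anything from Elasticsearch.

```python
import time
import urllib.error
import urllib.request

# Bypass any HTTP proxy from the environment so each node is hit directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 5.0):
    """GET `url` once; return (responded, detail, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return True, resp.status, time.monotonic() - start
    except urllib.error.HTTPError as exc:
        # The node answered, just not with a 2xx status.
        return True, exc.code, time.monotonic() - start
    except OSError as exc:
        # Connection refused, DNS failure and socket timeouts all end up here
        # (URLError and socket.timeout are OSError subclasses).
        return False, repr(exc), time.monotonic() - start


def probe_all(hosts, port=9200, timeout=5.0):
    """Probe each host's HTTP port and print one line per node."""
    for host in hosts:
        ok, detail, elapsed = probe(f"http://{host}:{port}/", timeout)
        status = "OK" if ok else "NO RESPONSE"
        print(f"{host}: {status} ({detail}) in {elapsed:.1f}s")
```

Running probe_all(["mynode1.com", "mynode2.com"]) (plus the third node's host name, which the post does not give) in a loop while a heavy search is in flight would show whether only one node stops answering on port 9200.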
Observations:
- Whenever a heavy search query comes in, it starts blocking port 9200 on one of the nodes.
- According to the slow query log, the search query is a _search with size: 1000 and from: 500000, plus a few match parameters.
- Whenever this happens, my write/bulk requests become slow or fail with
{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE\/2\/no master];"}
although my other 2 nodes keep responding.
- Read requests take approx. 40+ seconds even for a simple search, and writes also take 40+ seconds.
- Once the search queries finish, the situation returns to normal.
- The other two servers get timeout exceptions. Logs:
[2019-10-07T07:00:32,408][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mynode1.com] failed to execute on node [uHJv1ylwTZaqDkKUUPjr0Q]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mynode2.com][202.162.235.111:9300][cluster:monitor/nodes/stats[n]] request_id [328205321] timed out after [15037ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:982) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
- I also checked the hardware. Nothing looks strange.
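The slow-log query from the observations (size: 1000, from: 500000) is a deep from/size page. Under the default query-then-fetch search, every shard the query touches ranks its own top from + size hits and sends them to the coordinating node, which merges all of them and keeps only size. The sketch below just does that arithmetic; it assumes "Shards: 6" means 6 primary shards per index, and the 40-index case is a hypothetical upper bound, since the post does not say how many indexes the query targets.

```python
def from_size_cost(num_shards: int, from_: int, size: int):
    """Work for one from/size search page under query-then-fetch.

    Each shard ranks its own top (from + size) hits and sends them to the
    coordinating node, which merges num_shards * (from + size) entries and
    keeps only `size` of them.
    """
    per_shard = from_ + size
    return per_shard, num_shards * per_shard


# One index with 6 primary shards (assumption: "Shards: 6" is per index).
print(from_size_cost(6, 500_000, 1_000))       # (501000, 3006000)
# Hypothetical: the same query spread over all 40 hot indexes (240 shards).
print(from_size_cost(40 * 6, 500_000, 1_000))  # (501000, 120240000)
```

So returning 1000 hits at offset 500000 from a single 6-shard index means each shard ranks 501,000 hits and one node merges about 3 million entries, all of which is held in that node's heap for the duration of the request.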
I can't find the reason for the socket blocking. Any insights or suggestions would be appreciated, guys.
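On the "no master" block: it means the node that rejected the bulk request had no elected master at that moment. In Elasticsearch 6.x, discovery.zen.minimum_master_nodes should be set to a quorum of master-eligible nodes, (N / 2) + 1 rounded down; the post does not show whether this setting is configured. A minimal sketch of that formula, assuming all three nodes are master-eligible as the cluster specification says:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum for discovery.zen.minimum_master_nodes: (N / 2) + 1, rounded down."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# All three nodes are master-eligible according to the cluster specification.
quorum = minimum_master_nodes(3)
print(f"minimum_master_nodes = {quorum}; "
      f"the cluster tolerates {3 - quorum} unresponsive master-eligible node(s)")
```

With three master-eligible nodes the quorum is 2, so the two responsive nodes can still elect a master between them when the third one hangs.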