Elasticsearch socket block OR timeout

Hi Team,

Cluster Specification:

ES version - 6.2.2
Nodes : 3 (true => master, data, ingest)
Heap : 30 GB per node
RAM : 128 GB , 64  GB, 64 GB
Core: 24
Disk Available : approx 200 GB
indexes - 169 
Replica - 0
Per index size - 50 GB approx (50 crore, i.e. 500 million, records)
Hot indexes - 40
shards - 6

Node settings:

indices.memory.index_buffer_size: 50%

thread_pool.index.size : 24
thread_pool.index.queue_size : 10000
thread_pool.bulk.size: 24
thread_pool.bulk.queue_size: 30000
thread_pool.search.size: 50
thread_pool.search.queue_size: 30000

OS settings

/etc/security/limits.conf
elasticsearch soft  nproc 4096
elasticsearch hard  nproc 4096
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

Requests:

Indexing : 5k - 6k / sec (bulk => inserts / updates / partial updates, a few requests with script conditions; batches of only a few KB each, sent from 200 parallel processes)
Searching : 5k - 6k read requests (search / count / aggregation, a few with script conditions)

Problem : One of the nodes in the ES cluster stops responding for a long time. A simple curl mynode1.com:9200 times out. After some time it starts responding again.

Observations :

  1. Whenever heavy search queries come in, port 9200 on one of the nodes starts blocking.

  2. According to the slow query log, the search query is a _search with size:1000 and from:500000 plus a few match parameters.

  3. Whenever this situation occurs, my write / bulk queries become slow or fail with {"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE\/2\/no master];"}, although my other 2 nodes keep responding.

  4. A simple search takes approx. 40+ seconds, and a write also takes 40+ seconds.

  5. Once the search queries are done, the situation returns to normal.

  6. The other two servers get timeout exceptions. Logs:

    [2019-10-07T07:00:32,408][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mynode1.com] failed to execute on node [uHJv1ylwTZaqDkKUUPjr0Q]
    org.elasticsearch.transport.ReceiveTimeoutTransportException: [mynode2.com][202.162.235.111:9300][cluster:monitor/nodes/stats[n]] request_id [328205321] timed out after [15037ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:982) [elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.2.jar:6.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

  7. I also checked the hardware. Nothing looks strange.

I can't find any reason for the socket block. Any insights or suggestions would be appreciated, guys.

Hey,

it seems you have also fiddled around with other parameters, as from + size cannot exceed 10000 by default (the index.max_result_window setting). That limit exists for a reason: deep pagination (what you are trying to do) requires a decent amount of memory, which in turn might end up in garbage collection (and that might be the reason for your slowness).

The correct way to do deep pagination is to use search_after or a scroll search instead. If you have lots of requests doing this, then search_after should be used, as a scroll is a point-in-time snapshot that requires certain resources to be held open.
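
A minimal sketch of what search_after paging could look like from the client side, not a definitive implementation. The field names `created_at` and `doc_id` are hypothetical (any deterministic sort with a unique tiebreaker field would do), and `search` is assumed to be whatever function you already use to send a _search body to your index and get the parsed JSON response back.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_page_request(query: Dict[str, Any], page_size: int,
                       search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build a _search body that pages with search_after instead of from."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": query,
        # search_after needs a deterministic sort with a unique tiebreaker.
        # "created_at" and "doc_id" are hypothetical field names.
        "sort": [{"created_at": "asc"}, {"doc_id": "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit from the previous page.
        body["search_after"] = search_after
    return body


def iterate_hits(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                 query: Dict[str, Any],
                 page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every hit, one page at a time, without using from."""
    cursor: Optional[List[Any]] = None
    while True:
        response = search(build_page_request(query, page_size, cursor))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Each hit carries its sort values; the last one becomes the cursor.
        cursor = hits[-1]["sort"]
```

Because each request only asks for the next `page_size` hits after the cursor, every shard keeps a queue of `page_size` entries rather than from + size entries, no matter how deep you page.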

--Alex

Thanks, Alex, for your response. Since from + size should not exceed 10k, will Elasticsearch load all the data (from + size) into the heap? Does that mean 10k documents will be loaded into the heap? And one more doubt: how will it impact my write operations?

No, it will not load all data in heap.

See https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
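
To put rough numbers on that page: in the query phase each shard builds a sorted queue of from + size entries (document IDs and sort values, not full documents), and the coordinating node merges all of them before fetching only `size` documents. A back-of-the-envelope sketch for the query from the slow log, assuming the search hits a single index with 6 shards as in the cluster specification above (a search across several indices would multiply this further):

```python
def deep_page_cost(from_: int, size: int, shards: int) -> dict:
    """Estimate how many sort entries a from/size page makes the cluster handle."""
    per_shard = from_ + size  # each shard must rank this many entries
    return {
        "entries_per_shard": per_shard,
        "entries_merged_on_coordinator": per_shard * shards,
        "hits_returned": size,
    }


# Values taken from the slow query log: size:1000, from:500000.
cost = deep_page_cost(from_=500_000, size=1000, shards=6)
print(cost)
```

So roughly 3 million entries are sorted and merged to return 1000 hits, and this is repeated for every such request.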

Ohhh, got clarity. I will convert to scroll. But I am still curious: how will it impact my writes?
