Hi everyone,
We are hitting a pretty limiting issue with our Elasticsearch cluster, which has 3 master nodes, 3 data nodes, and 2 client nodes.
Our code creates time-based indices every half hour. This is meant to simplify time-based expiry and to address some other design requirements/constraints in the system. We are looking at making the indices cover a longer period (maybe a day each), but first need to resolve the issues in the current stack, which seem unrelated to the index design.
It all started with around 2500 indices created with the default number_of_shards of 5. This made our queries incredibly slow. Based on the recommendation from ES support, we re-indexed after changing our index template to set number_of_shards to 1, as there isn't a lot of data per index at the moment.
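For reference, the template change was essentially the following (a simplified sketch using the 2.x-era Java admin API; the template name and the "events-*" index pattern are placeholders, and our real template also carries mappings):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class HalfHourlyTemplate {
    // Sketch: apply number_of_shards=1 to all future time-based indices via an index template.
    public static void putTemplate(Client client) {
        client.admin().indices().preparePutTemplate("half-hourly")   // placeholder template name
                .setTemplate("events-*")                             // placeholder index pattern
                .setSettings(Settings.settingsBuilder()
                        .put("index.number_of_shards", 1)            // was the default of 5
                        .put("index.number_of_replicas", 1))
                .get();
    }
}
```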
The re-index seemed to do the trick: our query times came down from 15s to 5s. However, we are now seeing transport client delays of 5-9 times the actual ES query time. We printed the "took" value from the JSON response in the transport client to compare how long ES really took against when we actually got the response: ES takes about 5s per "took", but the transport client only returns after 25-45s.
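The comparison itself is straightforward; roughly what we do on the client side (simplified, index pattern is a placeholder):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;

public class TookVsWallClock {
    // Compare the server-side "took" with the end-to-end time seen by the transport client.
    public static void timeSearch(Client client) {
        long start = System.currentTimeMillis();
        SearchResponse response = client.prepareSearch("events-*").get();   // placeholder index pattern
        long wallClock = System.currentTimeMillis() - start;
        // "took" covers only the search execution on the cluster; wallClock also
        // includes transport, serialization and any client-side queueing/waits.
        System.out.println("took=" + response.getTookInMillis()
                + "ms, end-to-end=" + wallClock + "ms");
    }
}
```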
We are also seeing a lot of seemingly unconnected exceptions.
In the master we see:
[2017-07-20 00:56:24,054][WARN ][discovery.zen.publish ] [staging-1-elasticsearch-main-master-1] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{staging-1-elasticsearch-main-data-1}{p5RtXb8HTR6UUK96t5nAuA}{10.203.110.1}{10.203.110.1:9300}])
[2017-07-20 00:56:24,116][WARN ][cluster.service ] [staging-1-elasticsearch-main-master-1] cluster state update task [zen-disco-join(join from node[{staging-1-elasticsearch-main-data-1}{p5RtXb8HTR6UUK96t5nAuA}{10.203.110.1}{10.203.110.1:9300}])] took 30.2s above the warn threshold of 30s
In our transport client we see:
{"timestamp":"2017-07-20T04:03:49.004+00:00","message":"[Achilles] SSL/TLS handshake failed, closing channel: null","loggerName":"org.elasticsearch.shield.transport.netty","thread":"elasticsearch[Achilles][transport_client_worker][T#15]{New I/O worker #327}","level":"ERROR","className":"org.elasticsearch.common.logging.log4j.Log4jESLogger","methodName":"internalError","fileName":"Log4jESLogger.java","line":140}
We were seeing this:
[T#1]","level":"INFO","stack_trace":"org.elasticsearch.transport.ReceiveTimeoutTransportException: [][10.143.243.134:9300][cluster:monitor/nodes/liveness] request_id [4] timed out after [5000ms]\n\tat
but after we increased the client timeout to a larger value, that one seems to have gone away.
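What we changed was the transport client's ping/liveness timeout, roughly like this (simplified sketch; Shield/SSL settings are omitted, the cluster name is an abbreviated guess, and the 30s value here is just illustrative):

```java
import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class ClientFactory {
    // Build a transport client with a larger liveness-check timeout.
    public static TransportClient build(String host) throws Exception {
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "staging-1-elasticsearch-main")   // assumption: abbreviated cluster name
                .put("client.transport.ping_timeout", "30s")           // default is 5s, matching the 5000ms timeouts above
                .put("client.transport.nodes_sampler_interval", "30s")
                // Shield/SSL-related client settings omitted here for brevity
                .build();
        return TransportClient.builder()
                .settings(settings)
                .build()
                .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(host), 9300));
    }
}
```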
On the ES Client node we see the following:
[2017-07-19 01:07:50,753][WARN ][netty.handler.ssl.SslHandler] Unexpected leftover data after SSLEngine.unwrap(): status=OK handshakeStatus=NEED_WRAP consumed=0 produced=0 remaining=7 data=15030300020100
We are unsure what the cause of these errors is, and whether the re-indexing (or the process we used to re-index) left the cluster in an unstable state.
Has anyone run into this kind of issue? Any pointers would be very helpful.
Thanks,
Manasvini