Transport response handler issue is crashing Elasticsearch 2.4.0 and Kibana

We are continuously facing the transport response handler issue in 2.4.0, and shards are becoming unassigned. It then takes a long time for all the shards to be assigned again.

I saw in a thread that 2.4.1 fixes this issue. Can anyone confirm that upgrading to 2.4.1 will fix it?

Also, is upgrading from 2.4.0 to 2.4.1 a complex task?

Hey,

can you please explain in more detail what you mean by the transport handler issue?

Thanks!

--Alex

we are continuously seeing messages like the ones below in the log:

[2017-01-31 09:48:41,940][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][99642][1304] duration [3s], collections [1]/[3.9s], total [3s]/[9m], memory [25.3gb]->[19.3gb]/[27gb], all_pools {[young] [7.1gb]->[70.1mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [17.2gb]->[18.3gb]/[18.6gb]}
[2017-01-31 09:48:58,563][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2785001]
[2017-01-31 09:49:43,539][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][old][99681][35] duration [22.4s], collections [1]/[23.2s], total [22.4s]/[55m], memory [26.3gb]->[2.9gb]/[27gb], all_pools {[young] [7.1gb]->[20.5mb]/[7.4gb]}{[survivor] [955.6mb]->[0b]/[955.6mb]}{[old] [18.3gb]->[2.9gb]/[18.6gb]}
[2017-01-31 09:50:56,106][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][99743][1307] duration [9.7s], collections [1]/[10.6s], total [9.7s]/[9.2m], memory [11.2gb]->[6.5gb]/[27gb], all_pools {[young] [7.2gb]->[64.4mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [3gb]->[5.5gb]/[18.6gb]}
[2017-01-31 09:51:24,726][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][99765][1308] duration [7.1s], collections [1]/[7.4s], total [7.1s]/[9.3m], memory [13.7gb]->[8.4gb]/[27gb], all_pools {[young] [7.2gb]->[26.1mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [5.5gb]->[7.4gb]/[18.6gb]}
[2017-01-31 09:51:49,367][INFO ][cluster.service ] [ITTESPROD-DATA3] added {{ITTESPROD-DATA4}{a3vFGxwST8i4eKZig5kuzg}{10.158.36.208}{10.158.36.208:9300}{master=false},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-01-31 09:51:54,354][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][99793][1309] duration [2.2s], collections [1]/[2.4s], total [2.2s]/[9.3m], memory [15.8gb]->[9.9gb]/[27gb], all_pools {[young] [7.4gb]->[43.8mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [7.4gb]->[8.9gb]/[18.6gb]}
[2017-01-31 09:51:57,785][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795027]
[2017-01-31 09:51:57,785][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795026]
[2017-01-31 09:51:57,818][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795029]
[2017-01-31 09:51:57,819][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795028]
[2017-01-31 09:51:57,934][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795030]
[2017-01-31 09:51:57,935][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795031]
[2017-01-31 09:51:57,935][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795033]
[2017-01-31 09:51:57,935][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795035]
[2017-01-31 09:51:57,935][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795032]
[2017-01-31 09:51:57,951][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795036]
[2017-01-31 09:51:57,951][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795034]
[2017-01-31 09:51:57,983][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795038]
[2017-01-31 09:51:58,006][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795037]
[2017-01-31 09:51:58,068][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795039]
[2017-01-31 09:51:58,288][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795045]
[2017-01-31 09:51:58,303][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795047]
[2017-01-31 09:51:58,311][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795042]
[2017-01-31 09:51:58,311][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795044]
[2017-01-31 09:51:58,319][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795043]
[2017-01-31 09:51:58,321][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795046]
[2017-01-31 09:51:58,321][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795040]
[2017-01-31 09:51:58,332][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795041]
[2017-01-31 09:51:58,338][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795049]
[2017-01-31 09:51:58,342][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795054]
[2017-01-31 09:51:58,343][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795051]
[2017-01-31 09:51:58,346][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795052]
[2017-01-31 09:51:58,352][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795048]
[2017-01-31 09:51:58,358][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795053]
[2017-01-31 09:51:58,359][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2795059]

Hey,

the messages above the error show that Elasticsearch is doing garbage collection (which means the application stops). I suppose the following messages are just the aftermath of your system hanging for as much as 20 seconds (as you can see in the GC log entries).

So the underlying issue here is to find out why your system is doing such long garbage collections. Did you maybe assign way too much memory to your JVM, so that a lot of memory piled up before it got collected (and that huge collection then took a lot of time)?
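If it helps, the current heap and GC numbers of a live node can be checked with the nodes stats API. A minimal sketch, assuming one of your HTTP-enabled nodes is reachable on port 9200 (the host name is a placeholder):

# heap_used_percent and the old-generation collection counts/times show how close the heap is to being full
curl -s 'http://client-node:9200/_nodes/stats/jvm?pretty'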

--Alex

Hi,

We have set the JVM heap (ES_HEAP_SIZE) to 28 GB, which is 50% of the 56 GB of memory available, as per the best practice.
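(For reference, on 2.x this is typically set via the ES_HEAP_SIZE environment variable; a sketch, assuming a package install, where the exact file depends on the distribution:)

# /etc/default/elasticsearch (Debian) or /etc/sysconfig/elasticsearch (RPM)
ES_HEAP_SIZE=28g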

Also, below is what the ES config looks like. Please suggest if we are deviating from any best practices.

http.enabled: false
threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 50%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

You know you can overwhelm your memory with that ridiculously high setting? Keep it at the default value.

If we set it to the default, the queue gets full and indexing requests are rejected, which is why we changed it to 1000.

This is clearly because of threadpool.bulk.size: 8, which is too small; it is not the default.

You make things worse when you increase the queue_size: your client will not be able to push data at speed; instead you will fill up the memory and stall the client until it stops because it is overwhelmed.

Either the configuration is a mess or your cluster nodes are too weak. But you are not showing the complete configuration.

If the bulk queue is getting full and you use the default thread pool settings, you have hit the node's capacity to receive data. It may also be a network or cluster setup issue. Check the bandwidth and performance data of the receiving nodes.
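Whether the bulk pool is actually rejecting can be seen via the cat API; a minimal sketch, assuming an HTTP-enabled node is reachable (the host name is a placeholder):

curl -s 'http://client-node:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'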


Also, the indices.store.throttle.max_bytes_per_sec : 2gb setting looks weird. Can you drop all that special configuration and retry (with the exception of the mlockall setting, which should be set if you are using swap)?
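In other words, strip the threadpool.* and indices.* tuning and leave roughly this (a sketch based on the snippet above; cluster, node and discovery settings stay as they are):

http.enabled: false
bootstrap.memory_lock: true
# threadpool.* and indices.* settings removed, i.e. back to defaults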

It seems the default is cores*5, which I reduced from 40 to 8. Is this leading to the increase in queue size? Currently I also see the queue size at 200. We used to get request rejections too with the default size and queue configuration, which is why I increased it to 1000.

Below is the complete config.

cluster.name: ittesprod
node.name: ITTESPROD-DATA3
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
network.host: _non_loopback_
http.enabled: false
discovery.zen.ping.unicast.hosts: ["10.158.36.220","10.158.36.200","10.158.36.201","10.158.36.202","10.158.36.210","10.158.36.211","10.158.36.212","10.158.36.204"]
threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 25%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

If the bulk queue is getting full and you use the default thread pool settings, you have hit the node's capacity to receive data. It may also be a network or cluster setup issue. Check the bandwidth and performance data of the receiving nodes.

Our nodes are in Azure and have somewhat lower network bandwidth and I/O per second.

@spinscale we were worried about bulk requests getting rejected with the default config, which is why we changed this setting.

indices.store.throttle.max_bytes_per_sec : 2gb

We changed it to improve indexing performance based on a suggestion here. But I guess 2 GB is a little high. Will this eat up a lot of JVM heap?

I have just changed the config to the one below, and I still faced the issue: GC took around 2 minutes and the node got disconnected.

cluster.name: ittesprod
node.name: ITTESPROD-DATA3
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
network.host: _non_loopback_
http.enabled: false
discovery.zen.ping.unicast.hosts: ["10.158.36.220","10.158.36.200","10.158.36.201","10.158.36.202","10.158.36.210","10.158.36.211","10.158.36.212","10.158.36.204"]
threadpool.bulk.queue_size: 100
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 15%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%

Also, what I am wondering is: I have the same configuration on every node (the cluster is 4 data, 3 master and 3 client nodes), but only 1 data node is having this disconnection issue. Is there any other setting to monitor on that node?

The max_bytes_per_sec setting has nothing to do with bulk indexing. How did you make that connection?
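(For what it's worth, that setting throttles store I/O for merges rather than consuming heap. A sketch of putting it back to a more conservative value via the cluster settings API, assuming it is still dynamically updatable on 2.4; otherwise change it in elasticsearch.yml and restart. The value shown is only an example:)

curl -XPUT 'http://client-node:9200/_cluster/settings' -d '{
  "transient": { "indices.store.throttle.max_bytes_per_sec": "20mb" }
}'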

Did you ensure that you don't send bulk requests that are too large? What's your bulk request size?

Does this one do anything different? Like accepting all the search/bulk connections instead of having them round-robined across the cluster?

@spinscale We are using Logstash for indexing, with its default batch size of 125.

I have the same config across all the nodes. Is there anything I can check to see if this node is configured to accept all the search/bulk connections?
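(One concrete thing to check here is the hosts list of the Logstash elasticsearch output: only the nodes listed there receive bulk traffic, and requests are spread across them. A rough sketch, assuming the logstash-output-elasticsearch plugin of that era; the host names are placeholders:)

output {
  elasticsearch {
    # list several HTTP-enabled (client) nodes so bulk requests are spread instead of all hitting one node
    hosts => ["client-node-1:9200", "client-node-2:9200", "client-node-3:9200"]
  }
}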

Also, should the discovery.zen.ping.unicast.hosts setting contain the IPs of all the nodes, and should it be present in the config of all the nodes?

Currently, I don't have all the node IPs in discovery.zen.ping.unicast.hosts in every config. Does this have anything to do with how the requests are routed?
