Transport Response Handler issue is crashing the ES and kibana 2.4.0

nethis · January 30, 2017, 5:22pm

We are continuously facing the Transport Handler Issue in 2.4.0 and shards are becoming unassigned. This is taking longer time to assign all the shards back.

I saw in a thread, that 2.4.1 fixes this issue. Can any once confirm, the upgrade to 2.4.1 will fix this issue. ??

Also, is that a complex task to upgrade from 2.4.0 to 2.4.1 ?

spinscale · January 31, 2017, 9:26am

Hey,

can you please explain what you mean with the transport Handler Issue in more detail?

Thanks!

--Alex

nethis · January 31, 2017, 10:45am

we are continously seeing the messages like below in the log

[2017-01-31 09:48:41,940][WARN ][monitor.jvm [2017-01-31 09:48:58,563][WARN ][transport [2017-01-31 09:49:43,539][WARN ][monitor.jvm [2017-01-31 09:50:56,106][WARN ][monitor.jvm [2017-01-31 09:51:24,726][WARN ][monitor.jvm [2017-01-31 09:51:49,367][INFO ][cluster.service [2017-01-31 09:51:54,354][WARN ][monitor.jvm [2017-01-31 09:51:57,785][WARN ][transport [2017-01-31 09:51:57,785][WARN ][transport [2017-01-31 09:51:57,818][WARN ][transport [2017-01-31 09:51:57,819][WARN ][transport [2017-01-31 09:51:57,934][WARN ][transport [2017-01-31 09:51:57,935][WARN ][transport [2017-01-31 09:51:57,935][WARN ][transport [2017-01-31 09:51:57,935][WARN ][transport [2017-01-31 09:51:57,935][WARN ][transport [2017-01-31 09:51:57,951][WARN ][transport [2017-01-31 09:51:57,951][WARN ][transport [2017-01-31 09:51:57,983][WARN ][transport [2017-01-31 09:51:58,006][WARN ][transport [2017-01-31 09:51:58,068][WARN ][transport [2017-01-31 09:51:58,288][WARN ][transport [2017-01-31 09:51:58,303][WARN ][transport [2017-01-31 09:51:58,311][WARN ][transport [2017-01-31 09:51:58,311][WARN ][transport [2017-01-31 09:51:58,319][WARN ][transport [2017-01-31 09:51:58,321][WARN ][transport [2017-01-31 09:51:58,321][WARN ][transport [2017-01-31 09:51:58,332][WARN ][transport [2017-01-31 09:51:58,338][WARN ][transport [2017-01-31 09:51:58,342][WARN ][transport [2017-01-31 09:51:58,343][WARN ][transport [2017-01-31 09:51:58,346][WARN ][transport [2017-01-31 09:51:58,352][WARN ][transport [2017-01-31 09:51:58,358][WARN ][transport [2017-01-31 09:51:58,359][WARN ][transport ] [ITTESPROD-DATA3] [gc][young][99642][1304] duration [3s], collections [1]/[3.9s], total [3s]/[9m], memory [25.3gb]->[19.3gb]/[27gb], all_pools {[young] [7.1gb]->[70.1mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [17.2gb]->[18.3gb]/[18.6gb]}
] [ITTESPROD-DATA3] Transport response handler not found of id [2785001]
] [ITTESPROD-DATA3] [gc][old][99681][35] duration [22.4s], collections [1]/[23.2s], total [22.4s]/[55m], memory [26.3gb]->[2.9gb]/[27gb], all_pools {[young] [7.1gb]->[20.5mb]/[7.4gb]}{[survivor] [955.6mb]->[0b]/[955.6mb]}{[old] [18.3gb]->[2.9gb]/[18.6gb]}
] [ITTESPROD-DATA3] [gc][young][99743][1307] duration [9.7s], collections [1]/[10.6s], total [9.7s]/[9.2m], memory [11.2gb]->[6.5gb]/[27gb], all_pools {[young] [7.2gb]->[64.4mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [3gb]->[5.5gb]/[18.6gb]}
] [ITTESPROD-DATA3] [gc][young][99765][1308] duration [7.1s], collections [1]/[7.4s], total [7.1s]/[9.3m], memory [13.7gb]->[8.4gb]/[27gb], all_pools {[young] [7.2gb]->[26.1mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [5.5gb]->[7.4gb]/[18.6gb]}
] [ITTESPROD-DATA3] added {{ITTESPROD-DATA4}{a3vFGxwST8i4eKZig5kuzg}{10.158.36.208}{10.158.36.208:9300}{master=false},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
] [ITTESPROD-DATA3] [gc][young][99793][1309] duration [2.2s], collections [1]/[2.4s], total [2.2s]/[9.3m], memory [15.8gb]->[9.9gb]/[27gb], all_pools {[young] [7.4gb]->[43.8mb]/[7.4gb]}{[survivor] [955.6mb]->[955.6mb]/[955.6mb]}{[old] [7.4gb]->[8.9gb]/[18.6gb]}
] [ITTESPROD-DATA3] Transport response handler not found of id [2795027]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795026]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795029]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795028]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795030]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795031]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795033]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795035]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795032]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795036]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795034]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795038]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795037]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795039]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795045]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795047]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795042]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795044]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795043]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795046]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795040]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795041]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795049]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795054]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795051]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795052]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795048]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795053]
] [ITTESPROD-DATA3] Transport response handler not found of id [2795059]

spinscale · January 31, 2017, 11:36am

Hey,

the messages above the error show, that elasticsearch is doing garbage collection (which means that the application stops) - I suppose the following messages is just aftermath after your system hung for as much as 20 seconds (as you can see in the gc log entries).

So the underlying issue here, is to find out why your system is doing such long garbage collections. Did you maybe assign way too much memory to your JVM, which resulted in a lot of memory piling up, before it gets collected (and this huge collection took a lot of time)?

--Alex

nethis · January 31, 2017, 12:29pm

Hi,

We have assigned the JVM heap ES_HEAP_SIZE of 28 GB which is 50% of 56 GB Memory available as per the best practice.

Also, below is the ES Config looks like, please suggest, if we are deviating any best practices

http.enabled: false
threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 50%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

jprante · January 31, 2017, 5:53pm

You know you can overwhelm your memory with that ridiculous high setting? Keep it at the default value.

nethis · January 31, 2017, 6:22pm

If we are setting it to default, queue is getting full, and indexing requests are being discarded, because of which, we changed it to 1000.

jprante · January 31, 2017, 6:41pm

This is clearly because threadpool.bulk.size: 8 which is too small, it is not the default.

You make things worse when you increase the queue_size, then your client will not be able to push data at speed, instead you will fill up the memory and stall the client until it stops because being overwhelmed.

The configuration is a mess or your cluster nodes are too weak. But you do not show the complete configuration.

If the bulk queue is getting full, and you use the default thread pool setting, you hit the node's capacity to receive data. Maybe also a network or cluster setup issue. Check bandwidth and performance data of receiving nodes.

spinscale · February 1, 2017, 8:26am

Also the indices.store.throttle.max_bytes_per_sec : 2gb settings looks weird. Can you drop all that special configuration and retry? (with the exception of the mlockall setting, that should be set if you are using swap).

nethis · February 1, 2017, 10:08am

Seems the default is cores*5, which I reduced 40 to 8. Is this leading the increase in queue size ? Currently I see the Queue size to 200 also. We use to get, request rejections too, with default configuration of size and queue. So, because of which, I have increased it to 1000.

Below is the complete config.

cluster.name: ittesprod
node.name: ITTESPROD-DATA3
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
network.host: _non_loopback_
http.enabled: false
discovery.zen.ping.unicast.hosts: ["10.158.36.220","10.158.36.200","10.158.36.201","10.158.36.202","10.158.36.210","10.158.36.211","10.158.36.212","10.158.36.204"]
threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 25%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

If the bulk queue is getting full, and you use the default thread pool setting, you hit the node's capacity to receive data. Maybe also a network or cluster setup issue. Check bandwidth and performance data of receiving nodes.

Our nodes are in Azure and have little less in network bandwidth and I/O Per Sec.

nethis · February 1, 2017, 10:11am

@spinscale we are worried of getting the bulk requests getting rejected with the default config, and we have changed this setting.

indices.store.throttle.max_bytes_per_sec : 2gb

We changed it to improve the indexing performance based on suggestion here . But, I guess 2 GB is little high. Will this eat up lot of JVM Heap?

nethis · February 1, 2017, 11:57am

I have just changed the config to below, I still faced the issue, GC took around 2 mins. and node got disconnected.

cluster.name: ittesprod
node.name: ITTESPROD-DATA3
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
network.host: _non_loopback_
http.enabled: false
discovery.zen.ping.unicast.hosts: ["10.158.36.220","10.158.36.200","10.158.36.201","10.158.36.202","10.158.36.210","10.158.36.211","10.158.36.212","10.158.36.204"]
threadpool.bulk.queue_size: 100
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 15%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%

Also, what I am wondering is, I am having same configuration with every node(Cluster is 4 data, 3 master and 3 client), but, only 1 data node is having this issue of disconnection. Any other setting to monitor in that node?

spinscale · February 1, 2017, 1:12pm

the max_bytes_per_sec has nothing to do with bulk indexing. How did you make that connection?

Did you ensure that you dont send too large bulk requests. What's your bulk request size?

Does this one do anything different? Like accepting all the search/bulk connections instead of those beign round robin'ed across the cluster?

nethis · February 2, 2017, 4:16am

@spinscale We are using logstash for indexing, its default batch size of 125 .

I have the same config, across all the nodes. Is there any thing I can check, if this nodes is configured to accept all the search/bulk connections ??

Also, discovery.zen.ping.unicast.hosts, setting should contain IPs of all the nodes ? and should have that in config of all the nodes?

Currently, I don't have all the nodes IPs in the discovery.zen.ping.unicast.hosts in all configs. ?? does this to do anything to route the requests?

Topic		Replies	Views
Upgraded from ES2.3.5 to 2.4.0, seeing: Transport response handler not found of id Elasticsearch	21	10468	January 30, 2017
Transport response handler not found of Elasticsearch	6	8508	October 12, 2016
Nodes are getting disconnecting and connecting again in few seconds Elasticsearch	5	3432	February 1, 2017
Warn which crashes server Elasticsearch	15	2224	May 13, 2010
Did I get to the upper limit of /bulk upload API? Elasticsearch	47	14416	December 1, 2016

Transport Response Handler issue is crashing the ES and kibana 2.4.0

Related topics