Upgraded from ES2.3.5 to 2.4.0, seeing: Transport response handler not found of id


(J) #1

I just upgraded to ES2.4.0 and I'm noticing these errors every time ES starts. What does this mean?

[2016-09-04 04:03:38,655][WARN ][transport                ] [caste] Transport response handler not found of id [238]
[2016-09-04 04:03:38,764][WARN ][transport                ] [caste] Transport response handler not found of id [241]
[2016-09-04 04:03:39,567][WARN ][transport                ] [caste] Transport response handler not found of id [246]
[2016-09-04 04:03:39,815][WARN ][transport                ] [caste] Transport response handler not found of id [248]
[2016-09-04 04:03:42,886][WARN ][transport                ] [caste] Transport response handler not found of id [313]
[2016-09-04 04:03:44,960][WARN ][transport                ] [caste] Transport response handler not found of id [346]
[2016-09-04 04:03:44,994][WARN ][transport                ] [caste] Transport response handler not found of id [347]
[2016-09-04 04:03:45,881][WARN ][transport                ] [caste] Transport response handler not found of id [359]
[2016-09-04 04:03:46,007][WARN ][transport                ] [caste] Transport response handler not found of id [360]
[2016-09-04 04:03:49,791][WARN ][transport                ] [caste] Transport response handler not found of id [388]

(Tin Le) #2

Make sure all nodes are same version, 2.4. Do you have beats feeding into your cluster? Have you upgraded logstash to 2.4?
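A quick way to verify this, assuming the cluster is reachable on localhost:9200, is the cat nodes API:

```shell
# List each node's name and Elasticsearch version;
# any mismatch in the version column is a likely culprit.
curl -s 'localhost:9200/_cat/nodes?v&h=name,version'
```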

Tin


(J) #3

All nodes are upgraded from 2.3.5 to 2.4, no Beats. LS 1.5 using the HTTP protocol. I had never seen this message before.


(Tin Le) #4

The only time I've seen that error message is when there is a mismatch in version. Maybe upgrade your LS?

Tin


(J) #5

Thanks, unfortunately I can't upgrade LS; the plugins I use don't work well with the latest LS. I restarted the ES cluster with LS off, and I was still seeing these messages.


(Tin Le) #6

Yes, it looks like it might be something else then... Maybe the Elastic folks can chime in.

Tin


(Ids Van Der Molen) #7

Hi, I also noticed the same errors when using ES 2.4.0 and logstash 2.4.0.


(Koen Vanoppen) #8

Yep, me too... It's just spitting out log lines one after the other, filling the whole disk.
And sometimes this comes up (elasticsearch03dev-es-cluster-dev is another node):
[elasticsearch03dev-es-cluster-dev][10.206.13.216:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elasticsearch03dev-es-cluster-dev][10.206.13.216:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@6807b591 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5e53305a[Running, pool size = 4, active threads = 4, queued tasks = 4138, completed tasks = 29660586]]];]];
[0]: index [.marvel-es-1-2016.09.14], type [node_stats], id [AVcnnsDhtV0Qddn6Zthd], message [RemoteTransportException[[elasticsearch03dev-es-cluster-dev][10.206.13.216:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elasticsearch03dev-es-cluster-dev][10.206.13.216:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@6807b591 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5e53305a[Running, pool size = 4, active threads = 4, queued tasks = 4138, completed tasks = 29660586]]];]]
[0]: index [.marvel-es-1-2016.09.14], type [node_stats], id [AVcnnum-tV0Qddn6ZudA], message [RemoteTransportException[[elasticsearch03dev-es-cluster-dev][10.206.13.216:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elasticsearch03dev-es-cluster-dev][10.206.13.216:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@6131d78d on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5e53305a[Running, pool size = 4, active threads = 4, queued tasks = 4005, completed tasks = 29662634]]];]];
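The EsRejectedExecutionException above means the bulk thread pool queue (capacity 50) is overflowing, so writes are being rejected. A sketch for checking bulk rejections per node, assuming the cluster is reachable on localhost:9200:

```shell
# Show bulk thread pool activity and rejection counts per node;
# a growing "bulk.rejected" column confirms indexing back-pressure
# separate from the transport handler warnings.
curl -s 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'
```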


#9

Same problem here. We are running 4 data nodes, all on ES 2.4.0. I just upgraded all Logstash loggers to 2.4.0 but am still getting those transport errors.
It might be helpful to know that we are running 2 servers with 2 nodes each.


(Kim Kruse Hansen) #10

Same problem for me. 2 separate clusters, each with 2 nodes, all at 2.4.0.
Happens frequently.


(Brad) #11

+1 (after upgrading to 2.4)


(Tin Le) #12

I have 2.4.0 running on a test cluster of 3 nodes and have not had time to look at it recently. I just checked this morning, and although the number of these WARNings has gone down, I do see them in 2 of the data nodes. I no longer see them in my dedicated master node.

From the look of it, these warnings happen when the nodes are memory-stressed and/or experiencing high CPU load. It looks like Elastic added these warnings in 2.4.

Check your ES logs and see if you also see GC, high load and/or NodeDisconnectedException around same time frame.
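One way to correlate these, assuming the default log location (the path below is a placeholder for your install), is to grep for the related messages around the same timestamps:

```shell
# Hypothetical log path; adjust for your install.
LOG=/var/log/elasticsearch/elasticsearch.log
# Look for GC pauses and node disconnects near the transport warnings.
grep -E 'gc|NodeDisconnectedException|Transport response handler' "$LOG" | tail -n 50
```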


#13

I also see this message not only during times of heavy load, but also when the system is rather quiet. I am running on the GCE cloud. I wonder if this relates to network interruptions. It would be nice to know what this message means.


#14

I'm seeing the same problem, so I'm hoping Elastic will respond. (edit: I found a solution, which is at the bottom of my post.)

All of my nodes are 2.4.0; I just upgraded them all from 2.1.1 to 2.4.0. In my case, I have a master and n data nodes that are all behaving fine; however, I have a couple of client-only (non-data, non-master) nodes that are showing this error after the upgrade.

I tried increasing memory allocations on those boxes, but that had no effect. There is no Logstash or Kibana running. The cluster is closed off for the moment, so there are no clients. I wondered if it was a networking issue, but I can reach the master from the client nodes (curl to the ES API works). They're on the same subnet, and I disabled iptables on the clients just in case. Nothing has helped. I'll try updating my log level next.

I'll add that this is preventing the client nodes from connecting to the master, so they're pretty useless.

edit: I noticed in logs later that I was seeing the below: failed to send join request to master [{Poltergeist}{eBP_rUJCTDimhgLUTiduxg}{192.168.1.4}{192.168.1.4:9300}{data=false, master=true}], reason [RemoteTransportException[[Poltergeist][192.168.1.4:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[Star-Lord][10.x.x.x:9300] connect_timeout[30s]]; nested: NotSerializableExceptionWrapper[connect_timeout_exception: connection timed out: /10.x.x.x:9300];

In this case, I saw an IP on a separate interface was being used to connect to the master. I temporarily turned that interface off and I was able to start my client without issues. Afterward, I reenabled that interface. Is this going to be ok or will the problem recur now that I've enabled the iface? For the time being (first few minutes after restart), it's ok. I suspect there's a config somewhere I should be able to set to get around this, as I can't keep shutting down our interfaces. How can I inform ES 2.4.0 to only use one interface for talking to master (but a separate interface for serving client requests)?

edit2 (solution for me): the config I needed to set was network.publish_host. What's strange is I didn't need to put this in my prior configs. I've always used the default network settings in prior ES versions.
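For reference, a minimal sketch of that change, assuming a standard package install and that 192.168.1.10 is the interface that should carry cluster traffic (both the path and the address are placeholders):

```shell
# Pin the address this node advertises to the rest of the cluster,
# so transport traffic stays on one interface.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
network.publish_host: 192.168.1.10
EOF
```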


(Seth S) #15

+1 for this issue

I followed @milleka2's advice to set network.publish_host on each ES node and I'm still receiving the error.


(Tin Le) #16

Someone reported seeing this warning after they changed their network MTU.

So this seems to be related to packet loss.


(Seth S) #17

**EDIT: The fix mentioned below did not stop this error from occurring.**

I think lost packets could cause this; however, there have been no changes to MTU settings at my site.

My understanding (and someone please correct me if I'm wrong) is that these errors are regarding events being lost at the transport layer for whatever reason... which would typically be network issues.

I'm unsure of everyone else's setup, but what seems to have been causing this for me was a misconfiguration on the client node in my cluster. I have two master-eligible nodes, two data-only nodes, and one client node (non-data, non-master, meant for load balancing etc.).

The problem appears to have been that I had in my elasticsearch.yml:

node.max_local_storage_nodes: 1

which doesn't make sense for a node that isn't allowed to store data locally. When the client node entered the cluster, all nodes would be notified of the client node's properties, e.g.

[2016-09-26 13:06:25,386][INFO ][cluster.service ] [hyd-mon-storage01] added {{load-balance-node}{Pp9lrLb2S-W9VDY4d_zsKg}{10.191.4.126}{10.191.4.126:9300}{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-receive(from master [{phys-node}{PFjAWYe9T_W_VsmKm-hcFQ}{10.191.5.129}{10.191.5.129:9300}{max_local_storage_nodes=1, master=true}])

So they'd see it as having storage, the master would attempt to write to this node even though it doesn't accept data, and thus the event gets lost. Furthermore, I was seeing my primary index randomly get deleted. Changing it to:

node.max_local_storage_nodes: 0

appears to have resolved this issue for me.

If you're still seeing this error in your cluster, I'd attempt to recreate the path an event takes from Logstash into your cluster, all the way through to a data node. There's likely some misconfiguration causing the errors. I'll keep updating this thread with any further results and conclusions I reach.


(Andrew Stoker) #18

I have the same issue. A quick search suggests the upcoming 2.4.1 release may have the fix: https://github.com/elastic/elasticsearch/pull/20585


(Andrew Stoker) #19

I can confirm that with the 2.4.1 release this error has been resolved.


(Seth S) #20

Also confirmed that upgrading to 2.4.1 resolves this issue.