Disconnected transport client on ES 0.19.0


(davie) #1

Hi,
We recently upgraded to elasticsearch 0.19.0 and have been seeing the
following errors occassionally:

JsQS0K1nRNCb8GNgA][inet[/192.168.1.100:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[prod-node-prod][inet[/192.168.1.100:9300]][cluster/nodes/info] request_id
[143146] timed out after [5002ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)

    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown

Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)

We have 2 nodes in the cluster, and are connecting using the transport
client.
We can reproduce this error by running a query that takes tens of seconds
to complete. Then if we increase the value of
elasticsearch.client.transport.ping_timeout on the client to longer than
the query takes to return we don't get the error.

It looks like this may be related to the node disconnect on timeout
introduced in
https://github.com/elasticsearch/elasticsearch/commit/eb4f6709d97287b0c9de6af9bf5f4a42fcf98991but
I don't really understand what that timeout should do - could you
explain?
We can't reproduce the issue using curl, the long running queries return
without any problem.

We realise that long running queries should probably be using scan, so we
will switch to use that, but it would be nice to understand the timeout
behaviour too.

Thanks,
Davie


(Shay Banon) #2

When you say a query that takes long time to return, do you mean a query
that returns a very large resultset, or simply a query that takes long to
execute but returns a "not so large" result set? We can improve things by
using a different channel to do the ping, I will do it, but still, would be
interesting to investigate this further.

On Wed, Mar 28, 2012 at 6:53 PM, Davie Moston daviemoston@gmail.com wrote:

Hi,
We recently upgraded to elasticsearch 0.19.0 and have been seeing the
following errors occassionally:

JsQS0K1nRNCb8GNgA][inet[/192.168.1.100:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[prod-node-prod][inet[/192.168.1.100:9300]][cluster/nodes/info]
request_id [143146] timed out after [5002ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)

    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown

Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)

We have 2 nodes in the cluster, and are connecting using the transport
client.
We can reproduce this error by running a query that takes tens of seconds
to complete. Then if we increase the value of
elasticsearch.client.transport.ping_timeout on the client to longer than
the query takes to return we don't get the error.

It looks like this may be related to the node disconnect on timeout
introduced in
https://github.com/elasticsearch/elasticsearch/commit/eb4f6709d97287b0c9de6af9bf5f4a42fcf98991but I don't really understand what that timeout should do - could you
explain?
We can't reproduce the issue using curl, the long running queries return
without any problem.

We realise that long running queries should probably be using scan, so we
will switch to use that, but it would be nice to understand the timeout
behaviour too.

Thanks,
Davie


(davie) #3

Thanks for the reply.
The result set itself is fairly large.
Issuing the same query with curl returns around 145 meg of json, which
is roughly 15 thousand documents. This takes around a minute over
http, and data starts being returned almost immediately.
We also see intermittent failures for index requests using the same
client which we suspect are related as they happen at the same time.
In this case we see a nodedisconnected exception.
For this query we should probably switch to using scan/scroll, but I
had expected it to use the timeout passed in with the query rather
than causing a ping timeout.
This failure is new in 0.19.0, we were previously running 0.18.7.0 and
it worked fine.
Let me know if you need any more info.
Thanks,
Davie

On Mar 29, 1:17 pm, Shay Banon kim...@gmail.com wrote:

When you say a query that takes long time to return, do you mean a query
that returns a very large resultset, or simply a query that takes long to
execute but returns a "not so large" result set? We can improve things by
using a different channel to do the ping, I will do it, but still, would be
interesting to investigate this further.

On Wed, Mar 28, 2012 at 6:53 PM, Davie Moston daviemos...@gmail.com wrote:

Hi,
We recently upgraded to elasticsearch 0.19.0 and have been seeing the
following errors occassionally:

JsQS0K1nRNCb8GNgA][inet[/192.168.1.100:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[prod-node-prod][inet[/192.168.1.100:9300]][cluster/nodes/info]
request_id [143146] timed out after [5002ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportSe rvice.java:347)

    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown

Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)

We have 2 nodes in the cluster, and are connecting using the transport
client.
We can reproduce this error by running a query that takes tens of seconds
to complete. Then if we increase the value of
elasticsearch.client.transport.ping_timeout on the client to longer than
the query takes to return we don't get the error.

It looks like this may be related to the node disconnect on timeout
introduced in
https://github.com/elasticsearch/elasticsearch/commit/eb4f6709d97287b...I don't really understand what that timeout should do - could you
explain?
We can't reproduce the issue using curl, the long running queries return
without any problem.

We realise that long running queries should probably be using scan, so we
will switch to use that, but it would be nice to understand the timeout
behaviour too.

Thanks,
Davie


(Shay Banon) #4

It might take time to move 145mb on the channel, so possibly the ping
request from the transport client gets delayed because of it (wiht the
default 5 seconds ping timeout). I pushed a change to use the "high"
priority transport channel for ping requests to both master and upcoming
0.19.2.

On Thu, Mar 29, 2012 at 3:12 PM, davie daviemoston@gmail.com wrote:

Thanks for the reply.
The result set itself is fairly large.
Issuing the same query with curl returns around 145 meg of json, which
is roughly 15 thousand documents. This takes around a minute over
http, and data starts being returned almost immediately.
We also see intermittent failures for index requests using the same
client which we suspect are related as they happen at the same time.
In this case we see a nodedisconnected exception.
For this query we should probably switch to using scan/scroll, but I
had expected it to use the timeout passed in with the query rather
than causing a ping timeout.
This failure is new in 0.19.0, we were previously running 0.18.7.0 and
it worked fine.
Let me know if you need any more info.
Thanks,
Davie

On Mar 29, 1:17 pm, Shay Banon kim...@gmail.com wrote:

When you say a query that takes long time to return, do you mean a query
that returns a very large resultset, or simply a query that takes long to
execute but returns a "not so large" result set? We can improve things by
using a different channel to do the ping, I will do it, but still, would
be
interesting to investigate this further.

On Wed, Mar 28, 2012 at 6:53 PM, Davie Moston daviemos...@gmail.com
wrote:

Hi,
We recently upgraded to elasticsearch 0.19.0 and have been seeing the
following errors occassionally:

JsQS0K1nRNCb8GNgA][inet[/192.168.1.100:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[prod-node-prod][inet[/192.168.1.100:9300]][cluster/nodes/info]
request_id [143146] timed out after [5002ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportSe
rvice.java:347)

    at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown

Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)

We have 2 nodes in the cluster, and are connecting using the transport
client.
We can reproduce this error by running a query that takes tens of
seconds

to complete. Then if we increase the value of
elasticsearch.client.transport.ping_timeout on the client to longer
than

the query takes to return we don't get the error.

It looks like this may be related to the node disconnect on timeout
introduced in

https://github.com/elasticsearch/elasticsearch/commit/eb4f6709d97287b...Idon't really understand what that timeout should do - could you

explain?
We can't reproduce the issue using curl, the long running queries
return

without any problem.

We realise that long running queries should probably be using scan, so
we

will switch to use that, but it would be nice to understand the timeout
behaviour too.

Thanks,
Davie


(davie) #5

Thanks for the quick turnaround.
I just tried upgrading the elastic search jar on the client side to the latest 0.19 branch in git, but the problem still seems to be there.
Looking at the change it seems to only affect the client side, but maybe I'm missing something - I'll try upgrading the nodes too and see if that helps.
Thanks,
Davie


(Shay Banon) #6

Yea, it might be a bit trickier to solve... (though we have a step in the
right direction). Open an issue for it? Btw, can you create a standlone
testcase for this? It would help speed things around...

On Fri, Mar 30, 2012 at 3:55 PM, davie daviemoston@gmail.com wrote:

Thanks for the quick turnaround.
I just tried upgrading the elastic search jar on the client side to the
latest 0.19 branch in git, but the problem still seems to be there.
Looking at the change it seems to only affect the client side, but maybe
I'm missing something - I'll try upgrading the nodes too and see if that
helps.
Thanks,
Davie


(davie) #7

I've created an issue for this here
https://github.com/elasticsearch/elasticsearch/issues/1886 and a testcase
here https://gist.github.com/2501687.
We've been running with client.transport.ping_timeout=180s as a workaround
and haven't seen the problem since.

Thanks,
Davie

On Saturday, 31 March 2012 21:49:03 UTC+1, kimchy wrote:

Yea, it might be a bit trickier to solve... (though we have a step in the
right direction). Open an issue for it? Btw, can you create a standlone
testcase for this? It would help speed things around...

On Fri, Mar 30, 2012 at 3:55 PM, davie daviemoston@gmail.com wrote:

Thanks for the quick turnaround.
I just tried upgrading the elastic search jar on the client side to the
latest 0.19 branch in git, but the problem still seems to be there.
Looking at the change it seems to only affect the client side, but maybe
I'm missing something - I'll try upgrading the nodes too and see if that
helps.
Thanks,
Davie


(Shay Banon) #8

Commented on the issue, lets move the discussion there then (just so we
have a single place).

On Thu, Apr 26, 2012 at 9:44 PM, davie daviemoston@gmail.com wrote:

I've created an issue for this here
https://github.com/elasticsearch/elasticsearch/issues/1886 and a testcase
here https://gist.github.com/2501687.
We've been running with client.transport.ping_timeout=180s as a workaround
and haven't seen the problem since.

Thanks,
Davie

On Saturday, 31 March 2012 21:49:03 UTC+1, kimchy wrote:

Yea, it might be a bit trickier to solve... (though we have a step in the
right direction). Open an issue for it? Btw, can you create a standlone
testcase for this? It would help speed things around...

On Fri, Mar 30, 2012 at 3:55 PM, davie daviemoston@gmail.com wrote:

Thanks for the quick turnaround.
I just tried upgrading the elastic search jar on the client side to the
latest 0.19 branch in git, but the problem still seems to be there.
Looking at the change it seems to only affect the client side, but maybe
I'm missing something - I'll try upgrading the nodes too and see if that
helps.
Thanks,
Davie


(system) #9