Hi,

We recently upgraded to Elasticsearch 0.19.0 and have been seeing the following errors occasionally:

JsQS0K1nRNCb8GNgA][inet[/192.168.1.100:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-node-prod][inet[/192.168.1.100:9300]][cluster/nodes/info] request_id [143146] timed out after [5002ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

We have 2 nodes in the cluster, and are connecting using the transport client.

We can reproduce this error by running a query that takes tens of seconds to complete. If we then increase the value of elasticsearch.client.transport.ping_timeout on the client to longer than the query takes to return, we don't get the error.

It looks like this may be related to the node disconnect on timeout introduced in https://github.com/elasticsearch/elasticsearch/commit/eb4f6709d97287b... I don't really understand what that timeout is meant to do - could you explain?

We can't reproduce the issue using curl; the long-running queries return without any problem.

We realise that long-running queries should probably be using scan, so we will switch to that, but it would be nice to understand the timeout behaviour too.
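For reference, the client is created more or less like this (just a sketch - the cluster name, second node address and timeout value are placeholders, and the ping timeout shown is the client.transport.ping_timeout setting of the Java API):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Build the client settings; the ping timeout defaults to 5s, which is what the
// ReceiveTimeoutTransportException above is reporting.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "prod-cluster")            // placeholder
        .put("client.transport.ping_timeout", "30s")    // raising this past the query time hides the error
        .build();

// Connect the transport client to both nodes in the cluster.
TransportClient client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("192.168.1.100", 9300))
        .addTransportAddress(new InetSocketTransportAddress("192.168.1.101", 9300));  // placeholder address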
When you say a query that takes a long time to return, do you mean a query that returns a very large result set, or simply a query that takes a long time to execute but returns a "not so large" result set? We can improve things by using a different channel to do the ping - I will do that - but it would still be interesting to investigate this further.
Thanks for the reply.

The result set itself is fairly large. Issuing the same query with curl returns around 145 MB of JSON, which is roughly 15 thousand documents. This takes around a minute over HTTP, and data starts being returned almost immediately.

We also see intermittent failures for index requests using the same client, which we suspect are related as they happen at the same time. In that case we see a NodeDisconnectedException.

For this query we should probably switch to using scan/scroll, but I had expected it to use the timeout passed in with the query rather than causing a ping timeout.
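The change we have in mind is roughly the following (just a sketch against the 0.19 Java API; the index name, query and batch size are placeholders):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// Open a scan search: the first response carries no hits, only a scroll id.
SearchResponse scan = client.prepareSearch("our_index")      // placeholder index
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))                     // keep the scroll alive for 60s
        .setQuery(QueryBuilders.matchAllQuery())             // placeholder for the real query
        .setSize(500)                                        // hits per shard per batch
        .execute().actionGet();

// Pull the results back in batches instead of one 145 MB response.
while (true) {
    SearchResponse page = client.prepareSearchScroll(scan.scrollId())
            .setScroll(new TimeValue(60000))
            .execute().actionGet();
    if (page.hits().hits().length == 0) {
        break;                                               // no more results
    }
    for (SearchHit hit : page.hits()) {
        // process hit.sourceAsString() here
    }
    scan = page;
}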
This failure is new in 0.19.0; we were previously running 0.18.7.0 and it worked fine.
Let me know if you need any more info.
Thanks,
Davie
It might take time to move 145 MB on the channel, so possibly the ping request from the transport client gets delayed because of it (with the default 5 second ping timeout). I pushed a change to use the "high" priority transport channel for ping requests to both master and the upcoming 0.19.2.
Thanks for the quick turnaround.
I just tried upgrading the Elasticsearch jar on the client side to the latest 0.19 branch in git, but the problem still seems to be there.
Looking at the change, it seems to only affect the client side, but maybe I'm missing something - I'll try upgrading the nodes too and see if that helps.
Thanks,
Davie
Yea, it might be a bit trickier to solve... (though we have a step in the right direction). Can you open an issue for it? Btw, can you create a standalone test case for this? It would help speed things along...
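A standalone reproduction along these lines might work (a rough sketch only: the cluster name, index name, document count and document size are made up to approximate the ~145 MB result set described above):

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class PingTimeoutRepro {

    public static void main(String[] args) {
        // Start a local single-node "cluster".
        Node node = nodeBuilder()
                .settings(ImmutableSettings.settingsBuilder().put("cluster.name", "repro"))
                .node();

        // Index ~15,000 documents of ~10 KB each so the full result set is ~150 MB.
        Client nodeClient = node.client();
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            padding.append('x');
        }
        for (int i = 0; i < 15000; i++) {
            nodeClient.prepareIndex("repro", "doc", Integer.toString(i))
                    .setSource("{\"payload\":\"" + padding + "\"}")
                    .execute().actionGet();
        }
        nodeClient.admin().indices().prepareRefresh("repro").execute().actionGet();

        // Connect a transport client with the default 5s ping timeout and fetch
        // the whole result set in one request, like the failing production query.
        TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "repro").build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        client.prepareSearch("repro").setSize(15000).execute().actionGet();

        // If the bug reproduces, the nodes info ping times out and the client
        // disconnects from the node while the large response is being streamed.
        client.close();
        node.close();
    }
}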