Cluster cannot resume automatically after the last member times out


(Wing) #1

I have a 2-node cluster and one Java process that runs reindexing jobs
every minute using TransportClient.

  1. I take down one node, and the client fails over and continues to
    work without a problem.

  2. Suddenly the client times out while getting the local cluster state,
    and an exception is thrown:

2012-06-13 18:06:39,630 INFO org.elasticsearch.client.transport - [Rick Jones] failed to get local cluster state for [Klaw][NJ8o2F7dQ0eg1JO97b1W0w][inet[/10.1.4.197:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [Klaw][inet[/10.1.4.197:9300]][cluster/state] request_id [46533] timed out after [5001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

  3. From then on, every index request submitted through the
    TransportClient throws a "No node available" exception, and the client
    does not recover by itself even though the node is still there and up.

  4. When I bring back the downed node (the one taken down in step 1), the
    TransportClient detects it and reconnects to the cluster.

Does that mean the TransportClient will not recover when the last
remaining member of the cluster times out?
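
One possible client-side stopgap for the "No node available" state described above is to throw away the stuck client and build a new one before retrying. The sketch below is only an illustration under assumptions: RebuildingClient, its factory, and its closer are hypothetical names, not part of Elasticsearch. It assumes Java 8 or later, and it assumes that a freshly built client would reconnect to the node that is still up, which has not been verified here. It deliberately contains no Elasticsearch calls, so the client type C is generic.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an Elasticsearch class): keeps one client of type C
 * and, when an operation fails, closes it, builds a fresh one and retries.
 */
final class RebuildingClient<C> {
    private final Supplier<C> factory;   // assumed to build a newly connected client
    private final Consumer<C> closer;    // assumed to release the old client's resources
    private final int maxAttempts;
    private C client;

    RebuildingClient(Supplier<C> factory, Consumer<C> closer, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.closer = closer;
        this.maxAttempts = maxAttempts;
        this.client = factory.get();
    }

    /** Runs the operation, rebuilding the client after each failed attempt except the last. */
    synchronized <R> R call(Function<C, R> operation) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.apply(client);
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    closer.accept(client);   // drop the client that failed
                    client = factory.get();  // start over with a new one
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

With a real client, the factory would create and configure the TransportClient and the closer would close it. Catching RuntimeException is a deliberately broad choice for the sketch; a real job would probably rebuild only on the specific "No node available" error.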

Wing


(system) #2