0.90.2: Discovery BUG when network outage occurs?


(amos.wood) #1

Background

We have been using ES for quite a while now, and on multiple occasions a
network outage has left the data nodes unable to rejoin the cluster after
the outage was resolved. The only way to resolve this was to reboot the
cluster (or maybe just the current master).

While trying to track this issue down, I noticed that quite a few people
have posted similar issues on this forum, but no one was able to resolve
them.

Testing Scenario

I have a small cluster (2 master/data nodes) running 0.90.2 using unicast
discovery.
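
A setup like the one described could be configured roughly as below. This is
only a sketch: the actual configuration was not posted, so the cluster name
and host names are placeholders, and the settings shown are the standard zen
unicast discovery options rather than the values from this cluster.

```yaml
# elasticsearch.yml on each of the two master/data nodes (sketch only)
cluster.name: my-cluster                      # placeholder name
node.master: true
node.data: true

# Disable multicast and list the cluster members explicitly (unicast)
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-node-1:9300", "es-node-2:9300"]  # placeholder hosts

# The third node would use the same cluster name and unicast hosts,
# but with:
#   node.data: false
# Whether it was also master-eligible is not stated in the post.
```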

Steps to Reproduce

  1. From a node running on my local laptop, I connect to the cluster as
    a 3rd, non-data node.
  2. I pull the network plug out of the back of my laptop and wait until I
    start to get "transport.netty" exceptions, which takes ~45 seconds.
  3. I then plug the network back in and wait until the initial connection
    to the cluster is made again and the master is discovered.
  4. I then unplug the network again before the cluster state has been
    successfully received from the master.
  5. The node then fails to get the cluster state from the master, but it
    does not keep trying to reconnect. It never attempts to reconnect
    again, and the 3rd non-data node has to be rebooted to rejoin.

Log

The log file for the 3rd non-data node is attached.

Conclusion

Since I can reproduce this issue reliably, is it a bug or is this expected
behavior?


