Every node for itself

Hello,

We have a problem that recurs every 5-12 hours. While everything is
running smoothly (cluster is green), 1 node starts behaving
irrationally and every other node forms its own cluster (not a 1/4
split, but a 1/1/1/1/1 split). This cluster is mainly used for training,
so we have heavy traffic spikes on both reads and writes when jobs are
triggered (plus some continuous small reads).

  1. What happened to btrainer-1.138?
  2. Even if 1 node (btrainer-1.138) behaves irrationally, why didn't the
    cluster split 1/4? Why did the other nodes lose the master,
    btrainer-1.182?

Setup :

5 similar nodes :

btrainer-1.182   (192.168.1.182)   (master before the incident)
btrainer-1.186   (192.168.1.186)
btrainer-1.136   (192.168.1.136)
btrainer-13.137  (192.168.13.137)
btrainer-1.138   (192.168.1.138)

ES Configs :

cluster.name: btrainer
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ "192.168.1.182:10300", "192.168.1.186:10300",
    "192.168.1.136:10300", "192.168.13.137:10300", "192.168.1.138:10300" ]
http.port: 10200
index.number_of_replicas: 4
transport.tcp.port: 10300
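
A setting that does not appear in the config above is discovery.zen.minimum_master_nodes. As a sketch only, assuming the usual majority rule of (N / 2) + 1 master-eligible nodes, which gives 3 for these 5 nodes, adding it would look like the fragment below. Whether it is related to the incident described here is not established; it only keeps a node that cannot see a majority from electing itself master.

```yaml
# Assumption: all 5 nodes are master-eligible, so a majority is 5 / 2 + 1 = 3.
discovery.zen.minimum_master_nodes: 3
```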

Java Options :

-Des-foreground=yes
-Des.path.home=/elasticsearch
-Xms4096m
-Xmx20480m
-Djline.enabled=true
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-cp /elasticsearch/lib/*:/elasticsearch/lib/sigar/*
org.elasticsearch.bootstrap.ElasticSearch

You can check the logs from the nodes here: https://gist.github.com/3510448

Best Regards,
Ozgur Orhan

--

Forgot to add: we are using version 0.19.8.

On Wed, Aug 29, 2012 at 2:07 PM, Özgür Orhan ozgurorhan@gmail.com wrote:

--