Cluster RED, and not recovering


(Brian Dunbar) #1

My cluster is red and it won't get better on its own.

Given the data returned below .. is it getting better? This is the second time this has happened, but the first time I had no data in the system and starting over was reasonable. Now .. this is easily 3 weeks of data intake. I don't want to lose more than I must.

I had four nodes in a cluster: three data nodes and a fourth, non-data node [1]. Node1's disk filled up just as I finished loading 100 GB of log file data into the cluster. Without realizing the state of Node1's disk, I turned on the live logstash feed from the Apache hosts, then queried the data from Kibana.

Node2 and Node3 went off the air - elasticsearch just stopped. Worried about the state of Node1's disk, I added a large drive to Node4 and configured it to accept data.

In short order Node1 and Node4 were members of the cluster. Cluster state was 'Red', so I flipped Node2 and Node3 back on. We now have ..

# curl -XGET 'http://localhost:9200/_cat/nodes'
ris-webstats01 127.0.0.1   25 25 1.32 d * ris-webstats01
ris-webstats03 127.0.0.1    3 63 0.15 d m ris-webstats03
ris-webstats04 10.210.2.96  6  8 0.00 d m ris-webstats04
ris-webstats02 127.0.0.1    1 62 0.00 d m ris-webstats02

# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "muostats",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 711,
  "active_shards" : 711,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 4421,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 3706
}
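(A sanity check on those numbers: the active, initializing, and unassigned shard copies should account for every shard copy the cluster knows about, and they do match the _shards total reported further down.)

```shell
# 711 active + 6 initializing + 4421 unassigned shard copies,
# taken from the _cluster/health output above.
echo $(( 711 + 6 + 4421 ))   # prints 5138, matching "total" : 5138
```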

_cat/indices tells me I have 515 'red' indices and 2 'yellow'.

Shards ..

{
  "_shards": {
    "total": 5138,
    "successful": 711,
    "failed": 0
  }
}

UPDATE 5:21 p.m.

I may have started to fix the problem? I'm not sure yet.

Found this link at stackoverflow.

I have done this

 curl -s localhost:9200/_cat/shards | grep UNASS | grep marvel
.marvel-2015.06.18  0 p UNASSIGNED
.marvel-2015.06.18  0 r UNASSIGNED
.marvel-2015.06.19  0 p UNASSIGNED
.marvel-2015.06.19  0 r UNASSIGNED
.marvel-kibana      0 p UNASSIGNED
.marvel-kibana      0 r UNASSIGNED

Not caring about the marvel indices, I then ran

 curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : ".marvel-2015.06.18", "shard" : 0, "node" : "ris-webstats01" } } ] }' http://localhost:9200/_cluster/reroute?pretty
{
  "error" : "RemoteTransportException[[ris-webstats01][inet[/10.210.2.26:9300]][cluster:admin/reroute]]; nested: ElasticsearchIllegalArgumentException[[allocate] trying to allocate a primary shard [.marvel-2015.06.18][0], which is disabled]; ",
  "status" : 400
}
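For the record (and hedged, since I did not end up running it): the 1.x reroute 'allocate' command accepts an allow_primary flag for exactly this error. It force-assigns the primary as an empty shard, discarding whatever the shard previously held, so it is only sane for throwaway indices like marvel.

```shell
# DANGEROUS: "allow_primary" : true force-allocates the primary EMPTY,
# discarding the shard's previous contents. Index and node names are
# the ones from this thread.
cat <<'EOF' > reroute.json
{
  "commands" : [ {
    "allocate" : {
      "index" : ".marvel-2015.06.18",
      "shard" : 0,
      "node" : "ris-webstats01",
      "allow_primary" : true
    }
  } ]
}
EOF
# curl -XPOST --data @reroute.json 'http://localhost:9200/_cluster/reroute?pretty'
python3 -c 'import json; json.load(open("reroute.json")); print("JSON-OK")'
```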

I did it again with an index from a day I can rebuild if I must.

curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : "logstash-2014.10.11", "shard" : 0, "node" : "ris-webstats04" } } ] }' http://localhost:9200/_cluster/reroute?pretty

And the output scrolled right off the terminal and was lost. Partial capture ..

          "state" : "STARTED",
          "primary" : true,
          "node" : "PUoYWozIQuigdl_0_m7BWQ",
          "relocating_node" : null,
          "shard" : 1,
          "index" : "logstash-2014.11.13"
        } ]
      }
    },
    "allocations" : [ ]
  }
}

I thought 'hmm' and executed the curl again ..

# curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : "logstash-2014.10.11", "shard" : 0, "node" : "ris-webstats04" } } ] }' http://localhost:9200/_cluster/reroute?pretty
{
  "error" : "RemoteTransportException[[ris-webstats01][inet[/10.210.2.26:9300]][cluster:admin/reroute]]; nested: ElasticsearchIllegalArgumentException[[allocate] failed to find [logstash-2014.10.11][0] on the list of unassigned shards]; ",
  "status" : 400
}

Update 7:44

Two hours after I executed the above, the number of unassigned shards dropped to the current level of 6.

{
  "cluster_name" : "muostats",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 2563,
  "active_shards" : 5126,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 6,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 2
}

# curl -s localhost:9200/_cat/shards | grep UNASS
logstash-2014.12.08 0 r UNASSIGNED
logstash-2014.12.10 1 r UNASSIGNED
logstash-2014.09.24 0 r UNASSIGNED
logstash-2014.12.26 1 r UNASSIGNED
logstash-2015.01.12 0 r UNASSIGNED
logstash-2014.10.20 1 r UNASSIGNED
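If those last six replicas had stayed stuck, the same allocate call could be looped over the _cat/shards output. A sketch, parsing a saved copy of the output so it can be dry-run offline (the sample lines, node name, and commented-out curl are the ones used above):

```shell
# Parse "index shard prirep state" lines and act on each unassigned
# REPLICA (replicas rebuild from the primary; primaries are riskier).
cat <<'EOF' > shards.txt
logstash-2014.12.08 0 r UNASSIGNED
logstash-2014.12.10 1 r UNASSIGNED
logstash-2014.09.24 0 p STARTED
EOF
while read -r index shard prirep state; do
  [ "$state" = "UNASSIGNED" ] && [ "$prirep" = "r" ] || continue
  echo "would allocate $index shard $shard on ris-webstats04"
  # curl -XPOST -d "{\"commands\":[{\"allocate\":{\"index\":\"$index\",\"shard\":$shard,\"node\":\"ris-webstats04\"}}]}" 'http://localhost:9200/_cluster/reroute?pretty'
done < shards.txt
```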

I'm going to eat dinner and think, but I'd really like to know exactly what I did .. if anything. Were the unassigned shards being assigned for me?

Update: Next Day 10:49

Last night I halted elasticsearch on ris-webstats01 and the cluster immediately turned 'green'. I thought 'great, I've got two working ES nodes and a green cluster' and went to bed.

This morning I started ris-webstats04 .. and it failed to join the cluster. Started ris-webstats01: same.

So confused right now.

Update to the Update

I changed 'discovery.zen.minimum_master_nodes:' from the default value '1' to '3', and this is what '04' is telling me in the log

[2015-07-16 08:54:15,098][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-16 08:54:16,046][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
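For context: the usual rule for this setting is a quorum of the master-eligible nodes, floor(N/2) + 1. With four master-eligible nodes that is indeed 3, but it also means no master can be elected unless three of the four nodes are up, which is consistent with the 'no known master node' retries above.

```shell
# Quorum arithmetic for discovery.zen.minimum_master_nodes (pre-7.x ES):
# floor(master_eligible_nodes / 2) + 1.
master_eligible=4
echo $(( master_eligible / 2 + 1 ))   # prints 3
```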

[1] Intended use was to run the monitoring plugins.


(Mike Simos) #2

Yes, the unassigned shards were being started for you. In the very first _cluster/health you can see initializing_shards: 6, so at any one time 6 shards are being initialized. With 5126 shards (primary + replica) and only 4 data nodes, you may have too many shards per node, depending on the size of your shards. And if you have slow disks it can take a long time to start them all.
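To put a rough number on that:

```shell
# 5126 active shard copies spread over 4 data nodes
# (integer division; the true average is ~1281.5).
echo $(( 5126 / 4 ))   # prints 1281
```

Well over a thousand shard copies per node is a lot for machines of this size.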

Can you connect to port 9300 from webstats04/01 to the elected master node? Are you using unicast or multicast?


(Brian Dunbar) #3

'telnet IP 9300' from 01 and 04 connects ok.

Unicast.

When I start elasticsearch on ris-webstats01 it errors as follows.

Note: there is no ../nodes/3 - there is a ../nodes/1 and a ../nodes/2.

[2015-07-17 09:53:45,524][INFO ][node                     ] [ris-webstats01] version[1.6.0], pid[25859], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-17 09:53:45,524][INFO ][node                     ] [ris-webstats01] initializing ...
[2015-07-17 09:53:45,530][INFO ][plugins                  ] [ris-webstats01] loaded [], sites []
[2015-07-17 09:53:45,564][ERROR][bootstrap                ] Exception
org.elasticsearch.ElasticsearchIllegalStateException: Failed to created node environment
        at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:164)
        at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:77)
        at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:245)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3

ris-webstats04 simply claims in its log

[2015-07-17 10:32:57,803][INFO ][http                     ] [ris-webstats04] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.210.2.96:9200]}
[2015-07-17 10:32:57,803][INFO ][node                     ] [ris-webstats04] started
[2015-07-17 10:32:59,755][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 10:33:24,702][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry

I conjecture this is because the node is not finding the other servers.


(Mike Simos) #4

I'd check the file permissions on /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3 and verify that the user running the Elasticsearch process has r/w access to that directory.


(Brian Dunbar) #5

There is no such directory:

Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3

[root@ris-webstats01 nodes]# ls -la  /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3
ls: cannot access /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3: No such file or directory
[root@ris-webstats01 nodes]# ls -la  /usr/share/elasticsearch-1.6.0/data/muostats/nodes
total 20
drwxrwxr-x 5 es es 4096 Jun 26 16:33 .
drwxrwxr-x 3 es es 4096 Jun 18 16:02 ..
drwxrwxr-x 4 es es 4096 Jul 13 12:08 0
drwxrwxr-x 3 es es 4096 Jun 26 16:18 1
drwxrwxr-x 3 es es 4096 Jun 26 16:33 2
[root@ris-webstats01 nodes]#

(Nemo) #6

Did you check your configuration file? Please make sure node.master: true is set. From your previous log it looks like there are no master nodes.

Can you paste the output of the command below?

curl -XGET 'http://localhost:9200/_nodes'


(Brian Dunbar) #7

Reply Part One

elasticsearch.yml

The Bad Boys

ris-webstats01. The host that won't run ES because it can't find a file that does not exist

cluster.name: muostats
node.name: "ris-webstats01"
path.data: /usr/share/elasticsearch-1.6.0/data
bootstrap.mlockall: true
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.26","10.210.2.97","10.210.2.98","10.210.2.96"]

ris-webstats04. The new server that won't join.

cluster.name: muostats
node.name: "ris-webstats04"
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.97","10.210.2.98","10.210.2.96"]

The Two That Are Acting Good

ris-webstats02. My good, good boy.

cluster.name: muostats
node.name: "ris-webstats02"
node.data: true
path.data: /home/esdata/data
bootstrap.mlockall: true
network.publish_host: 10.210.2.98
network.host: 10.210.2.98
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.26","10.210.2.97","10.210.2.98","10.210.2.96"]

ris-webstats03. My other good boy.

cluster.name: muostats
node.name: "ris-webstats03"
node.data: true
path.data: /home/esdata/data
bootstrap.mlockall: true
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.26","10.210.2.97","10.210.2.98","10.210.2.96"]


(Brian Dunbar) #8

Reply Part Two

ris-webstats01. The host that won't run ES because it can't find a file that does not exist

curl -XGET 'http://localhost:9200/_nodes'
curl: (7) couldn't connect to host

ris-webstats04. The new server that won't join.

# curl -XGET 'http://localhost:9200/_nodes?pretty'
{
  "cluster_name" : "muostats",
  "nodes" : {
    "IsuM4F_NQU6pYJPcwJrE4g" : {
      "name" : "ris-webstats04",
      "transport_address" : "inet[/10.210.2.96:9300]",
      "host" : "ris-webstats04",
      "ip" : "10.210.2.96",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/10.210.2.96:9200]",
      "settings" : {
        "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
        "path" : {
          "conf" : "/etc/elasticsearch",
          "data" : "/var/lib/elasticsearch",
          "logs" : "/var/log/elasticsearch",
          "work" : "/tmp/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "cluster" : {
          "name" : "muostats"
        },
        "node" : {
          "name" : "ris-webstats04"
        },
        "discovery" : {
          "zen" : {
            "minimum_master_nodes" : "3",
            "ping" : {
              "multicast" : {
                "enabled" : "false"
              },
              "unicast" : {
                "hosts" : [ "10.210.2.97", "10.210.2.98", "10.210.2.96" ]
              }
            }
          }
        },
        "name" : "ris-webstats04",
        "client" : {
          "type" : "node"
        },
        "config" : {
          "ignore_system_properties" : "true"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 1,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2394,
          "total_cores" : 1,
          "total_sockets" : 1,
          "cores_per_socket" : 32,
          "cache_size_in_bytes" : 12288
        },
        "mem" : {
          "total_in_bytes" : 4069855232
        },
        "swap" : {
          "total_in_bytes" : 1715466240
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 2249,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      },
      "jvm" : {
        "pid" : 2249,
        "version" : "1.8.0_40",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.40-b25",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1437154340378,
        "mem" : {
          "heap_init_in_bytes" : 268435456,
          "heap_max_in_bytes" : 1065025536,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 1065025536
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      },
      "thread_pool" : {
        "percolate" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : "1k"
        },
        "fetch_shard_started" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 2,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "listener" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : "200"
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : "1k"
        },
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "fetch_shard_store" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 2,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : "1k"
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : "50"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "10.210.2.96",
          "name" : "eth0",
          "mac_address" : "D2:00:0C:12:7C:48"
        }
      },
      "transport" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9300]",
        "publish_address" : "inet[/10.210.2.96:9300]",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9200]",
        "publish_address" : "inet[/10.210.2.96:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ {
        "name" : "paramedic",
        "version" : "NA",
        "description" : "No description found.",
        "url" : "/_plugin/paramedic/",
        "jvm" : false,
        "site" : true
      }, {
        "name" : "head",
        "version" : "NA",
        "description" : "No description found.",
        "url" : "/_plugin/head/",
        "jvm" : false,
        "site" : true
      }, {
        "name" : "HQ",
        "version" : "NA",
        "description" : "No description found.",
        "url" : "/_plugin/HQ/",
        "jvm" : false,
        "site" : true
      }, {
        "name" : "bigdesk",
        "version" : "NA",
        "description" : "No description found.",
        "url" : "/_plugin/bigdesk/",
        "jvm" : false,
        "site" : true
      } ]
    }
  }
}

(Brian Dunbar) #9

Reply Part Three

ris-webstats02. My good, good boy.

# curl -XGET 'http://localhost:9200/_nodes?pretty'
curl: (7) couldn't connect to host
[root@ris-webstats02 elasticsearch]# curl -XGET 'http://10.210.2.98:9200/_nodes?pretty'
{
  "cluster_name" : "muostats",
  "nodes" : {
    "LbUW8yCYTDWwDvVVWqH0Gw" : {
      "name" : "ris-webstats03",
      "transport_address" : "inet[/10.210.2.97:9301]",
      "host" : "ris-webstats03",
      "ip" : "127.0.0.1",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/10.210.2.97:9200]",
      "settings" : {
        "path" : {
          "data" : "/home/esdata/data",
          "logs" : "/usr/share/elasticsearch/logs",
          "home" : "/usr/share/elasticsearch"
        },
        "cluster" : {
          "name" : "muostats"
        },
        "node" : {
          "name" : "ris-webstats03",
          "data" : "true"
        },
        "discovery" : {
          "zen" : {
            "ping" : {
              "multicast" : {
                "enabled" : "false"
              },
              "unicast" : {
                "hosts" : [ "10.210.2.26", "10.210.2.97", "10.210.2.98", "10.210.2.96" ]
              }
            }
          }
        },
        "name" : "ris-webstats03",
        "client" : {
          "type" : "node"
        },
        "foreground" : "yes",
        "bootstrap" : {
          "mlockall" : "true"
        },
        "config" : {
          "ignore_system_properties" : "true"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 2,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2394,
          "total_cores" : 2,
          "total_sockets" : 2,
          "cores_per_socket" : 32,
          "cache_size_in_bytes" : 12288
        },
        "mem" : {
          "total_in_bytes" : 10331058176
        },
        "swap" : {
          "total_in_bytes" : 0
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 3566,
        "max_file_descriptors" : 1024000,
        "mlockall" : true
      },
      "jvm" : {
        "pid" : 3566,
        "version" : "1.8.0_40",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.40-b25",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1436995968706,
        "mem" : {
          "heap_init_in_bytes" : 5368709120,
          "heap_max_in_bytes" : 5351276544,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 5351276544
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      },
      "thread_pool" : {
        "percolate" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "fetch_shard_started" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "listener" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "200"
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 4,
          "max" : 4,
          "queue_size" : "1k"
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "fetch_shard_store" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "50"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "10.210.2.97",
          "name" : "eth0",
          "mac_address" : "FA:FB:89:41:8A:21"
        }
      },
      "transport" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0%0:9301]",
        "publish_address" : "inet[/10.210.2.97:9301]",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0%0:9200]",
        "publish_address" : "inet[/10.210.2.97:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ {
        "name" : "HQ",
        "version" : "NA",
        "description" : "No description found.",
        "url" : "/_plugin/HQ/",
        "jvm" : false,
        "site" : true
      } ]
    },
    "vM787Ta4SBufnG6wsp8ahQ" : {
      "name" : "ris-webstats02",
      "transport_address" : "inet[/10.210.2.98:9301]",
      "host" : "ris-webstats02",
      "ip" : "127.0.0.1",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/10.210.2.98:9200]",
      "settings" : {
        "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
        "path" : {
          "conf" : "/etc/elasticsearch",
          "data" : "/home/esdata/data",
          "logs" : "/var/log/elasticsearch",
          "work" : "/tmp/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "cluster" : {
          "name" : "muostats"
        },
        "node" : {
          "name" : "ris-webstats02",
          "data" : "true"
        },
        "discovery" : {
          "zen" : {
            "ping" : {
              "multicast" : {
                "enabled" : "false"
              },
              "unicast" : {
                "hosts" : [ "10.210.2.26", "10.210.2.97", "10.210.2.98", "10.210.2.96" ]
              }
            }
          }
        },
        "name" : "ris-webstats02",
        "client" : {
          "type" : "node"
        },
        "bootstrap" : {
          "mlockall" : "true"
        },
        "config" : {
          "ignore_system_properties" : "true"
        },
        "network" : {
          "host" : "10.210.2.98",
          "publish_host" : "10.210.2.98"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 2,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2660,
          "total_cores" : 2,
          "total_sockets" : 2,
          "cores_per_socket" : 4,
          "cache_size_in_bytes" : 6144
        },
        "mem" : {
          "total_in_bytes" : 10331058176
        },
        "swap" : {
          "total_in_bytes" : 0
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 6147,
        "max_file_descriptors" : 65535,
        "mlockall" : true
      },
      "jvm" : {
        "pid" : 6147,
        "version" : "1.8.0_40",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.40-b25",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1436993849277,
        "mem" : {
          "heap_init_in_bytes" : 5368709120,
          "heap_max_in_bytes" : 5351276544,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 5351276544
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      },
      "thread_pool" : {
        "percolate" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "fetch_shard_started" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "listener" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "200"
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 4,
          "max" : 4,
          "queue_size" : "1k"
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "fetch_shard_store" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "50"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "10.210.2.98",
          "name" : "eth0",
          "mac_address" : "32:32:9A:47:F6:28"
        }
      },
      "transport" : {
        "bound_address" : "inet[/10.210.2.98:9301]",
        "publish_address" : "inet[/10.210.2.98:9301]",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : "inet[/10.210.2.98:9200]",
        "publish_address" : "inet[/10.210.2.98:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ ]
    }
  }
}
[root@ris-webstats02 elasticsearch]#

ris-webstats03. My other good boy.

# curl -XGET 'http://localhost:9200/_nodes?pretty'
{
  "cluster_name" : "muostats",
  "nodes" : {
    "LbUW8yCYTDWwDvVVWqH0Gw" : {
      "name" : "ris-webstats03",
      "transport_address" : "inet[/10.210.2.97:9301]",
      "host" : "ris-webstats03",
      "ip" : "127.0.0.1",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/10.210.2.97:9200]",
      "settings" : {
        "path" : {
          "data" : "/home/esdata/data",
          "logs" : "/usr/share/elasticsearch/logs",
          "home" : "/usr/share/elasticsearch"
        },
        "cluster" : {
          "name" : "muostats"
        },
        "node" : {
          "name" : "ris-webstats03",
          "data" : "true"
        },
        "discovery" : {
          "zen" : {
            "ping" : {
              "multicast" : {
                "enabled" : "false"
              },
              "unicast" : {
                "hosts" : [ "10.210.2.26", "10.210.2.97", "10.210.2.98", "10.210.2.96" ]
              }
            }
          }
        },
        "name" : "ris-webstats03",
        "client" : {
          "type" : "node"
        },
        "foreground" : "yes",
        "bootstrap" : {
          "mlockall" : "true"
        },
        "config" : {
          "ignore_system_properties" : "true"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 2,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2394,
          "total_cores" : 2,
          "total_sockets" : 2,
          "cores_per_socket" : 32,
          "cache_size_in_bytes" : 12288
        },
        "mem" : {
          "total_in_bytes" : 10331058176
        },
        "swap" : {
          "total_in_bytes" : 0
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 3566,
        "max_file_descriptors" : 1024000,
        "mlockall" : true
      },
      "jvm" : {
        "pid" : 3566,
        "version" : "1.8.0_40",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.40-b25",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1436995968706,
        "mem" : {
          "heap_init_in_bytes" : 5368709120,
          "heap_max_in_bytes" : 5351276544,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 5351276544
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      },
      "thread_pool" : {
        "percolate" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "fetch_shard_started" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "listener" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "200"
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 4,
          "max" : 4,
          "queue_size" : "1k"
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "fetch_shard_store" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "50"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "10.210.2.97",
          "name" : "eth0",
          "mac_address" : "FA:FB:89:41:8A:21"
        }
      },
      "transport" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9301]",
        "publish_address" : "inet[/10.210.2.97:9301]",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9200]",
        "publish_address" : "inet[/10.210.2.97:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ {
        "name" : "HQ",
        "version" : "NA",
        "description" : "No description found.",
        "url" : "/_plugin/HQ/",
        "jvm" : false,
        "site" : true
      } ]
    },
    "vM787Ta4SBufnG6wsp8ahQ" : {
      "name" : "ris-webstats02",
      "transport_address" : "inet[/10.210.2.98:9301]",
      "host" : "ris-webstats02",
      "ip" : "127.0.0.1",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/10.210.2.98:9200]",
      "settings" : {
        "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
        "path" : {
          "conf" : "/etc/elasticsearch",
          "data" : "/home/esdata/data",
          "logs" : "/var/log/elasticsearch",
          "work" : "/tmp/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "cluster" : {
          "name" : "muostats"
        },
        "node" : {
          "name" : "ris-webstats02",
          "data" : "true"
        },
        "discovery" : {
          "zen" : {
            "ping" : {
              "multicast" : {
                "enabled" : "false"
              },
              "unicast" : {
                "hosts" : [ "10.210.2.26", "10.210.2.97", "10.210.2.98", "10.210.2.96" ]
              }
            }
          }
        },
        "name" : "ris-webstats02",
        "client" : {
          "type" : "node"
        },
        "bootstrap" : {
          "mlockall" : "true"
        },
        "config" : {
          "ignore_system_properties" : "true"
        },
        "network" : {
          "host" : "10.210.2.98",
          "publish_host" : "10.210.2.98"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 2,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2660,
          "total_cores" : 2,
          "total_sockets" : 2,
          "cores_per_socket" : 4,
          "cache_size_in_bytes" : 6144
        },
        "mem" : {
          "total_in_bytes" : 10331058176
        },
        "swap" : {
          "total_in_bytes" : 0
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 6147,
        "max_file_descriptors" : 65535,
        "mlockall" : true
      },
      "jvm" : {
        "pid" : 6147,
        "version" : "1.8.0_40",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.40-b25",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1436993849277,
        "mem" : {
          "heap_init_in_bytes" : 5368709120,
          "heap_max_in_bytes" : 5351276544,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 5351276544
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      },
      "thread_pool" : {
        "percolate" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "fetch_shard_started" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "listener" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "200"
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 4,
          "max" : 4,
          "queue_size" : "1k"
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "fetch_shard_store" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "50"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "10.210.2.98",
          "name" : "eth0",
          "mac_address" : "32:32:9A:47:F6:28"
        }
      },
      "transport" : {
        "bound_address" : "inet[/10.210.2.98:9301]",
        "publish_address" : "inet[/10.210.2.98:9301]",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : "inet[/10.210.2.98:9200]",
        "publish_address" : "inet[/10.210.2.98:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ ]
    }
  }
}
[root@ris-webstats03 config]#

(Nemo) #10

Can you add "10.210.2.26" to the unicast hosts list and set a publish_address for every node? In the output I can see only 10.210.2.96, 10.210.2.97, and 10.210.2.98, but not 10.210.2.26. Please add the corresponding network.publish_host on each node, set node.master: true explicitly on the master-eligible nodes, and try again?


(Brian Dunbar) #11

Taking these one host at a time. Added publish_host to ris-webstats04 and ris-webstats03.

ris-webstats03

cluster.name: muostats
node.name: "ris-webstats03"
node.data: true
path.data: /home/esdata/data
bootstrap.mlockall: true
network.publish_host: 10.210.2.97
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.26","10.210.2.97","10.210.2.98","10.210.2.96"]

ris-webstats04

cluster.name: muostats
node.name: "ris-webstats04"
network.publish_host: 10.210.2.96
network.host: 10.210.2.96
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.26","10.210.2.97","10.210.2.98","10.210.2.96"]
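For what it's worth, discovery.zen.minimum_master_nodes: 3 matches the usual quorum formula for four master-eligible nodes, floor(N/2) + 1:

```shell
# Quorum for N master-eligible nodes is floor(N/2) + 1; with N=4 that is 3,
# which is the value both configs above set.
N=4
echo "minimum_master_nodes for $N nodes: $(( N / 2 + 1 ))"
```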

And LO they joined together to be as one.

curl http://ris-webstats02.int.domain.com:9200/_cat/nodes
ris-webstats04 10.210.2.96 31 24 0.26 d m ris-webstats04
ris-webstats02 127.0.0.1   75 64 1.76 d * ris-webstats02
ris-webstats03 127.0.0.1   37 64 1.81 d m ris-webstats03

And I can see shards being assigned to the formerly non-data ris-webstats04:

[root@ris-webstats04 elasticsearch]# curl -s 10.210.2.96:9200/_cat/shards | grep ris-webstats04
logstash-2013.12.31 2 r STARTED        51036  26.4mb 10.210.2.96 ris-webstats04
logstash-2013.12.31 0 r STARTED        51061  26.4mb 10.210.2.96 ris-webstats04
logstash-2013.12.31 3 r STARTED        51031  26.4mb 10.210.2.96 ris-webstats04
logstash-2013.12.31 1 r STARTED        51023  26.4mb 10.210.2.96 ris-webstats04

Currently I have 948 unassigned shards (and falling); it looks as if my next problem is ris-webstats01 and its mysterious desire for a directory that does not exist.
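To watch that number fall I'm just counting UNASSIGNED lines from _cat/shards; a sketch against a saved dump (the sample lines below are invented for illustration):

```shell
# Save a snapshot of shard state, then count what is still unassigned.
# In practice: curl -s localhost:9200/_cat/shards > /tmp/shards.txt
cat > /tmp/shards.txt <<'EOF'
logstash-2013.12.31 2 r STARTED    51036 26.4mb 10.210.2.96 ris-webstats04
logstash-2013.12.30 1 p UNASSIGNED
logstash-2013.12.30 1 r UNASSIGNED
EOF
grep -c UNASSIGNED /tmp/shards.txt
```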


(Brian Dunbar) #12

Thought: if I unset path.data and/or point it at a new path, one that is writable but otherwise empty of data, will the cluster start rebalancing shards and parcel the data back to ris-webstats01?
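Another thought: 1.x also has a cluster reroute API for assigning an unassigned shard by hand. A sketch of the request (the index, shard, and node values here are placeholders, and allow_primary: true would discard data on a lost primary, so I'd leave it false):

```shell
# Build a reroute body; the index/shard/node below are illustrative placeholders.
BODY='{"commands":[{"allocate":{"index":"logstash-2013.12.31","shard":0,"node":"ris-webstats01","allow_primary":false}}]}'
echo "$BODY"
# Then submit it to any node:
#   curl -XPOST 'localhost:9200/_cluster/reroute' -d "$BODY"
```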


(Nemo) #13

What about ris-webstats01? I think that one is your bad boy! Did you apply the same changes to it? Did you try pinging the nodes from each other? Make sure they are all reachable. Which directory is missing?
If this is a dev environment, remove all the data and restart every node once. Please show your initial logs from boot time (all nodes).


(Brian Dunbar) #14

Did you apply the same changes to it?

Yes:
ris-webstats01

cluster.name: muostats
node.name: "ris-webstats01"
path.data: /usr/share/elasticsearch-1.6.0/data
bootstrap.mlockall: true
network.publish_host: 10.210.2.26
network.host: 10.210.2.26
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.210.2.26","10.210.2.97","10.210.2.98","10.210.2.96"]

Did you try pinging each other?

01 can telnet to port 9300 on all of the other nodes.
02, 03, and 04 can all ping 01.

What directory is missing?

/usr/share/elasticsearch-1.6.0/data/muostats/nodes/3

If it is dev env, remove all data and restart all node once

It's not.

Question: the data is replicated on 02, 03, and 04. What happens on 01 if I simply delete the directory and start over?

Please show your initial logs at the time of bootup

ris-webstats01

[2015-07-17 14:57:08,461][WARN ][bootstrap                ] Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).
[2015-07-17 14:57:08,536][INFO ][node                     ] [ris-webstats01] version[1.6.0], pid[26479], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-17 14:57:08,536][INFO ][node                     ] [ris-webstats01] initializing ...
[2015-07-17 14:57:08,540][INFO ][plugins                  ] [ris-webstats01] loaded [], sites []
[2015-07-17 14:57:08,573][ERROR][bootstrap                ] Exception
org.elasticsearch.ElasticsearchIllegalStateException: Failed to created node environment
	at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:164)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:77)
	at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:245)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
	at java.nio.file.Files.createDirectory(Files.java:674)
	at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
	at java.nio.file.Files.createDirectories(Files.java:767)
	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:126)
	at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:162)
	... 4 more 

ris-webstats02

[2015-07-15 13:46:58,298][DEBUG][action.bulk              ] [ris-webstats02] observer timed out. notifying listener. timeout setting [1m], time since start [20.3m]
[2015-07-15 13:55:34,044][WARN ][bootstrap                ] Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).
[2015-07-15 13:55:34,228][INFO ][node                     ] [ris-webstats02] version[1.6.0], pid[6005], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-15 13:55:34,228][INFO ][node                     ] [ris-webstats02] initializing ...
[2015-07-15 13:55:34,236][INFO ][plugins                  ] [ris-webstats02] loaded [], sites []
[2015-07-15 13:55:34,352][INFO ][env                      ] [ris-webstats02] using [1] data paths, mounts [[/home (/dev/mapper/VolGroup-lv_home)]], net usable_space [155.8gb], net total_space [454.2gb], types [ext4]
[2015-07-15 13:57:36,492][INFO ][node                     ] [ris-webstats02] version[1.6.0], pid[6147], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-15 13:57:36,493][INFO ][node                     ] [ris-webstats02] initializing ...
[2015-07-15 13:57:36,498][INFO ][plugins                  ] [ris-webstats02] loaded [], sites []
[2015-07-15 13:57:36,554][INFO ][env                      ] [ris-webstats02] using [1] data paths, mounts [[/home (/dev/mapper/VolGroup-lv_home)]], net usable_space [155.8gb], net total_space [454.2gb], types [ext4]
[2015-07-15 13:57:49,409][INFO ][node                     ] [ris-webstats02] initialized
[2015-07-15 13:57:49,413][INFO ][node                     ] [ris-webstats02] starting ...
[2015-07-15 13:57:49,787][INFO ][transport                ] [ris-webstats02] bound_address {inet[/10.210.2.98:9301]}, publish_address {inet[/10.210.2.98:9301]}
[2015-07-15 13:57:49,821][INFO ][discovery                ] [ris-webstats02] muostats/vM787Ta4SBufnG6wsp8ahQ
[2015-07-15 13:57:53,049][INFO ][cluster.service          ] [ris-webstats02] detected_master [ris-webstats01][PUoYWozIQuigdl_0_m7BWQ][ris-webstats01][inet[/10.210.2.26:9300]], added {[ris-webstats01][PUoYWozIQuigdl_0_m7BWQ][ris-webstats01][inet[/10.210.2.26:9300]],[ris-webstats04][7_Q1v-VJREiOxtKJAL2jYw][ris-webstats04][inet[/10.210.2.96:9300]],}, reason: zen-disco-receive(from master [[ris-webstats01][PUoYWozIQuigdl_0_m7BWQ][ris-webstats01][inet[/10.210.2.26:9300]]])
[2015-07-15 13:57:53,435][INFO ][http                     ] [ris-webstats02] bound_address {inet[/10.210.2.98:9200]}, publish_address {inet[/10.210.2.98:9200]}
[2015-07-15 13:57:53,435][INFO ][node                     ] [ris-webstats02] started

ris-webstats03

[es@ris-webstats03 ~]$ [2015-07-17 14:19:40,615][INFO ][node                     ] [ris-webstats03] version[1.6.0], pid[12580], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-17 14:19:40,616][INFO ][node                     ] [ris-webstats03] initializing ...
[2015-07-17 14:19:40,624][INFO ][plugins                  ] [ris-webstats03] loaded [], sites [HQ]
[2015-07-17 14:19:40,674][INFO ][env                      ] [ris-webstats03] using [1] data paths, mounts [[/home (/dev/mapper/VolGroup-lv_home)]], net usable_space [85.4gb], net total_space [454.2gb], types [ext4]
[2015-07-17 14:19:45,600][INFO ][node                     ] [ris-webstats03] initialized
[2015-07-17 14:19:45,601][INFO ][node                     ] [ris-webstats03] starting ...
[2015-07-17 14:19:45,827][INFO ][transport                ] [ris-webstats03] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.210.2.97:9301]}
[2015-07-17 14:19:45,867][INFO ][discovery                ] [ris-webstats03] muostats/xvHXNwH5Ts-coYui54gvlA
[2015-07-17 14:19:49,092][INFO ][cluster.service          ] [ris-webstats03] detected_master [ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]], added {[ris-webstats04][Nhfg_pI3RuOrsZODB2eTYw][ris-webstats04][inet[/10.210.2.96:9300]],[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]],}, reason: zen-disco-receive(from master [[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]]])
[2015-07-17 14:19:49,640][INFO ][http                     ] [ris-webstats03] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.210.2.97:9200]}
[2015-07-17 14:19:49,641][INFO ][node                     ] [ris-webstats03] started

ris-webstats04

[2015-07-17 14:11:27,909][INFO ][node                     ] [ris-webstats04] version[1.6.0], pid[2516], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-17 14:11:27,910][INFO ][node                     ] [ris-webstats04] initializing ...
[2015-07-17 14:11:27,919][INFO ][plugins                  ] [ris-webstats04] loaded [], sites [paramedic, head, HQ, bigdesk]

[2015-07-17 14:11:28,020][INFO ][env                      ] [ris-webstats04] using [1] data paths, mounts [[/ (/dev/mapper/VolGroup-lv_root)]], net usable_space [468gb], net total_space [504.7gb], types [ext4]
[2015-07-17 14:11:34,397][INFO ][node                     ] [ris-webstats04] initialized
[2015-07-17 14:11:34,399][INFO ][node                     ] [ris-webstats04] starting ...
[2015-07-17 14:11:34,500][INFO ][transport                ] [ris-webstats04] bound_address {inet[/10.210.2.96:9300]}, publish_address {inet[/10.210.2.96:9300]}
[2015-07-17 14:11:34,528][INFO ][discovery                ] [ris-webstats04] muostats/Nhfg_pI3RuOrsZODB2eTYw
[2015-07-17 14:12:04,528][WARN ][discovery                ] [ris-webstats04] waited for 30s and no initial state was set by the discovery
[2015-07-17 14:12:04,536][INFO ][http                     ] [ris-webstats04] bound_address {inet[/10.210.2.96:9200]}, publish_address {inet[/10.210.2.96:9200]}
[2015-07-17 14:12:04,540][INFO ][node                     ] [ris-webstats04] started
[2015-07-17 14:12:09,700][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:12:35,690][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:12:39,702][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:12:39,854][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:12:46,692][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:12:52,688][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:12:58,687][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:03,689][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:05,690][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:09,687][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:09,855][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:10,091][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:16,693][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:19,687][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:22,689][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:25,685][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:28,688][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:30,787][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:33,689][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:36,687][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:39,688][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:40,091][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:40,208][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:48,842][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:49,687][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:54,686][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:13:55,685][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:13:59,686][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:00,787][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:05,684][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:06,688][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:10,209][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:10,290][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:16,682][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:18,842][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:21,693][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:24,687][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:27,685][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:29,687][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:33,685][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:35,686][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:39,680][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:40,290][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:40,494][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:46,683][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:50,682][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 14:14:51,693][DEBUG][action.admin.cluster.state] [ris-webstats04] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2015-07-17 14:14:56,324][INFO ][cluster.service          ] [ris-webstats04] detected_master [ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]], added {[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]],}, reason: zen-disco-receive(from master [[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]]])
[2015-07-17 14:18:29,451][INFO ][cluster.service          ] [ris-webstats04] added {[ris-webstats03][xvHXNwH5Ts-coYui54gvlA][ris-webstats03][inet[/10.210.2.97:9301]],}, reason: zen-disco-receive(from master [[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]]])

(Nemo) #15

Thank you for your patience :)

I can infer two things from your logs.

  1. Two different master nodes were elected: split brain! This is definitely not a state you want to be in.

[2015-07-17 14:14:56,324][INFO ][cluster.service ] [ris-webstats04] detected_master [ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]], added {[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]],}, reason: zen-disco-receive(from master [[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]]])
[2015-07-17 14:18:29,451][INFO ][cluster.service ] [ris-webstats04] added {[ris-webstats03][xvHXNwH5Ts-coYui54gvlA][ris-webstats03][inet[/10.210.2.97:9301]],}, reason: zen-disco-receive(from master [[ris-webstats02][vM787Ta4SBufnG6wsp8ahQ][ris-webstats02][inet[/10.210.2.98:9301]]])

[2015-07-15 13:57:53,049][INFO ][cluster.service ] [ris-webstats02] detected_master [ris-webstats01][PUoYWozIQuigdl_0_m7BWQ][ris-webstats01][inet[/10.210.2.26:9300]], added {[ris-webstats01][PUoYWozIQuigdl_0_m7BWQ][ris-webstats01][inet[/10.210.2.26:9300]],[ris-webstats04][7_Q1v-VJREiOxtKJAL2jYw][ris-webstats04][inet[/10.210.2.96:9300]],}, reason: zen-disco-receive(from master [[ris-webstats01][PUoYWozIQuigdl_0_m7BWQ][ris-webstats01][inet[/10.210.2.26:9300]]])
[2015-07-15 13:57:53,435][INFO ][http ] [ris-webstats02] bound_address {inet[/10.210.2.98:9200]}, publish_address {inet[/10.210.2.98:9200]}

Please fix this first!

  2. The /usr/share/elasticsearch-1.6.0/data/muostats/nodes/3 directory does not exist because Elasticsearch is failing to create it. Please double-check that the user running ES has permission to create that directory.
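A quick way to check (the es user in the chown line is an assumption based on your shell prompts; use whatever account the ES process actually runs as):

```shell
# Inspect ownership on the data path from the failing node's config.
DATA_DIR=/usr/share/elasticsearch-1.6.0/data
echo "checking ownership under: $DATA_DIR"
ls -ld "$DATA_DIR" 2>/dev/null || echo "path not present on this machine"
# If the owner is wrong, fix it and restart the node:
#   chown -R es:es "$DATA_DIR"
```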

(Brian Dunbar) #16

Right! How? Restart ES on one of the masters?

Edit

I restarted 04, and it would not rejoin. I restarted 03 and it would not rejoin.

03 Log

[es@ris-webstats03 config]$ [2015-07-17 17:27:17,277][INFO ][node                     ] [ris-webstats03] version[1.6.0], pid[19695], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-17 17:27:17,279][INFO ][node                     ] [ris-webstats03] initializing ...
[2015-07-17 17:27:17,287][INFO ][plugins                  ] [ris-webstats03] loaded [], sites [HQ]
[2015-07-17 17:27:17,342][INFO ][env                      ] [ris-webstats03] using [1] data paths, mounts [[/home (/dev/mapper/VolGroup-lv_home)]], net usable_space [85.4gb], net total_space [454.2gb], types [ext4]
[2015-07-17 17:27:32,694][INFO ][node                     ] [ris-webstats03] initialized
[2015-07-17 17:27:32,695][INFO ][node                     ] [ris-webstats03] starting ...
[2015-07-17 17:27:32,782][INFO ][transport                ] [ris-webstats03] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.210.2.97:9301]}
[2015-07-17 17:27:32,799][INFO ][discovery                ] [ris-webstats03] muostats/1un_zNVKSX6ftHY0kho_9w
[2015-07-17 17:27:59,881][WARN ][discovery.zen.ping.unicast] [ris-webstats03] failed to send ping to [[#zen_unicast_3#][ris-webstats03][inet[/10.210.2.98:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [][inet[/10.210.2.98:9300]][internal:discovery/zen/unicast_gte_1_4]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPingRequestTo14NodeWithFallback(UnicastZenPing.java:431)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:413)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.ping(UnicastZenPing.java:219)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:146)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:124)
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:996)
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:360)
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$6100(ZenDiscovery.java:85)
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1373)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [][inet[/10.210.2.98:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
        ... 12 more
[2015-07-17 17:28:02,800][WARN ][discovery                ] [ris-webstats03] waited for 30s and no initial state was set by the discovery
[2015-07-17 17:28:02,807][INFO ][http                     ] [ris-webstats03] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.210.2.97:9200]}
[2015-07-17 17:28:02,807][INFO ][node                     ] [ris-webstats03] started
[2015-07-17 17:29:08,760][DEBUG][action.admin.cluster.state] [ris-webstats03] no known master node, scheduling a retry
[2015-07-17 17:29:17,998][WARN ][discovery.zen.ping.unicast] [ris-webstats03] failed to send ping to [[#zen_unicast_3#][ris-webstats03][inet[/10.210.2.98:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [][inet[/10.210.2.98:9300]][internal:discovery/zen/unicast_gte_1_4]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPingRequestTo14NodeWithFallback(UnicastZenPing.java:431)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:413)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.ping(UnicastZenPing.java:219)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:146)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:124)
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:996)
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:360)
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$6100(ZenDiscovery.java:85)
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1373)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [][inet[/10.210.2.98:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
        ... 12 more
[2015-07-17 17:29:30,014][WARN ][discovery.zen.ping.unicast] [ris-webstats03] failed to send ping to [[#zen_unicast_3#][ris-webstats03][inet[/10.210.2.98:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [][inet[/10.210.2.98:9300]][internal:discovery/zen/unicast_gte_1_4]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPingRequestTo14NodeWithFallback(UnicastZenPing.java:431)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:413)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.ping(UnicastZenPing.java:219)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:146)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:124)
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:996)
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:360)
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$6100(ZenDiscovery.java:85)
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1373)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [][inet[/10.210.2.98:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
        ... 12 more
[2015-07-17 17:29:36,021][WARN ][discovery.zen.ping.unicast] [ris-webstats03] failed to send ping to [[#zen_unicast_3#][ris-webstats03][inet[/10.210.2.98:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [][inet[/10.210.2.98:9300]][internal:discovery/zen/unicast_gte_1_4]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPingRequestTo14NodeWithFallback(UnicastZenPing.java:431)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:413)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.ping(UnicastZenPing.java:219)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:146)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:124)
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:996)
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:360)
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$6100(ZenDiscovery.java:85)
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1373)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [][inet[/10.210.2.98:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
        ... 12 more
[2015-07-17 17:29:36,569][DEBUG][action.admin.cluster.state] [ris-webstats03] no known master node, scheduling a retry
[2015-07-17 17:29:38,763][DEBUG][action.admin.cluster.state] [ris-webstats03] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]

04 Log

[2015-07-17 17:33:29,599][INFO ][node                     ] [ris-webstats04] version[1.6.0], pid[3003], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-17 17:33:29,603][INFO ][node                     ] [ris-webstats04] initializing ...
[2015-07-17 17:33:29,615][INFO ][plugins                  ] [ris-webstats04] loaded [], sites [paramedic, head, HQ, bigdesk]
[2015-07-17 17:33:29,715][INFO ][env                      ] [ris-webstats04] using [1] data paths, mounts [[/ (/dev/mapper/VolGroup-lv_root)]], net usable_space [402.9gb], net total_space [504.7gb], types [ext4]
[2015-07-17 17:33:36,386][INFO ][node                     ] [ris-webstats04] initialized
[2015-07-17 17:33:36,390][INFO ][node                     ] [ris-webstats04] starting ...
[2015-07-17 17:33:36,510][INFO ][transport                ] [ris-webstats04] bound_address {inet[/10.210.2.96:9300]}, publish_address {inet[/10.210.2.96:9300]}
[2015-07-17 17:33:36,539][INFO ][discovery                ] [ris-webstats04] muostats/E8HUYNxeQEWOigqOtFs7cg
[2015-07-17 17:34:06,539][WARN ][discovery                ] [ris-webstats04] waited for 30s and no initial state was set by the discovery
[2015-07-17 17:34:06,546][INFO ][http                     ] [ris-webstats04] bound_address {inet[/10.210.2.96:9200]}, publish_address {inet[/10.210.2.96:9200]}
[2015-07-17 17:34:06,549][INFO ][node                     ] [ris-webstats04] started
[2015-07-17 17:34:06,864][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 17:34:06,948][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 17:34:06,974][DEBUG][action.admin.cluster.health] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 17:34:07,030][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry
[2015-07-17 17:34:07,933][DEBUG][action.admin.cluster.state] [ris-webstats04] no known master node, scheduling a retry

Edit

So I have three nodes, and two of them thought they were master, causing a split-brain. Now what?

This is like one of those nightmares where, no matter what you do, things just get worse and worse. I'm getting dinner and inserting a pause in the operation before I accidentally format the drives or something. Back in an hour.

After Dinner Edit

I thought about it and read a little, specifically this page, which told me to set the quorum to (number of master-eligible nodes / 2) + 1.

I set this value on each of the three working nodes:

discovery.zen.minimum_master_nodes: 2
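That formula can be sketched in a couple of lines of shell; the node count of 3 is this cluster's, and integer division gives the floor:

```shell
# Quorum sketch: minimum_master_nodes should be floor(N / 2) + 1,
# where N is the number of master-eligible nodes.
MASTER_ELIGIBLE=3                         # this cluster's master-eligible nodes
QUORUM=$(( MASTER_ELIGIBLE / 2 + 1 ))     # integer division floors: 3/2 + 1 = 2
echo "discovery.zen.minimum_master_nodes: $QUORUM"
```

With 3 master-eligible nodes this prints `discovery.zen.minimum_master_nodes: 2`, so any majority of 2 can elect a master and a lone partitioned node cannot.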

I told all the hosts they can be master (i.e. the default) and started the three surviving hosts. 02 and 04 came up and decided that 02 should be master. Then 03 came online and decided IT should be master, and I had two masters again. I killed 03.

So now I have this while the cluster assigns shards:

ris-webstats02 127.0.0.1   55 40 6.50 d m ris-webstats02
ris-webstats04 10.210.2.96 69 32 4.66 d * ris-webstats04

Should minimum_master_nodes be 3 in a 3-host cluster?


(Brian Dunbar) #17

You were absolutely correct, and I was wrong.

I set this value on nodes 01 and 03, and both have launched and joined the cluster.

node.master: false

Clearly I need to understand the quorum value and re-think my architecture.
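For the record, the layout I ended up with can be sketched as elasticsearch.yml fragments. This is a hedged reconstruction from the steps above, not copied from the hosts:

```yaml
# Sketch of per-node role settings for this cluster (assumed layout):
#
# ris-webstats02 and ris-webstats04 -- master-eligible data nodes:
node.master: true
node.data: true
# quorum of the 2 master-eligible nodes: floor(2 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2

# ris-webstats01 and ris-webstats03 -- data-only, never master:
# node.master: false
# node.data: true
```

With minimum_master_nodes at 2 and only two master-eligible nodes, both must be up to elect a master; that trades some availability for never splitting the brain again.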


(Nemo) #18

Glad that it worked! 🙂


(system) #19