Adding a new node to the cluster is not so easy -- new node is busted

Life was beautiful on a single node, but I index 300 GB of logs per day, so after ten or eleven days I added a new storage node. The new node doesn't work. I had high hopes that the cluster would automagically provision it, but I guess it needs help, or I need to change my expectations.

I am using ES 2.1.1 on Java 1.7.

The nodes are on the same IP subnet. The original node seems to see the new node, but the new node complains about missing indexes and has other errors, including messages about a node leaving. I can ping and connect to ports 9200 and 9300 from each server to the other. There are no firewalls in between.

I have blindly included some logs and data from the cluster in hopes they make sense to someone with a clue. Does anyone know what is going wrong here?

Cluster status information: http://pastebin.com/VWz9jNU7

Logs from new node: http://pastebin.com/wvS8SBZ1

thanks,
j.

Did you configure unicast and a unicast host list on either of the nodes?
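Something along these lines in elasticsearch.yml on both nodes is what I'd expect (the IPs here are just placeholders, not your actual addresses):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["192.168.0.1", "192.168.0.2"]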

I have made some progress on this.

I have added a dedicated master node; together with the master role on the existing data node, that makes two master-eligible nodes. I changed discovery.zen.minimum_master_nodes to 2. I no longer see the second data node disconnecting.

However, the new node seems unable to create indexes. When midnight Zulu time rolled around, the cluster tried to create a new Logstash index on the new data node. The index went red and the cluster stopped.

Deleting the new index moved the cluster from red to yellow for about two seconds, until the cluster tried to create a new index on the new data node and went back to red.

I shut down Elasticsearch on the new node, deleted the index from the cluster, and life moved on with the new index now on the old node.

How can I determine why the new node is having problems? (A couple of commands I was thinking of running are below.)
es005 = new node
es004 = original node
es005 log can be found here: http://broken.net/pbcluster002.txt
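These are the sorts of calls I was planning to run to see where the shards are stuck -- just guessing at what's useful here:

curl 'es004:9200/_cluster/health?level=indices&pretty'
curl 'es004:9200/_cat/shards?v'
curl 'es004:9200/_cat/allocation?v'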

here are the configs:

[root@es005 elasticsearch]# egrep -v ^# elasticsearch.yml
cluster.name: pbcluster002
node.name: es005
bootstrap.mlockall: false
network.host: 192.168.134.29
http.port: 9200
discovery.zen.ping.unicast.hosts: [ "192.168.134.27" ]
discovery.zen.ping.multicast.enabled: false
discovery.zen.minimum_master_nodes: 2
node.master: false
node.data: true


[root@es004 elasticsearch]# egrep -v ^# elasticsearch.yml
cluster.name: pbcluster002
node.name: es004
bootstrap.mlockall: false
network.host: 192.168.134.27
http.port: 9200
discovery.zen.ping.unicast.hosts: ["192.168.134.27", "192.168.134.29" ]
discovery.zen.minimum_master_nodes: 2
threadpool.bulk.queue_size: 1000
index:
  number_of_shards: 1
node.master: true
node.data: true

Am I correct in reading that you have exactly two nodes, one of them set to be master-eligible and the other not, but you have discovery.zen.minimum_master_nodes set to 2? That will not work: you need at least as many master-eligible nodes as the value of this setting. Preferably the setting equals at least a quorum of your master-eligible nodes. Note that with two nodes you're in a tricky position, because setting discovery.zen.minimum_master_nodes to 1 risks split brain, while setting it to 2 drops the possibility of high availability.
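To make the arithmetic concrete: with three master-eligible nodes the quorum is (3 / 2) + 1 = 2 (integer division). You can set the value in elasticsearch.yml, or adjust it on a running cluster with something like this (host name borrowed from your configs):

curl -XPUT 'es004:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}'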

Can you provide the output from the cluster nodes info API?
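Something like this should do it (adjust the host to taste):

curl 'es004:9200/_nodes?pretty'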

I am over the first hurdle. The problem was that I installed Marvel on the first node, which required the license plugin. There is some bug where new nodes won't join if they do not also have the license plugin. Apparently the license plugin changes the cluster metadata, and custom cluster metadata has to be synchronized for nodes to join. Installing license and Marvel on the new node allowed it to join (Marvel is now broken, but it was going to break at some point anyway).
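For anyone following along, the 2.x plugin installs are along these lines (the path assumes an RPM install; adjust to wherever your Elasticsearch lives):

/usr/share/elasticsearch/bin/plugin install license
/usr/share/elasticsearch/bin/plugin install marvel-agent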

The very next thing that happened was that primary shards started moving over to the new data node, and after twenty minutes the two hot shards from today's Logstash index went red. I ran a Lucene health check against the shards and they came back clean. In frustration, I excluded the new node (curl -XPUT es005:9200/_cluster/settings -d '{ "transient": {"cluster.routing.allocation.exclude._id": "C1hpqpRzQ4KR6MtlIMGz3w"}}') and deleted the index, creating a time gap in the log data which I guess I will have to replay.
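Once I trust the node again, my understanding is that clearing the exclusion with an empty value should let shards move back, though I'm not 100% sure this is the cleanest way:

curl -XPUT 'es005:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.exclude._id": "" }
}'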

So, now the new node has joined the cluster and I am gun-shy about putting data on it. What are appropriate levels of logging to turn on to determine what is causing the indexes to go red?
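I was thinking of bumping a couple of loggers dynamically, something like the following, though I'm not sure these are the right ones to watch:

curl -XPUT 'es004:9200/_cluster/settings' -d '{
  "transient": {
    "logger.indices.recovery": "DEBUG",
    "logger.cluster.service": "DEBUG"
  }
}'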

thanks,
j.