Two-node cluster / Logstash / Kibana 4 / Puppet: unknown cluster problems

hi,

I'm new to Elasticsearch and trying to replace my logserver with two Debian Jessie servers. I had everything (Elasticsearch (1.7.2) / Logstash (.5.4-1) / Kibana 4 (4.0.1)) running as a single-node installation via a Puppet manifest.
Now I'm trying to get it working with two nodes (two hosts), but all services are very fragile, and I think the main problem is a bad cluster configuration.
For example: Kibana 4 won't start / gives error messages, and Logstash logs a lot of messages like ":message=>"retrying failed action with response code: 503"" and so on.

Both nodes/hosts can communicate, and all required ports are open in iptables. Kibana 4 uses http://localhost:9200. For Logstash I tried both the "transport" and the "http" protocol ...

  • Host/Node 1

    MANAGED BY PUPPET


    cluster:
      name: informatiklog
      routing:
        allocation:
          awareness:
            attributes: rack
    discovery:
      zen:
        minimum_master_nodes: 1
        ping:
          multicast:
            enabled: false
          timeout: 30s
          unicast:
            hosts:
              - elasearch-01
              - elasearch-02
    gateway:
      expected_nodes: 1
      recover_after_nodes: 2
      recover_after_time: 5m
      type: local
    http:
      host: 127.0.0.1
    index:
      number_of_replicas: 2
    node:
      name: elasearch-01
    path:
      data: /usr/share/elasticsearch/data/log-fb
    transport:
      host: 10.172.0.19

  • Host/Node 2

    MANAGED BY PUPPET


    cluster:
      name: informatiklog
      routing:
        allocation:
          awareness:
            attributes: rack
    discovery:
      zen:
        minimum_master_nodes: 1
        ping:
          multicast:
            enabled: false
          timeout: 30s
          unicast:
            hosts:
              - elasearch-01
              - elasearch-02
    gateway:
      expected_nodes: 1
      recover_after_nodes: 2
      recover_after_time: 5m
      type: local
    http:
      host: 127.0.0.1
    index:
      number_of_replicas: 2
    node:
      name: elasearch-02
    path:
      data: /usr/share/elasticsearch/data/log-fb
    transport:
      host: 10.172.0.20

(Because of the body limit -> Logs: http://pastebin.com/HEfKev8z)

Physically, the nodes are VMs on two separate Proxmox hosts. The VMs are in a test state; once I get everything working (via Puppet), I'll reinstall both nodes with more RAM / CPU and disk space.
The goal is a failover setup with haproxy/nginx/keepalived ...

So, what have I done wrong?

What's the cluster's health like?

hi Magnus,

elasearch-01:~# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "informatiklog",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 17,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

I rebooted both hosts. -> "status" : "red",

Well, that certainly explains why Logstash and Kibana don't work. The question is, why isn't ES allocating any of the 17 unassigned shards? Don't the ES logs contain any clues? Were those shards present in the original cluster before you converted it to a two-node cluster? And how did you do that? I wonder if you somehow managed to keep the cluster state (i.e. metadata about the 17 shards) but scrapped the actual shard data.

hi Magnus,

you pointed me to the actual problem. I deleted the whole index and the cluster became "green". The second I reload the Kibana page, the cluster state becomes "red" again. The problem is/was the ".kibana" index. I actually don't know enough about Elasticsearch yet, so I tried some other things ... Kibana now loads correctly, but the cluster state is "yellow":

elasearch-01:~# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "informatiklog",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 6,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
} 

  curl -XGET http://localhost:9200/_cat/shards
.kibana             0 p STARTED       1 2.5kb 10.172.0.20 elasearch-02
.kibana             0 r STARTED       1 2.5kb 10.172.0.19 elasearch-01
logstash-2015.09.16 2 p STARTED    9826 3.7mb 10.172.0.20 elasearch-02
logstash-2015.09.16 2 r STARTED    9826 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 2 r UNASSIGNED
logstash-2015.09.16 0 p STARTED    9867 3.8mb 10.172.0.20 elasearch-02
logstash-2015.09.16 0 r STARTED    9867 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 0 r UNASSIGNED
logstash-2015.09.16 3 r STARTED    9897 3.7mb 10.172.0.20 elasearch-02
logstash-2015.09.16 3 p STARTED    9897 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 3 r UNASSIGNED
logstash-2015.09.16 1 r STARTED    9861 3.8mb 10.172.0.20 elasearch-02
logstash-2015.09.16 1 p STARTED    9861 3.8mb 10.172.0.19 elasearch-01
logstash-2015.09.16 1 r UNASSIGNED
logstash-2015.09.16 4 p STARTED    9833 3.7mb 10.172.0.20 elasearch-02
logstash-2015.09.16 4 r STARTED    9833 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 4 r UNASSIGNED

So I think I got into a split-brain situation ...
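(For reference, a sketch of the usual Zen discovery guidance, assuming both nodes are master-eligible: minimum_master_nodes should be a quorum, i.e. (master-eligible nodes / 2) + 1, which for two nodes is 2. Note that this also means the cluster becomes unavailable whenever either node is down, which is why three nodes are generally recommended for a resilient cluster.)

```yaml
# Sketch: quorum-based setting for two master-eligible nodes.
# A majority of 2 is 2, so the cluster stops electing a master
# if either node is lost -- safe against split brain, but not HA.
discovery:
  zen:
    minimum_master_nodes: 2
```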

The config now:

cluster:
  name: informatiklog
discovery:
  zen:
    minimum_master_nodes: 1
    ping:
      multicast:
        enabled: false
      timeout: 30s
      unicast:
        hosts:
          - elasearch-01
          - elasearch-02
gateway:
  expected_nodes: 2
  recover_after_nodes: 2
  recover_after_time: 5m
http:
  host: 127.0.0.1
index:
  number_of_replicas: 2
node:
  name: elasearch-01
path:
  data: /usr/share/elasticsearch/data/log-fb
transport:
  host: 10.172.0.19

cu denny

hi,

I was able to fix it ... maybe. I misunderstood the "number_of_replicas" parameter. I was thinking of two hosts, so I wrote "2", but that isn't correct: there is one primary plus one replica, so the correct setting is "number_of_replicas: 1". After deleting the whole index again and changing Logstash a bit, I got everything up and running. The only thing is, I don't know whether everything is correct now.

cu denny

You're right that a two-node cluster shouldn't have more than one replica, but that alone couldn't have been the reason for your cluster being red. A replica count above what the cluster can handle results in a yellow cluster, which isn't critical.

Note that replica counts can be dynamically modified for any index at any time.
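For example, with the index update settings API (a sketch against a live cluster; the logstash index name is taken from the _cat/shards output above):

```shell
# Lower the replica count for all existing indices to 1
curl -XPUT 'http://localhost:9200/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'

# Or for a single index only
curl -XPUT 'http://localhost:9200/logstash-2015.09.16/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'
```

New indices still pick up the number_of_replicas from elasticsearch.yml (or an index template), so the static setting should be fixed as well.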
