Two-node cluster / Logstash / Kibana 4 / Puppet: unknown cluster problems

hi,

I'm new to Elasticsearch and trying to replace my logserver with two Debian Jessie servers. I had everything (Elasticsearch (1.7.2) / Logstash (.5.4-1) / Kibana 4 (4.0.1)) running as a single-node installation via a Puppet manifest.
Now I'm trying to get it working with two nodes (two hosts), but all services are very fragile, and I think the main problem is a bad cluster configuration.
For example: Kibana 4 won't start / gives error messages, and Logstash logs a lot of messages like ":message=>"retrying failed action with response code: 503"" and so on.

Both nodes/hosts can communicate, and all required ports are open in iptables. Kibana 4 uses http://localhost:9200. For Logstash I tried both the "transport" and the "http" protocol ...

  • Host/Node 1

    MANAGED BY PUPPET


    cluster:
      name: informatiklog
      routing:
        allocation:
          awareness:
            attributes: rack
    discovery:
      zen:
        minimum_master_nodes: 1
        ping:
          multicast:
            enabled: false
          timeout: 30s
          unicast:
            hosts:
              - elasearch-01
              - elasearch-02
    gateway:
      expected_nodes: 1
      recover_after_nodes: 2
      recover_after_time: 5m
      type: local
    http:
      host: 127.0.0.1
    index:
      number_of_replicas: 2
    node:
      name: elasearch-01
    path:
      data: /usr/share/elasticsearch/data/log-fb
    transport:
      host: 10.172.0.19

  • Host/Node 2

    MANAGED BY PUPPET


    cluster:
      name: informatiklog
      routing:
        allocation:
          awareness:
            attributes: rack
    discovery:
      zen:
        minimum_master_nodes: 1
        ping:
          multicast:
            enabled: false
          timeout: 30s
          unicast:
            hosts:
              - elasearch-01
              - elasearch-02
    gateway:
      expected_nodes: 1
      recover_after_nodes: 2
      recover_after_time: 5m
      type: local
    http:
      host: 127.0.0.1
    index:
      number_of_replicas: 2
    node:
      name: elasearch-02
    path:
      data: /usr/share/elasticsearch/data/log-fb
    transport:
      host: 10.172.0.20

(Because of the body limit -> Logs: http://pastebin.com/HEfKev8z)

Physically, the nodes are VMs on two separate Proxmox hosts. The VMs are in a test state; once I get everything working (via Puppet), I'll reinstall both nodes with more RAM / CPU and disk space.
The goal is a failover setup with haproxy/nginx/keepalived ...

So, what have I done wrong?

What's the cluster's health like?

hi Magnus,

elasearch-01:~# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "informatiklog",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 17,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

I rebooted both hosts. -> "status" : "red",

Well, that certainly explains why Logstash and Kibana don't work. The question is, why isn't ES allocating any of the 17 unassigned shards? Don't the ES logs contain any clues? Were those shards present in the original cluster before you converted it to a two-node cluster? And how did you do that? I wonder if you somehow managed to keep the cluster state (i.e. metadata about the 17 shards) but scrapped the actual shard data.

hi Magnus,

you pointed me to the actual problem. I deleted the whole index and the cluster became "green". The second I reload the Kibana page, the cluster state becomes "red" again. The problem is/was the ".kibana" index. I actually don't know enough about Elasticsearch yet, so I tried some other things ... Kibana now loads correctly, but the cluster state is "yellow":

elasearch-01:~# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "informatiklog",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 6,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
} 

  curl -XGET http://localhost:9200/_cat/shards
.kibana             0 p STARTED       1 2.5kb 10.172.0.20 elasearch-02
.kibana             0 r STARTED       1 2.5kb 10.172.0.19 elasearch-01
logstash-2015.09.16 2 p STARTED    9826 3.7mb 10.172.0.20 elasearch-02
logstash-2015.09.16 2 r STARTED    9826 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 2 r UNASSIGNED
logstash-2015.09.16 0 p STARTED    9867 3.8mb 10.172.0.20 elasearch-02
logstash-2015.09.16 0 r STARTED    9867 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 0 r UNASSIGNED
logstash-2015.09.16 3 r STARTED    9897 3.7mb 10.172.0.20 elasearch-02
logstash-2015.09.16 3 p STARTED    9897 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 3 r UNASSIGNED
logstash-2015.09.16 1 r STARTED    9861 3.8mb 10.172.0.20 elasearch-02
logstash-2015.09.16 1 p STARTED    9861 3.8mb 10.172.0.19 elasearch-01
logstash-2015.09.16 1 r UNASSIGNED
logstash-2015.09.16 4 p STARTED    9833 3.7mb 10.172.0.20 elasearch-02
logstash-2015.09.16 4 r STARTED    9833 3.7mb 10.172.0.19 elasearch-01
logstash-2015.09.16 4 r UNASSIGNED

So I think I got into a split-brain situation ...
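(For reference, a sketch of the usual Zen discovery guidance, assuming both nodes are master-eligible: minimum_master_nodes should be a quorum, i.e. (master-eligible nodes / 2) + 1, which for two nodes is 2. Note that this also means the cluster becomes unavailable whenever either node is down, which is why three nodes are generally recommended for a resilient cluster.)

```yaml
# Sketch: quorum-based setting for two master-eligible nodes.
# A majority of 2 is 2, so the cluster stops electing a master
# if either node is lost -- safe against split brain, but not HA.
discovery:
  zen:
    minimum_master_nodes: 2
```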

The config now:

cluster:
  name: informatiklog
discovery:
  zen:
    minimum_master_nodes: 1
    ping:
      multicast:
        enabled: false
      timeout: 30s
      unicast:
        hosts:
          - elasearch-01
          - elasearch-02
gateway:
  expected_nodes: 2
  recover_after_nodes: 2
  recover_after_time: 5m
http:
  host: 127.0.0.1
index:
  number_of_replicas: 2
node:
  name: elasearch-01
path:
  data: /usr/share/elasticsearch/data/log-fb
transport:
  host: 10.172.0.19

cu denny

hi,

I was able to fix it ... maybe. I misunderstood the "number_of_replicas" parameter. I was thinking of two hosts, so I wrote "2", but that isn't correct: there is one primary plus one replica, so the correct setting is "number_of_replicas: 1". After deleting the whole index again and changing Logstash a bit, I got everything up and running. The only thing is, I don't know whether everything is correct now.

cu denny

You're right that a two-node cluster shouldn't have more than one replica, but that alone couldn't have been the reason for your cluster being red. A replica count above what the cluster can handle results in a yellow cluster, which isn't critical.

Note that replica counts can be dynamically modified for any index at any time.
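For example, with the index update settings API (a sketch against a live cluster; the logstash index name is taken from the _cat/shards output above):

```shell
# Lower the replica count for all existing indices to 1
curl -XPUT 'http://localhost:9200/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'

# Or for a single index only
curl -XPUT 'http://localhost:9200/logstash-2015.09.16/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'
```

New indices still pick up the number_of_replicas from elasticsearch.yml (or an index template), so the static setting should be fixed as well.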
