New ELK environment stopped working

Our newly created Kibana instance suddenly fails to connect to ES, as described here: Kibana
Before Xmas it was working fine with a few days' worth of data; since then only data collection has been going on. Now Kibana complains about a connection timeout to ES. But I can connect to the ES kopf or head plugins and see our indices etc., and the various logstash instances on every ES node are logging data to ES, though a central logstash indexer seems to overflow with input. I'm assuming it also has issues with its ES connection, so for now I've stopped that logstash indexer. I've also tried restarting the whole ES cluster plus Kibana, but no change: Kibana still times out. I tried deleting all but the last two days of indices and restarting the whole ES cluster, so the data volume was down to roughly the size it was before Xmas, but Kibana still times out. I redirected Kibana to another, smaller, empty ES cluster and it connects fine, so I believe it's somehow our ES cluster that's having issues; I'm just wondering what/why. Any hints appreciated, TIA!

Is there anything in the Elasticsearch logs? What does the cluster look like (nodes, memory, amount of Java heap, indices, shards, amount of data)?

No, nothing in the ES logs indicates any issues IMHO. The cluster is running on 14 data nodes (8 cores, 64GB memory each), with an 8GB heap for ES since the nodes' primary function is running a Cassandra JVM (though the Cassandra app is mostly idling so far). I've cut the indices down to two days' worth of data in 30+ indices and ~300 shards.

[root@d1r1n1 ~]# curl -XGET "http://`hostname`:9200/_cluster/health?pretty"
{
"cluster_name" : "mxes_data",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 14,
"number_of_data_nodes" : 14,
"active_primary_shards" : 141,
"active_shards" : 282,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

How are you connecting Kibana to the cluster (through one of the nodes, using a client node or perhaps through a load balancer or proxy)? How much data do you have in the cluster? How large are the shards? What latencies do you see if you run typical aggregations against the cluster outside Kibana?

Through one of the 14 data nodes; we only have those 14 data nodes, and any one of them can be elected master.
So far I've cut down to 12 indices, 112 shards, ~31M docs, < 5GB in total. The ES nodes use less than 1GB of the assigned max 8GB heap according to the kopf/head plugins, though the JVM's virtual size seems high at ~18GB; RSS is only ~1GB.

How would I see the shard sizes?
I don't know about latencies or how to run aggregations outside Kibana; through curl, for example?
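For reference, a date-histogram aggregation similar to what Kibana issues can be run with plain curl roughly as follows; the index pattern and the @timestamp field name are assumptions based on the index names shown later in the thread:

curl -XGET "http://d1r1n1.nat.tdcfoo:9200/cassandra-*/_search?pretty" -d '
{
  "size": 0,
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "@timestamp", "interval": "hour" }
    }
  }
}'

The "took" value in the response gives the query latency in milliseconds.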

kibana.yml points to one ES data node like this:
elasticsearch.url: "http://d1r1n1.nat.tdcfoo:9200"

Maybe I ought to go through a load balancer for the clients inserting docs into ES as well... though Beats/Logstash know how to use multiple nodes for ES output...
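As a sketch, a Logstash elasticsearch output can list several data nodes directly; the host names below are taken from this cluster, the rest is assumed:

output {
  elasticsearch {
    # requests are spread across (and fail over between) the listed nodes
    hosts => ["d1r1n1.nat.tdcfoo:9200", "d1r1n2.nat.tdcfoo:9200", "d1r1n3.nat.tdcfoo:9200"]
  }
}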

Looking at the shard stats from kopf, it says for example that the largest index, cassandra-YYYY-MM-DD with JMX data, has ~2.3M docs and a store.size of ~170MB.

Kibana does not currently allow multiple nodes to be specified, so a common solution is to deploy Kibana together with a client node in order to spread the load across the cluster. For a cluster of that size I would recommend having 3 dedicated master nodes. Having 14 master-eligible nodes is excessive; 3 should be sufficient even if they are not dedicated.
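As a rough sketch (Elasticsearch 2.x settings assumed), a client node colocated with Kibana and a dedicated master node differ only in two elasticsearch.yml settings:

# client (coordinating-only) node running next to Kibana: joins the cluster but holds no data and is never master
node.master: false
node.data: false

# dedicated master node: master-eligible but holds no data
node.master: true
node.data: false

Kibana would then point at the local client node in kibana.yml:

elasticsearch.url: "http://localhost:9200"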

You can see index and shard statistics through the cat shards and cat indices APIs.
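For example, to see per-index and per-shard sizes (host name taken from the curl commands earlier in the thread):

curl "http://d1r1n1.nat.tdcfoo:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.store.size"
curl "http://d1r1n1.nat.tdcfoo:9200/_cat/shards?v"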

With 14 master eligible nodes, can you confirm you have minimum_master_nodes correctly set to 8?
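As a sketch, the value each node was started with can be checked via the nodes info API, and the setting can also be adjusted at runtime through the cluster settings API:

# show the settings each node loaded from elasticsearch.yml, including discovery.zen.minimum_master_nodes
curl "http://d1r1n1.nat.tdcfoo:9200/_nodes/settings?pretty"

# update it dynamically without a restart
curl -XPUT "http://d1r1n1.nat.tdcfoo:9200/_cluster/settings" -d '
{
  "persistent": { "discovery.zen.minimum_master_nodes": 8 }
}'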

The cat API works fine from an ES node:

[root@d1r1n1 ~]# curl http://d1r1n1.nat.tdcfoo:9200/_cat/shards
collectd-2016.01.01 3 p STARTED 247950 28.9mb 10.45.70.107 d1r1n7
collectd-2016.01.01 3 r STARTED 247950 28.9mb 10.45.70.101 d1r1n1
collectd-2016.01.01 2 p STARTED 248193 28.8mb 10.45.70.102 d1r1n2
collectd-2016.01.01 2 r STARTED 248193 28.8mb 10.45.70.112 d1r1n12
collectd-2016.01.01 4 r STARTED 248281 28.8mb 10.45.70.113 d1r1n13
collectd-2016.01.01 4 p STARTED 248281 28.8mb 10.45.70.105 d1r1n5
collectd-2016.01.01 1 p STARTED 247813 28.8mb 10.45.70.104 d1r1n4
collectd-2016.01.01 1 r STARTED 247813 28.8mb 10.45.70.108 d1r1n8
collectd-2016.01.01 0 r STARTED 246794 28.6mb 10.45.70.110 d1r1n10
collectd-2016.01.01 0 p STARTED 246794 28.7mb 10.45.70.111 d1r1n11
cassandra-2016.01.02 3 p STARTED 1017461 74.8mb 10.45.70.108 d1r1n8
cassandra-2016.01.02 3 r STARTED 1017461 74.8mb 10.45.70.105 d1r1n5
cassandra-2016.01.02 2 p STARTED 1015395 74.6mb 10.45.70.107 d1r1n7
cassandra-2016.01.02 2 r STARTED 1015395 74.6mb 10.45.70.111 d1r1n11
cassandra-2016.01.02 4 p STARTED 1016612 74.7mb 10.45.70.110 d1r1n10
cassandra-2016.01.02 4 r STARTED 1016612 74.9mb 10.45.70.114 d1r1n14
cassandra-2016.01.02 1 p STARTED 1018421 74.8mb 10.45.70.106 d1r1n6
cassandra-2016.01.02 1 r STARTED 1018421 74.8mb 10.45.70.101 d1r1n1
cassandra-2016.01.02 0 p STARTED 1019313 74.9mb 10.45.70.113 d1r1n13
cassandra-2016.01.02 0 r STARTED 1019313 74.9mb 10.45.70.103 d1r1n3
collectd-2016.01.02 3 r STARTED 106897 12.7mb 10.45.70.104 d1r1n4
collectd-2016.01.02 3 p STARTED 106897 12.7mb 10.45.70.111 d1r1n11
collectd-2016.01.02 2 r STARTED 106886 12.7mb 10.45.70.109 d1r1n9
collectd-2016.01.02 2 p STARTED 106886 12.7mb 10.45.70.101 d1r1n1
collectd-2016.01.02 4 r STARTED 106941 12.7mb 10.45.70.112 d1r1n12
collectd-2016.01.02 4 p STARTED 106941 12.7mb 10.45.70.105 d1r1n5
collectd-2016.01.02 1 r STARTED 106532 12.6mb 10.45.70.102 d1r1n2
collectd-2016.01.02 1 p STARTED 106532 12.6mb 10.45.70.103 d1r1n3
collectd-2016.01.02 0 p STARTED 106922 12.7mb 10.45.70.113 d1r1n13
collectd-2016.01.02 0 r STARTED 106922 12.7mb 10.45.70.114 d1r1n14
cassandra-2016.01.01 3 p STARTED 2353092 172.2mb 10.45.70.106 d1r1n6
cassandra-2016.01.01 3 r STARTED 2353092 172mb 10.45.70.114 d1r1n14
cassandra-2016.01.01 2 p STARTED 2357139 172.5mb 10.45.70.107 d1r1n7
cassandra-2016.01.01 2 r STARTED 2357139 172.5mb 10.45.70.110 d1r1n10
cassandra-2016.01.01 4 p STARTED 2354945 172.1mb 10.45.70.109 d1r1n9
cassandra-2016.01.01 4 r STARTED 2354945 171.9mb 10.45.70.103 d1r1n3
cassandra-2016.01.01 1 p STARTED 2355439 171.1mb 10.45.70.104 d1r1n4
cassandra-2016.01.01 1 r STARTED 2355439 171.1mb 10.45.70.108 d1r1n8
...

The cat API seems to hang / never answer from my Kibana VM, even though a TCP connect works...

[root@kibana tmp]# curl http://d1r1n1.nat.tdcfoo:9200/_cat/shards
^C
[root@kibana tmp]# telnet d1r1n1.nat.tdcfoo 9200
Trying 10.45.70.101...
Connected to d1r1n1.nat.tdcfoo.
Escape character is '^]'.
quit
Connection closed by foreign host.
[root@kibana tmp]#

Yes, I've got minimum_master_nodes set to 8:

# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 8
gateway.expected_nodes: 14
gateway.recover_after_time: 10m
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-gateway.html>
#
# --------------------------------- Discovery ----------------------------------
#
# Elasticsearch nodes will find each other via unicast, by default.
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["d1r1n1", "d1r1n2", "d1r1n3", "d1r1n4", "d1r1n5", "d1r1n6", "d1r1n7", "d1r1n8", "d1r1n9", "d1r1n10", "d1r1n11", "d1r1n12", "d1r1n13", "d1r1n14"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 8

It seems no REST call from either my Kibana VM or the central logstash VM gets answered, but locally from an ES node it works...

[root@logstash log]# curl http://d1r1n1.nat.tdcfoo:9200/_cat/shards
^C
[root@logstash log]# telnet d1r1n1.nat.tdcfoo 9200
Trying 10.45.70.101...
Connected to d1r1n1.nat.tdcfoo.
Escape character is '^]'.
GET /_cat/shards HTTP/1.0

(just hangs here, no reply...)

The ES log can be provoked into showing this by a simple interrupted telnet connection:

[2016-01-02 11:38:26,876][WARN ][http.netty               ] [d1r1n1] Caught exception while handling client http traffic, closing connection [id: 0x343b4cd5, /10.45.70.62:42228 => /10.45.70.101:9200]
java.lang.IllegalArgumentException: empty text
        at org.jboss.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:89)
...

So my client VMs can make TCP connections, but they don't seem to get any answers from ES, hence Kibana eventually times out. Yet nothing really changed during the last week, except that more data got collected into the ES cluster.
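One way to confirm whether the ES node actually sends replies that never reach the client would be a packet capture on both ends; a sketch, with the interface name being an assumption:

# run on the Kibana/logstash VM and on d1r1n1; compare what leaves the server with what arrives at the client
tcpdump -nn -i eth0 host 10.45.70.101 and tcp port 9200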

Finally figured it out. It turned out that one of our hypervisor nodes, the one my Kibana VM was running on, had somehow reverted an Open vSwitch to the default MTU of 1500 instead of the 9000 the Kibana VM was using, hence TCP replies never came back from the ES nodes, which also use an MTU of 9000 :slight_smile:
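For reference, a jumbo-frame path like this can be verified with a non-fragmenting ping sized for a 9000-byte MTU (8972 bytes of ICMP payload plus 28 bytes of IP/ICMP headers); the interface name below is an assumption:

# from the Kibana VM; this fails if any hop on the path cannot carry 9000-byte frames
ping -M do -s 8972 d1r1n1.nat.tdcfoo

# check the MTU configured on the interface itself
ip link show eth0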