Kibana index losing all contents after cluster nodes rebooted

Hi All,

Need your advice on the following issue please.
Sorry if I'm posting this in the wrong section, as I'm truly unsure whether this is an Elasticsearch or a Kibana issue. However, this happens only to the .kibana index and has never happened to any of the other indices on this cluster that carry actual event data.
We are running Elasticsearch and Kibana 6.5.4 on a 4-node ELK cluster. All nodes are configured in exactly the same way, all carry all roles, and all have Kibana installed as well.

The config is as below; only the node names differ:
elasticsearch.yml
cluster.name: elasticearch
node.name: node-1
path.data: /path to/elk/data
path.logs: /path to/elk/logs
bootstrap.memory_lock: true
bootstrap.system_call_filter: false
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["host1.fqdn", "host2.fqdn", "host3.fqdn", "host4.fqdn"]
action.destructive_requires_name: true

kibana.yml
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
kibana.index: ".kibana"

The cluster was running fine for more than a month, until one weekend all nodes were rebooted in random order due to OS patching. From what I discovered, the ES/Kibana processes were not stopped gracefully before the reboot (not our usual practice, but what's done is done).

Post reboot, the .kibana index (actually the .kibana alias pointing to .kibana_1) is green... BUT its contents are EMPTY, as in all dashboards, visualizations and index patterns are gone. The size of this index has also decreased dramatically after the reboot, from ~360+ KB to about 42 KB.

The only clue I was able to fish out of the Elasticsearch logs regarding the .kibana_1 index was the "can't to issue sync id" message; the full extract is below (the 127 document count in it also looks like the correct number of documents from before everything went blank).
I googled for this message and got here: https://github.com/elastic/elasticsearch/pull/28464

[2019-07-22T04:41:37,323][INFO ][o.e.c.r.a.AllocationService] [node-1] updating number_of_replicas to [0] for indices [.kibana_1]
[2019-07-22T04:42:36,896][INFO ][o.e.c.r.a.AllocationService] [node-1] updating number_of_replicas to [1] for indices [.kibana_1]
[2019-07-22T04:46:39,824][WARN ][o.e.i.f.SyncedFlushService] [node-1] [.kibana_1][0] can't to issue sync id [WQY1b7lLTXCtpUYkskp49w] for out of sync replica [[.kibana_1][0], node[VmxzK3A2RTSG0zYV16E2VA], [R], s[STARTED], a[id=jzhXm0SUTHyoVw7DWjBkrA]] with num docs [127]; num docs on primary [3]
[2019-07-22T07:18:51,823][WARN ][o.e.i.f.SyncedFlushService] [node-1] [.kibana_1][0] can't to issue sync id [lrnwGe5dTPGItmOMptruzQ] for out of sync replica [[.kibana_1][0], node[VmxzK3A2RTSG0zYV16E2VA], [R], s[STARTED], a[id=jzhXm0SUTHyoVw7DWjBkrA]] with num docs [127]; num docs on primary [3]
[2019-07-22T08:47:28,883][WARN ][o.e.i.f.SyncedFlushService] [node-1] [.kibana_1][0] can't to issue sync id [x8k18pcmQtuNX3nqAgY3CA] for out of sync replica [[.kibana_1][0], node[VmxzK3A2RTSG0zYV16E2VA], [R], s[STARTED], a[id=jzhXm0SUTHyoVw7DWjBkrA]] with num docs [127]; num docs on primary [3]
[2019-07-22T10:26:47,943][INFO ][o.e.c.m.MetaDataMappingService] [node-1] [.kibana_1/Tf0C6Cz9RkWc-Z4Rx2OKQw] update_mapping [doc]
..
[2019-07-22T19:38:54,044][INFO ][o.e.c.r.a.AllocationService] [node-1] updating number_of_replicas to [0] for indices [.kibana_1]
[2019-07-22T19:39:00,729][WARN ][o.e.i.c.IndicesClusterStateService] [node-1] [[.kibana_1][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.kibana_1][0]: Recovery failed on {node-1}{XyQGv0MuQKCcLMin8U6DZA}{FT4V3xVVQZ268j20km26MA}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=67386126336, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
[2019-07-22T19:39:00,751][WARN ][o.e.c.r.a.AllocationService] [node-1] failing shard [failed shard, shard [.kibana_1][0], node[XyQGv0MuQKCcLMin8U6DZA], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=ptNydTOjS9C8Snful5bBTA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2019-07-22T19:38:54.039Z], delayed=false, allocation_status[deciders_throttled]], message [failed recovery], failure [RecoveryFailedException[[.kibana_1][0]: Recovery failed on {node-1}{XyQGv0MuQKCcLMin8U6DZA}{FT4V3xVVQZ268j20km26MA}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=67386126336, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogException[failed to create new translog file]; nested: FileAlreadyExistsException[/path to/data/nodes/0/indices/Tf0C6Cz9RkWc-Z4Rx2OKQw/0/translog/translog-17.tlog]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.kibana_1][0]: Recovery failed on {node-1}{XyQGv0MuQKCcLMin8U6DZA}{FT4V3xVVQZ268j20km26MA}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=67386126336, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
[2019-07-22T19:39:00,798][WARN ][o.e.i.t.Translog ] [node-1] [.kibana_1][0] deleted previously created, but not yet committed, next generation [translog-17.tlog]. This can happen due to a tragic exception when creating a new generation
[2019-07-22T19:46:16,334][INFO ][o.e.c.r.a.AllocationService] [node-1] updating number_of_replicas to [1] for indices [.kibana_1]

We had a full filesystem backup (literally everything, from the data directories to all the directories with binaries and logs) from all nodes, taken on the Friday prior to this reboot event on Saturday. We attempted restoring it, but no dice.
Once we started Elasticsearch/Kibana after the restore, the .kibana index looked like it had just been created, with absolutely zero contents.

What I'm struggling to understand is this: the last dashboards were created on Thursday, and everything was available from all nodes on Friday (users definitely used their dashboards with no issues).
No new content was created in the .kibana index on Friday and everything was present, yet after the Saturday reboot it's all gone?

Moreover, how come a full restore of the Friday state of all nodes results in complete data loss in just the .kibana index? Is it somehow special compared to the others?

We still have this full file backup from all nodes and are willing to try anything possible to get the data in the .kibana index back, as a ton of work went into those dashboards, visualizations and index patterns.
Is there anything else we can try to fix this, or have we really lost it all beyond recovery at this point?

When you restore your data directories and only start up Elasticsearch, can you check the .kibana alias and the .kibana* indices to see if there are docs in them?
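
For example, a minimal check (assuming Elasticsearch is reachable on localhost:9200, as in your kibana.yml) could look like this:

# Which index does the .kibana alias point to?
curl -s 'http://localhost:9200/_cat/aliases/.kibana*?v'
# Doc counts and sizes for all .kibana* indices
curl -s 'http://localhost:9200/_cat/indices/.kibana*?v'
# Number of saved objects behind the alias
curl -s 'http://localhost:9200/.kibana/_count?pretty'
# Peek at a few saved objects (dashboards, visualizations, index patterns)
curl -s 'http://localhost:9200/.kibana/_search?size=5&pretty'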

Then, once you've confirmed you do have docs, start up Kibana. Kibana will check whether it has to go through the saved-object migrations, which create a new index and change the alias to point to it.
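
Roughly speaking, the end result of such a migration is an alias swap along these lines (the index names here are only for illustration; the actual source and target indices on your cluster may differ):

# Illustration only: what the migration effectively does once the new index is populated
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": ".kibana_1", "alias": ".kibana" } },
    { "add":    { "index": ".kibana_2", "alias": ".kibana" } }
  ]
}'

That's also why it's worth listing all the .kibana* indices, not just the one the alias currently points to.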

Kibana may not handle the case where it starts up and tries to do the migration steps while Elasticsearch isn't ready.

I suspect this might be due to a split-brain caused by your Elasticsearch cluster not being correctly configured. As you have 4 master-eligible nodes, you must set discovery.zen.minimum_master_nodes to 3.
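
In 6.x that usually means adding discovery.zen.minimum_master_nodes: 3 to elasticsearch.yml on every node and restarting. As a sketch (assuming Elasticsearch answers on localhost:9200), the setting can also be applied to a running cluster dynamically:

# Require a quorum of 3 out of 4 master-eligible nodes before a master can be elected
curl -s -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 3
  }
}'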
