Drive Clear Issue with ES 2.4.1

(S) #1

We just set up a test environment for Graylog on Ubuntu 16.04 LTS Server: 1 Graylog server with an embedded ES node and 2 other ES-only nodes, all running ES 2.4.1. We did not make any custom tweaks to ES except changing - After running for over a week, we noticed yesterday that disk usage on the ES nodes, which had been around 68-70%, had dropped back down to 7%. In Graylog we still see the total indices, but all our messages are gone. Is this a config issue with ES 2.4.1? Did something go wrong where ES automatically flushes indices or clears drive space? We had been writing the syslogs from all nodes into Graylog, so perhaps we can gather some logs for review. We are new to ES, so we would love some help or insight into what happened here. Thanks

(Yannick Welsch) #2

Can you provide the Elasticsearch logs?

(Mark Walkom) #3

ES will not delete documents by itself.

(Christian Dahlqvist) #4

If you have just changed the cluster name to match and updated the unicast hosts list so that they can connect, it is possible that all the nodes are master eligible while the minimum_master_nodes setting is 1 instead of the required 2. If the system was designed to run on a single embedded node it is also possible (although I do not know Graylog) that the number of replicas is configured to 0. This has the potential to cause issues in case of split brain scenarios.
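To make the "required 2" concrete, here is a minimal sketch of the quorum arithmetic usually recommended for `discovery.zen.minimum_master_nodes` in ES 2.x: half the master-eligible nodes, rounded down, plus one. The helper name `minimum_master_nodes` is made up for illustration, and the count of 3 assumes all three nodes in the described setup are master eligible.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum size for a number of master-eligible nodes.

    Uses floor(n / 2) + 1, so that two separate halves of a
    partitioned cluster can never both elect a master.
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Assumption: all three nodes (1 embedded + 2 ES-only) are master eligible.
quorum = minimum_master_nodes(3)
print(f"discovery.zen.minimum_master_nodes: {quorum}")
```

With three master-eligible nodes this gives 2, so a single isolated node cannot elect itself master and start a fresh cluster state.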

If you want to run with a greater number of nodes for additional storage and processing, it might make sense to add these as pure data nodes and make sure that you have 1 replica configured, but I am not sure if the fact that you apparently have an embedded node as part of the cluster will affect this. Your best bet is probably contacting Graylog about this.

(Jörg Prante) #5

Changing the cluster name on a node makes the indices of the old cluster name inaccessible. You will get into deep problems if you do this as a "rolling change", node by node, while minimum_master_nodes is still 1, which is the default. The nodes that are still up keep using the old cluster name, while the restarted ones do not. The first node with the new cluster name will immediately create a fresh cluster state, because with minimum_master_nodes at 1 a single node is enough to elect a master. So the old indices become prone to deletion unless you reverse the change immediately.

Elasticsearch used to delete data when detecting dangling indices after some hours. Not sure if 2.4 is still doing this and how this could be an issue here, but you should check logs of the old cluster nodes for dangling indices.
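If it helps with that check, here is a rough sketch that scans log files for any line mentioning "dangling". It assumes nothing about the exact wording of the Elasticsearch log message and simply matches the word case-insensitively. The function names are made up for illustration, and the log directory in the usage note is an assumption about a typical package install.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator, Tuple

# Match the word anywhere in the line; the exact log wording is not assumed.
DANGLING = re.compile(r"dangling", re.IGNORECASE)


def find_dangling_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for every line that mentions 'dangling'."""
    for lineno, line in enumerate(lines, start=1):
        if DANGLING.search(line):
            yield lineno, line.rstrip("\n")


def scan_log_dir(log_dir: str) -> None:
    """Print matching lines from every *.log file directly inside log_dir."""
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in find_dangling_lines(handle):
                print(f"{path.name}:{lineno}: {line}")
```

Calling `scan_log_dir("/var/log/elasticsearch")` on each old cluster node would list the candidate lines; adjust the path if your logs live elsewhere.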
