ES loses data when the cluster is deleting an index, as shown below.
How can I fix this problem?
I mean, when I delete the index, the amount of data being collected also drops. The picture shows the amount of data collected per minute.
The cluster has four nodes; one of them is configured as a coordinating-only node:
node.master: false
node.data: false
node.ingest: false
Sorry, I didn't make it clear. I mean, when I delete the index, the amount of data being collected also drops. The picture shows the amount of data collected per minute.
Is there anything in the logs around that time? If indexing slows down at that time, this generally does not mean that data will be lost, as clients will generally retry on failure or timeout.
ES logs:
[2018-12-20T04:02:40,847][INFO ][o.e.c.m.MetaDataDeleteIndexService] [iom] [filebeat-2018.12.12/AzYCjS0jTsy3ovBI8An13A] deleting index
[2018-12-20T04:03:10,880][WARN ][o.e.d.z.PublishClusterStateAction] [iom] timed out waiting for all nodes to process published state [3888] (timeout [30s], pending nodes: [{iom}{XWu3adjESaGrlDdqONE4-A}{SKwUJfhaT8GeO31NxV-Fbw}{172.31.24.86}{172.31.24.86:9300}, {iom}{ZEh8uJqGSV2EytseLQSz4w}{82xwXo0eQNmVV6aWfTXVTw}{172.31.24.88}{172.31.24.88:9300}])
[2018-12-20T04:05:10,947][DEBUG][o.e.a.a.i.d.TransportDeleteIndexAction] [iom] failed to delete indices [[[metricbeat-2018.12.12/S0asWes4TZyGJ6XCCjg76g]]]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-index [[metricbeat-2018.12.12/S0asWes4TZyGJ6XCCjg76g]]) within 2m
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.6.3.jar:5.6.3]
at java.util.ArrayList.forEach(ArrayList.java:1255) ~[?:1.8.0_151]
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.6.3.jar:5.6.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.3.jar:5.6.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-12-20T04:11:00,175][WARN ][o.e.c.s.ClusterService ] [iom] cluster state update task [delete-index [[filebeat-2018.12.12/AzYCjS0jTsy3ovBI8An13A]]] took [8.3m] above the warn threshold of 30s
Our cluster deletes old indices at 4:00 am every day; the logs above show the cluster deleting the index filebeat-2018.12.12 (the cluster keeps 7 days of data).
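For reference, a nightly cleanup like this is often just a cron job along these lines. This is only a sketch: the `localhost:9200` endpoint is a placeholder, GNU `date` is assumed, and the `filebeat-YYYY.MM.DD` naming pattern is taken from the log lines above.

```shell
#!/bin/sh
# Compute the name of the index that has aged out of the 7-day window.
# Index names follow the filebeat-YYYY.MM.DD pattern seen in the logs;
# GNU date is assumed for the '-d' flag.
OLD_INDEX="filebeat-$(date -d '7 days ago' +%Y.%m.%d)"

# The actual delete call (commented out here; the endpoint is a placeholder):
# curl -s -XDELETE "http://localhost:9200/${OLD_INDEX}"
echo "would delete index: ${OLD_INDEX}"
```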
It looks like it is timing out publishing the cluster state. Is the cluster under heavy load? Do you see any long or frequent GC? Do you have minimum_master_nodes set to 2 given that you appear to have 3 master-eligible nodes?
It seems that our cluster really is heavily loaded: each ES node has 16 CPU cores, but the CPU load is above 10, and on one node it is around 16.
GC is rare, though; each ES node has 64 GB of memory, and the ES heap is configured to 31 GB.
minimum_master_nodes is set to 2.
It is quite possible that the cluster is overloaded and not able to process and distribute changes to the cluster state fast enough. One way to address this would be to scale out the cluster and distribute the load across more hosts. You may also benefit from introducing 3 small dedicated master nodes. These typically do not use a lot of resources, but as they do not hold data or serve traffic they do not get overloaded and can focus on managing the cluster state.
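For reference, a dedicated master node in 5.x would be configured roughly like this in elasticsearch.yml. This is a sketch only; discovery hosts and heap sizing depend on the environment, and the quorum value assumes the three dedicated masters become the only master-eligible nodes.

```yaml
# elasticsearch.yml for a dedicated master-eligible node (ES 5.x settings):
# it can become master but holds no data and runs no ingest pipelines.
node.master: true
node.data: false
node.ingest: false

# With 3 master-eligible nodes, the quorum is 2.
discovery.zen.minimum_master_nodes: 2
```

The existing data nodes would then be set to `node.master: false` so that cluster-state management moves entirely onto the lightly loaded dedicated masters.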