I run Curator (forcemerge) on the previous day's indices to reduce the segment count; of late, however, I have observed the merge operation failing, which in turn brings down the cluster.
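For context, Curator's forcemerge action ultimately calls the _forcemerge API. A minimal sketch of the equivalent direct request (host, index name, and max_num_segments here are illustrative placeholders, not taken from my setup):

curl -XPOST "http://localhost:9200/cluster-metricbeat-2018.06.16/_forcemerge?max_num_segments=1"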
I tried that earlier, but the limit is set to 7000 characters, hence I posted the image. Is there an option to upload log files?
Truncated logs ...
[2018-06-17T03:20:58,080][WARN ][o.e.c.a.s.ShardStateAction] [master] [cluster-metricbeat-2018.06.17][0] received shard failed for shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [6mNTb2ZpThOzkm9tqRzFQQ], primary term [0], message [shard failure, reason [merge failed]], failure [NotSerializableExceptionWrapper[merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23]; nested: ArrayIndexOutOfBoundsException[23]; ]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2101) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 23
at org.apache.lucene.codecs.lucene70.Lucene70DocValuesProducer$7.longValue(Lucene70DocValuesProducer.java:552) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SingletonSortedNumericDocValues.nextValue(SingletonSortedNumericDocValues.java:73) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.codecs.DocValuesConsumer$3$1.nextValue(DocValuesConsumer.java:446) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:162) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addSortedNumericField(Lucene70DocValuesConsumer.java:530) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:371) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:143) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:181) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:125) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4443) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
[2018-06-17T03:20:58,161][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:28,758][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1996] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:28,779][WARN ][o.e.c.s.MasterService ] [master] cluster state update task [cluster_reroute(async_shard_fetch)] took [30s] above the warn threshold of 30s
[2018-06-17T03:21:28,828][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:58,831][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1997] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:59,037][WARN ][o.e.c.s.MasterService ] [master] cluster state update task [shard-started shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery][shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery]]] took [30.2s] above the warn threshold of 30s
[2018-06-17T03:22:43,438][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2018-06-17T03:22:45,306][INFO ][o.e.c.s.MasterService ] [master] zen-disco-node-failed({data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500},}
It looks like you have problems with the cluster state not being updated quickly enough across the cluster. Do you have very large mappings, mappings that are updated frequently, or a lot of shards in the cluster?
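If it helps, a few quick checks for those questions (host and index name are placeholders):

curl -s "http://localhost:9200/_cluster/health?pretty"                                  # overall status and shard totals
curl -s "http://localhost:9200/_cat/shards?h=index" | wc -l                             # total number of shards
curl -s "http://localhost:9200/cluster-metricbeat-2018.06.17/_mapping?pretty" | wc -c   # rough mapping size in bytes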
This exception should never happen, which suggests either an index corruption or a JVM/Lucene bug. Corruption is a bit unlikely, since we verify the integrity of each segment that participates in a merge before merging. Can you run CheckIndex on [cluster-metricbeat-2018.06.17][0] on this node? The command would look something like java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/. Note that you need to either stop this node or close the cluster-metricbeat-2018.06.17 index before running it.
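A rough sequence, assuming the index is closed rather than the node stopped (host, data path, and the index UUID are placeholders to be replaced with your own values):

curl -XPOST "http://localhost:9200/cluster-metricbeat-2018.06.17/_close"
curl -s "http://localhost:9200/_cat/indices/cluster-metricbeat-2018.06.17?h=uuid"       # look up the index UUID
java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/<index-uuid>/0/index/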