Forcemerge failed (java.lang.ArrayIndexOutOfBoundsException)

Hi,

I run Curator (forcemerge) on the previous day's indices to reduce the number of segments; however, of late I have observed the merge operation failing, which in turn brings down the cluster.

Appreciate any feedback (image attached).

Elasticsearch version: 6.2.2
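
For context, the Curator job effectively issues the equivalent of the following force-merge request for each of the previous day's indices (the index name and max_num_segments value here are illustrative, not my exact settings):

    # force-merge one day's index down to a single segment per shard (illustrative values)
    curl -XPOST 'http://localhost:9200/cluster-metricbeat-2018.06.16/_forcemerge?max_num_segments=1'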

Please don't post images of text, as they are hardly readable and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

Well, I tried that earlier, but the limit is set to 7000 characters, hence I posted the image. Is there an option to upload log files?

Truncated logs ...

[2018-06-17T03:20:58,080][WARN ][o.e.c.a.s.ShardStateAction] [master] [cluster-metricbeat-2018.06.17][0] received shard failed for shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [6mNTb2ZpThOzkm9tqRzFQQ], primary term [0], message [shard failure, reason [merge failed]], failure [NotSerializableExceptionWrapper[merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23]; nested: ArrayIndexOutOfBoundsException[23]; ]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23
	at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2101) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 23
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesProducer$7.longValue(Lucene70DocValuesProducer.java:552) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SingletonSortedNumericDocValues.nextValue(SingletonSortedNumericDocValues.java:73) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer$3$1.nextValue(DocValuesConsumer.java:446) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:162) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addSortedNumericField(Lucene70DocValuesConsumer.java:530) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:371) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:143) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:181) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:125) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4443) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
[2018-06-17T03:20:58,161][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:28,758][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1996] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:28,779][WARN ][o.e.c.s.MasterService    ] [master] cluster state update task [cluster_reroute(async_shard_fetch)] took [30s] above the warn threshold of 30s
[2018-06-17T03:21:28,828][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:58,831][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1997] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:59,037][WARN ][o.e.c.s.MasterService    ] [master] cluster state update task [shard-started shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery][shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery]]] took [30.2s] above the warn threshold of 30s
[2018-06-17T03:22:43,438][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2018-06-17T03:22:45,306][INFO ][o.e.c.s.MasterService    ] [master] zen-disco-node-failed({data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500},}

I think those logs might be enough...

But if at some point you need to upload more, you can always share them on gist.github.com, for example, and paste the link here.

@jpountz do you have an idea?

It looks like you have problems with the cluster state not getting updated quickly enough across the cluster. Do you have very large mappings, mappings that are updated frequently, or a lot of shards in the cluster?
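
A quick way to see whether cluster-state updates are backing up (assuming the default HTTP port) is to check the pending-tasks queue and overall health:

    # pending cluster-state update tasks on the elected master
    curl -s 'http://localhost:9200/_cat/pending_tasks?v'
    # overall cluster health, including node and shard counts
    curl -s 'http://localhost:9200/_cluster/health?pretty'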

Do you have large mappings - No
Mappings updated frequently - No

Lots of shards -

Well, the total number of shards is 1608 (803 primary & 805 replica).

However, I forcemerge only the previous day's indices (i.e. a total of 6 indices = 36 shards: 18 primary + 18 replica).
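
For anyone wanting to verify similar counts, the _cat APIs report them (host and index pattern here are assumptions; adjust for your cluster):

    # per-index primary/replica shard counts
    curl -s 'http://localhost:9200/_cat/indices/cluster-metricbeat-*?v&h=index,pri,rep'
    # one line per shard across the whole cluster
    curl -s 'http://localhost:9200/_cat/shards' | wc -l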

How many nodes? Are they under heavy load?

6 nodes:

2 - Coordinating nodes
4 - Master + data (any of the 4 nodes is master-eligible, with quorum set to 3)

Well, I run Curator at 3:00 AM when the load is low.
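
For reference, the schedule is just a daily cron job; the paths below are illustrative, not my actual ones:

    # crontab entry: run the Curator forcemerge action daily at 03:00 (paths are examples)
    0 3 * * * /usr/local/bin/curator --config /etc/curator/config.yml /etc/curator/action-forcemerge.yml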

This exception should never happen, which suggests either corruption or a JVM/Lucene bug. Corruption is a bit unlikely, since we verify the integrity of each segment that participates in a merge before merging. Can you run CheckIndex on [cluster-metricbeat-2018.06.17][0] on this node? The command would look something like java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/. Note that you need to either stop this node or close the cluster-metricbeat-2018.06.17 index before running it.

Unfortunately, it's in production, so I cannot stop the node.

However, I can close the index. What is the expected output of this command?

    java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/

It will generate a report about the consistency of the Lucene index of this shard, giving us information about a potential corruption or Lucene bug. See https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/index/CheckIndex.html.
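
If it helps, the full sequence would look roughly like this (a sketch only; the {uuid} placeholder comes from _cat/indices, and CheckIndex is read-only unless you pass -exorcise, which you should not do on production data):

    # close the index so nothing writes to it during the check
    curl -XPOST 'http://localhost:9200/cluster-metricbeat-2018.06.17/_close'
    # look up the index UUID to locate its directory on disk
    curl -s 'http://localhost:9200/_cat/indices/cluster-metricbeat-2018.06.17?h=uuid'
    # run CheckIndex against shard 0 (read-only by default)
    java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid}/0/index/
    # re-open the index when done
    curl -XPOST 'http://localhost:9200/cluster-metricbeat-2018.06.17/_open'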

Here is the output: CheckIndex of cluster-metricbeat-2018.06.17

@jpountz any luck with the report? Is it a config issue or a bug?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.