Forcemerge failed (java.lang.ArrayIndexOutOfBoundsException)

Hi,

I run Curator (forcemerge) on the previous day's indices to reduce the number of segments; however, of late I have observed the merge operation failing, which in turn brings down the cluster.

Appreciate any feedback (image attached).

Elasticsearch version: 6.2.2
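
For context, the Curator job effectively issues the equivalent of the following force-merge request for each of the previous day's indices (the index name and max_num_segments value here are illustrative, not my exact settings):

    # force-merge one day's index down to a single segment per shard (illustrative values)
    curl -XPOST 'http://localhost:9200/cluster-metricbeat-2018.06.16/_forcemerge?max_num_segments=1'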

Please don't post images of text, as they are hardly readable and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

Well, I tried that earlier, but the limit is set to 7000 characters, hence I posted the image. Is there an option to upload log files?

Truncated logs ...

[2018-06-17T03:20:58,080][WARN ][o.e.c.a.s.ShardStateAction] [master] [cluster-metricbeat-2018.06.17][0] received shard failed for shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [6mNTb2ZpThOzkm9tqRzFQQ], primary term [0], message [shard failure, reason [merge failed]], failure [NotSerializableExceptionWrapper[merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23]; nested: ArrayIndexOutOfBoundsException[23]; ]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23
	at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2101) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 23
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesProducer$7.longValue(Lucene70DocValuesProducer.java:552) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SingletonSortedNumericDocValues.nextValue(SingletonSortedNumericDocValues.java:73) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer$3$1.nextValue(DocValuesConsumer.java:446) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:162) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addSortedNumericField(Lucene70DocValuesConsumer.java:530) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:371) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:143) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:181) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:125) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4443) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
[2018-06-17T03:20:58,161][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:28,758][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1996] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:28,779][WARN ][o.e.c.s.MasterService    ] [master] cluster state update task [cluster_reroute(async_shard_fetch)] took [30s] above the warn threshold of 30s
[2018-06-17T03:21:28,828][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:58,831][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1997] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:59,037][WARN ][o.e.c.s.MasterService    ] [master] cluster state update task [shard-started shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery][shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery]]] took [30.2s] above the warn threshold of 30s
[2018-06-17T03:22:43,438][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2018-06-17T03:22:45,306][INFO ][o.e.c.s.MasterService    ] [master] zen-disco-node-failed({data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500},}

I think those logs might be enough...

But if at some point you need to upload more, you can always share them on gist.github.com, for example, and paste the link here.

@jpountz do you have an idea?

It looks like you have problems with the cluster state not getting updated quickly enough across the cluster. Do you have very large mappings, mappings that are updated frequently, or a lot of shards in the cluster?
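
A quick way to see whether cluster-state updates are backing up (assuming the default HTTP port) is to check the pending-tasks queue and overall health:

    # pending cluster-state update tasks on the elected master
    curl -s 'http://localhost:9200/_cat/pending_tasks?v'
    # overall cluster health, including node and shard counts
    curl -s 'http://localhost:9200/_cluster/health?pretty'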

Do you have large mappings - No
Mappings updated frequently - No

Lots of shards -

Well, the total number of shards is 1608 (803 primary & 805 replica).

However, I forcemerge only the previous day's indices (i.e. a total of 6 indices = 36 shards: 18 primary + 18 replica).
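
For anyone wanting to verify similar counts, the _cat APIs report them (host and index pattern here are assumptions; adjust for your cluster):

    # per-index primary/replica shard counts
    curl -s 'http://localhost:9200/_cat/indices/cluster-metricbeat-*?v&h=index,pri,rep'
    # one line per shard across the whole cluster
    curl -s 'http://localhost:9200/_cat/shards' | wc -l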

How many nodes? Are they under heavy load?

6 nodes:

2 - Coordinating nodes
4 - Master + data (any of the 4 nodes is master-eligible, with quorum set to 3)

Well, I run Curator at 3:00 AM when the load is low.
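
For reference, the schedule is just a daily cron job; the paths below are illustrative, not my actual ones:

    # crontab entry: run the Curator forcemerge action daily at 03:00 (paths are examples)
    0 3 * * * /usr/local/bin/curator --config /etc/curator/config.yml /etc/curator/action-forcemerge.yml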

This exception should never happen, which suggests either corruption or a JVM/Lucene bug. Corruption is a bit unlikely, since we verify the integrity of each segment that participates in a merge before merging. Can you run CheckIndex on [cluster-metricbeat-2018.06.17][0] on this node? The command would look something like java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/. Note that you need to either stop this node or close the cluster-metricbeat-2018.06.17 index before running it.

Unfortunately, it's in production, so I cannot stop the node.

However, I can close the index. What is the expected output of this command?

    java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/

It will generate a report about the consistency of the Lucene index of this shard, giving us information about a potential corruption or Lucene bug. See https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/index/CheckIndex.html.
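
If it helps, the full sequence would look roughly like this (a sketch only; the {uuid} placeholder comes from _cat/indices, and CheckIndex is read-only unless you pass -exorcise, which you should not do on production data):

    # close the index so nothing writes to it during the check
    curl -XPOST 'http://localhost:9200/cluster-metricbeat-2018.06.17/_close'
    # look up the index UUID to locate its directory on disk
    curl -s 'http://localhost:9200/_cat/indices/cluster-metricbeat-2018.06.17?h=uuid'
    # run CheckIndex against shard 0 (read-only by default)
    java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid}/0/index/
    # re-open the index when done
    curl -XPOST 'http://localhost:9200/cluster-metricbeat-2018.06.17/_open'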

Here is the output: CheckIndex of cluster-metricbeat-2018.06.17

@jpountz any luck with the report? Is it a config issue or a bug?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.