Forcemerge failed (java.lang.ArrayIndexOutOfBoundsException)


(Santosh Hulkund) #1

Hi,

I run Curator (forcemerge) on the previous day's indices to reduce the number of segments; however, of late I have observed the merge operation failing, which in turn brings down the cluster.

I'd appreciate any feedback (image attached).

Elasticsearch version: 6.2.2
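For reference, the nightly job is driven by a Curator action file roughly along these lines. This is a simplified sketch; the index prefix, filters, and file paths are illustrative rather than the actual configuration:

# Sketch of a Curator forcemerge action file (illustrative values)
cat > forcemerge.yml <<'EOF'
actions:
  1:
    action: forcemerge
    description: "Force merge yesterday's indices down to one segment per shard"
    options:
      max_num_segments: 1
      delay: 120
    filters:
      - filtertype: pattern
        kind: prefix
        value: cluster-metricbeat-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 1
EOF

# Run it against the usual client config (path is illustrative)
curator --config /etc/curator/curator.yml forcemerge.yml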


(David Pilato) #2

Please don't post images of text, as they are hard to read and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.


(Santosh Hulkund) #3

Well, I tried that earlier; the limit is set to 7000 characters, hence I posted the image. Is there an option to upload log files?

Truncated logs ...

[2018-06-17T03:20:58,080][WARN ][o.e.c.a.s.ShardStateAction] [master] [cluster-metricbeat-2018.06.17][0] received shard failed for shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [6mNTb2ZpThOzkm9tqRzFQQ], primary term [0], message [shard failure, reason [merge failed]], failure [NotSerializableExceptionWrapper[merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23]; nested: ArrayIndexOutOfBoundsException[23]; ]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: merge_exception: java.lang.ArrayIndexOutOfBoundsException: 23
	at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2101) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 23
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesProducer$7.longValue(Lucene70DocValuesProducer.java:552) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SingletonSortedNumericDocValues.nextValue(SingletonSortedNumericDocValues.java:73) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer$3$1.nextValue(DocValuesConsumer.java:446) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:162) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addSortedNumericField(Lucene70DocValuesConsumer.java:530) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:371) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:143) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:181) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:125) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4443) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
[2018-06-17T03:20:58,161][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:28,758][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1996] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:28,779][WARN ][o.e.c.s.MasterService    ] [master] cluster state update task [cluster_reroute(async_shard_fetch)] took [30s] above the warn threshold of 30s
[2018-06-17T03:21:28,828][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[cluster-metricbeat-2018.06.17][0]] ...]).
[2018-06-17T03:21:58,831][WARN ][o.e.d.z.PublishClusterStateAction] [master] timed out waiting for all nodes to process published state [1997] (timeout [30s], pending nodes: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}])
[2018-06-17T03:21:59,037][WARN ][o.e.c.s.MasterService    ] [master] cluster state update task [shard-started shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery][shard id [[cluster-metricbeat-2018.06.17][0]], allocation id [PCnuW12oRt67drHdQVe50A], primary term [0], message [after peer recovery]]] took [30.2s] above the warn threshold of 30s
[2018-06-17T03:22:43,438][INFO ][o.e.c.r.a.AllocationService] [master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2018-06-17T03:22:45,306][INFO ][o.e.c.s.MasterService    ] [master] zen-disco-node-failed({data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{data2}{99iDFQX-QMWiC9Og9NZB-Q}{s05h0qgvRmuqmHJatHLY-g}{10.10.10.24}{10.10.10.24:9500},}

(David Pilato) #4

I think those logs might be enough...

But if at some point you need to upload more, you can always share it on gist.github.com, for example, and paste the link here.

@jpountz do you have an idea?


(Christian Dahlqvist) #5

It looks like you have problems with the cluster state not getting updated quickly enough across the cluster. Do you have very large mappings, mappings that are updated frequently, or a lot of shards in the cluster?
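For example, a few quick checks can put numbers on that, assuming the default localhost:9200 endpoint:

# Overall health and total shard counts
curl -s 'localhost:9200/_cluster/health?pretty'

# Shards per index and per node
curl -s 'localhost:9200/_cat/shards?v'

# Rough size of the cluster state in bytes (very large mappings inflate this)
curl -s 'localhost:9200/_cluster/state' | wc -c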


(Santosh Hulkund) #6

Do you have large mappings? No.
Are mappings updated frequently? No.

Lots of shards?

Well, the total number of shards is 1608 (803 primary & 805 replica).

However, I forcemerge only the previous day's indices (i.e. 6 indices in total = 36 shards: 18 primary + 18 replica).
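For what it's worth, per index the Curator job boils down to the force merge API, i.e. something like the call below (the host and index name are just examples taken from the logs):

curl -s -X POST 'localhost:9200/cluster-metricbeat-2018.06.17/_forcemerge?max_num_segments=1'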


(Christian Dahlqvist) #7

How many nodes? Are they under heavy load?


(Santosh Hulkund) #8

6 nodes:

2 - coordinating nodes
4 - master + data (all 4 nodes are master-eligible, with the quorum set to 3)

Well, I run Curator at 3:00 AM when the load is low.
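In case it helps, the roles and the quorum setting can be confirmed like this; treat it as a sketch, with the host and the usual 6.x setting names assumed:

# Node roles (m = master-eligible, d = data, - = coordinating only) and the elected master
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,master'

# Effective discovery.zen.minimum_master_nodes (expected to be 3 with 4 master-eligible nodes)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.minimum_master_nodes'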


(Adrien Grand) #9

This exception should never happen, which suggests either a corruption or a JVM/Lucene bug. The corruption is a bit unlikely since we verify the integrity of each segment that participates in a merge before merging. Can you run CheckIndex on [cluster-metricbeat-2018.06.17][0] on this node? The command would look something like java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/. Note that you need to either stop this node or close the cluster-metricbeat-2018.06.17 index before running it.
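Spelled out as concrete steps, that would be roughly the following; the host and on-disk layout are assumptions based on a default install, so adjust the paths to yours:

# 1. Look up the index UUID that appears in the on-disk directory name
curl -s 'localhost:9200/_cat/indices/cluster-metricbeat-2018.06.17?h=uuid'

# 2. With the node stopped (or the index closed), run CheckIndex on shard 0
#    (run from the Elasticsearch home directory)
java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex \
  data/nodes/0/indices/<uuid>/0/index/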


(Santosh Hulkund) #10

Unfortunately, it's in production, so I cannot stop the node.

However, I can close the index, but what is the expected output of this command?

java -cp libs/lucene-core-7.2.1.jar org.apache.lucene.index.CheckIndex data/nodes/0/indices/{uuid of cluster-metricbeat-2018.06.17}/0/index/


(Adrien Grand) #11

It will generate a report about the consistency of the Lucene index of this shard, giving us information about a potential corruption or Lucene bug. See https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/index/CheckIndex.html.


(Santosh Hulkund) #12

Here is the output: CheckIndex of cluster-metricbeat-2018.06.17


(Santosh Hulkund) #13

@jpountz any luck with the report? Is it a config issue or a bug?


(system) #14

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.