MergeException due to illegal state in PerFieldPostingsFormat (ES 6.3.0)

Greetings,

I have a complex integration test that has been failing systematically since we upgraded the Elasticsearch cluster to 6.3.0 (Lucene 7.3.1).

The exact same test against an Elasticsearch 6.2.4 cluster (Lucene 7.2.1) succeeds.

Basically, the test submits concurrent indexing and bulk indexing requests. At some point a merge exception is raised, which ends in an Elasticsearch shard failure (cluster health status goes RED).
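
To give an idea of the traffic shape, a very rough sketch is below. This is not the actual test: the index and type names (nuxeo, doc) are taken from the log further down, while the field name, document bodies and concurrency level are made up for illustration.

#!/usr/bin/env bash
# Very rough sketch of the traffic shape (hypothetical, not the real test):
# a few writers issue single-document index requests while others send bulk
# requests against the same index, all concurrently.
ES=http://localhost:9200

single_index() {
  for i in $(seq 1 1000); do
    curl -s -XPUT "$ES/nuxeo/doc/single-$1-$i" \
         -H 'Content-Type: application/json' \
         -d '{"content": "some text to be analyzed"}' > /dev/null
  done
}

bulk_index() {
  for i in $(seq 1 200); do
    { for j in $(seq 1 50); do
        echo "{\"index\":{\"_index\":\"nuxeo\",\"_type\":\"doc\",\"_id\":\"bulk-$1-$i-$j\"}}"
        echo "{\"content\":\"some other text to be analyzed\"}"
      done; } | curl -s -XPOST "$ES/_bulk" \
                     -H 'Content-Type: application/x-ndjson' \
                     --data-binary @- > /dev/null
  done
}

# run a few writers of each kind concurrently, then wait for them
for w in 1 2 3 4; do single_index "$w" & bulk_index "$w" & done
wait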

The Elasticsearch and OS logs look normal until the merge exception below.

So far I have not been able to reproduce the problem outside of this dockerized test environment, which runs on a specific slave.

I would like some guidance to help categorize this problem.

Regards

ben

[2018-06-26T10:12:36,417][INFO ][o.e.c.m.MetaDataMappingService] [dZQ-7Yb] [nuxeo/6eYqvs-BS0K4X01NkwrVRQ] update_mapping [doc]
[2018-06-26T10:12:59,907][WARN ][o.e.i.e.Engine ] [dZQ-7Yb] [nuxeo][0] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: found existing value for PerFieldPostingsFormat.format, field=note:note.fulltext, old=Lucene50, new=Lucene50
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2113) [elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.IllegalStateException: found existing value for PerFieldPostingsFormat.format, field=note:note.fulltext, old=Lucene50, new=Lucene50
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.buildFieldsGroupMapping(PerFieldPostingsFormat.java:226) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:152) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:230) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:115) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4443) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
[2018-06-26T10:12:59,913][WARN ][o.e.i.c.IndicesClusterStateService] [dZQ-7Yb] [[nuxeo][0]] marking and sending shard failed due to [shard failure, reason [merge failed]]
...
[2018-06-26T10:14:55,500][INFO ][o.e.c.r.a.AllocationService] [dZQ-7Yb] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[nuxeo][0]] ...]).

Could you describe what your integration test is doing?
Can you reproduce it with a simple Kibana dev console script we can use?

Hi David,

The integration test submits indexing and bulk indexing requests concurrently.
I can trace these requests in debug mode, but because the problem happens after a few minutes of injection, the log is huge (~640 MB) and long bulk requests are truncated, so it is not possible to create a script from this log. You can find an extract of this log attached to this JIRA ticket.

Also, I am not able to reproduce the problem on my machine running the same integration test, so even if I could script the requests with proper concurrency, I am not sure it would reproduce the problem.

Maybe the Lucene experts have an idea of what can lead to such an exception?

What does this give on the machine where you are running the test?

java -version

Elasticsearch is dockerized; it uses the image
docker.elastic.co/elasticsearch/elasticsearch:6.3.0
which comes with:

[root@8e4b63da9de9 elasticsearch]# /opt/jdk-10.0.1/bin/java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10)
OpenJDK 64-Bit Server VM (build 10.0.1+10, mixed mode)

Note that when using ES 6.2.4 (docker.elastic.co/elasticsearch/elasticsearch:6.2.4), where the problem is not present, the Java version is:

[root@7bcc20c01eee elasticsearch]# java -version
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
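
(Side note: if it is easier than exec-ing into the container, the JVM a running node actually uses can also be checked through the nodes info API, e.g.:)

# Ask the running node which JVM it uses (version, VM name, arguments);
# assumes port 9200 is published on localhost.
curl -s 'http://localhost:9200/_nodes/jvm?pretty'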

Are you running on AWS by any chance? If yes, what instance type?

Yes, the slave is on AWS; it's a c5.2xlarge instance.
Ubuntu 16.04
Linux ip-XXXX 4.4.0-1061-aws #70-Ubuntu SMP Fri May 25 21:47:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
docker-ce 18.03.1~ce-0~ubuntu
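
(For completeness, the instance type can also be confirmed from inside the VM through the standard EC2 metadata endpoint:)

# Query the EC2 instance metadata service for the instance type (e.g. c5.2xlarge)
curl -s http://169.254.169.254/latest/meta-data/instance-type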

Thanks for confirming. We are observing trouble on these instances with Java 10: https://github.com/elastic/elasticsearch-docker/issues/179.


Indeed, changing the instance type from c5.2xlarge to c4.2xlarge fixes the problem.
This confirms that docker.elastic.co/elasticsearch/elasticsearch:6.3.0 should be avoided on the Amazon C5 instance type.
It would be nice to have a Docker image with Java 8 for cases where changing the instance type is not an option.
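
For what it's worth, the official image accepts extra JVM options through the ES_JAVA_OPTS environment variable, so if a JVM-level workaround comes out of the linked issue it could be injected without building a custom image. A sketch (the container name, port mapping and heap flags are just placeholders):

# Run the official 6.3.0 image with JVM options injected via ES_JAVA_OPTS.
# The flags shown are placeholders; an actual workaround flag from the
# linked issue (if one is identified) would be appended here.
docker run -d --name es630 -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:6.3.0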
Thanks a lot.

I just had the same problem in a cluster using the official elasticsearch-oss:6.3.1 Docker image on standard CentOS 7 machines (kernel 3.x). I had to restore a backup and stay with 6.2.4.
