Elasticsearch merging woes

Josh_Bronson_2 · February 12, 2013, 6:27pm

Howdy, we're having some trouble with Elasticsearch bulk indexing, which
appears to slow to a virtual stand-still about once a week or more. We're
using a cached data structure that is backed by Elasticsearch, so I don't
know exactly how often we're hitting it. The cached data structure is
updated thousands of times per second, and whenever there is a cache miss,
Elasticsearch is queried and updated.

We're running a four-node cluster using Elasticsearch 0.19.8.

Right before things slow to a stand-still, we see these warnings in the
logs:

[19:43:57,851][WARN ][index.merge.scheduler ] [Droom, Doctor Anthony] [analytics][11] failed to merge
java.io.FileNotFoundException: _guwj_1.del
    at org.elasticsearch.index.store.Store$StoreDirectory.fileLength(Store.java:448)
    at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:303)
    at org.apache.lucene.index.MergePolicy$OneMerge.totalBytesSize(MergePolicy.java:174)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:81)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

The cluster then slows, although it continues to report "green" status for
all indices and respond to searches.

It's always a ".del" file, and it's always a different one. I have no idea
what that means, but I should note that we don't delete data on an ongoing
basis on this cluster. The shard name, 11 in this case, always changes.
Without exception, nodes that have these messages at the end of their logs
do not respond to SIGTERM and must be forcibly killed, while other nodes
are just fine. After a restart, the cluster is always happy and responsive
again.
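
To check the observation that the missing file is always a ".del" file and that the shard keeps changing, the logs can be tallied. The following is only a rough sketch in Python: the regular expression is an assumption based on the single log excerpt above, and the function name summarize_failed_merges is made up for illustration.

```python
import re
from collections import Counter

# Assumed log shape: the WARN header from index.merge.scheduler, followed on
# the next line by the FileNotFoundException naming the missing file.
FAILED_MERGE = re.compile(
    r"\[WARN\s*\]\[index\.merge\.scheduler\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\s*failed to merge\s*"
    r"java\.io\.FileNotFoundException:\s*(?P<file>\S+)"
)


def summarize_failed_merges(log_text):
    """Count failed merges per (index, shard) and per missing-file extension."""
    shards = Counter()
    extensions = Counter()
    for match in FAILED_MERGE.finditer(log_text):
        shards[(match.group("index"), int(match.group("shard")))] += 1
        extensions[match.group("file").rsplit(".", 1)[-1]] += 1
    return shards, extensions


# Sample input copied from the excerpt above (stack trace shortened).
SAMPLE = """[19:43:57,851][WARN ][index.merge.scheduler ] [Droom, Doctor Anthony] [analytics][11] failed to merge
java.io.FileNotFoundException: _guwj_1.del
    at org.elasticsearch.index.store.Store$StoreDirectory.fileLength(Store.java:448)
"""

if __name__ == "__main__":
    shards, extensions = summarize_failed_merges(SAMPLE)
    print(dict(shards))
    print(dict(extensions))
```

Running it over the full log of each node would show whether any shard or file type other than ".del" ever appears.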

Here's the result of curling localhost:9200/_settings. As you can see, we
have two replicas on the analytics index.

{
  "ad_activity_test" : {
    "settings" : {
      "index.number_of_shards" : "12",
      "index.number_of_replicas" : "1",
      "index.version.created" : "190899"
    }
  },
  "ad_activity" : {
    "settings" : {
      "index.number_of_shards" : "12",
      "index.number_of_replicas" : "1",
      "index.version.created" : "190899"
    }
  },
  "analytics_20130207" : {
    "settings" : {
      "index.number_of_shards" : "12",
      "index.number_of_replicas" : "2",
      "index.version.created" : "190899",
      "index.routing.allocation.total_shards_per_node" : "12"
    }
  },
  "analytics" : {
    "settings" : {
      "index.merge.policy.segments_per_tier" : "5",
      "index.number_of_replicas" : "2",
      "index.version.created" : "190899",
      "index.number_of_shards" : "12",
      "index.routing.allocation.total_shards_per_node" : "12"
    }
  },
  "activity" : {
    "settings" : {
      "index.number_of_replicas" : "2",
      "index.version.created" : "190899",
      "index.number_of_shards" : "12",
      "index.routing.allocation.total_shards_per_node" : "12"
    }
  }
}
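
For anyone reading along, the replica counts in that output translate into total shard copies as follows. This is a small Python sketch, assuming the flat "index.*" key layout shown above; the function name shard_copies is hypothetical and the sample is trimmed to two of the five indices.

```python
import json


def shard_copies(settings_json):
    """Return {index: (primaries, replicas, total shard copies)}."""
    result = {}
    for name, body in json.loads(settings_json).items():
        settings = body["settings"]
        primaries = int(settings["index.number_of_shards"])
        replicas = int(settings["index.number_of_replicas"])
        # Each primary shard has `replicas` additional copies.
        result[name] = (primaries, replicas, primaries * (1 + replicas))
    return result


# Trimmed copy of the _settings output above.
SAMPLE_SETTINGS = """
{
  "ad_activity": {"settings": {
    "index.number_of_shards": "12",
    "index.number_of_replicas": "1"}},
  "analytics": {"settings": {
    "index.number_of_shards": "12",
    "index.number_of_replicas": "2"}}
}
"""

if __name__ == "__main__":
    for index, counts in sorted(shard_copies(SAMPLE_SETTINGS).items()):
        print(index, counts)
```

For analytics this gives 12 primaries with 2 replicas each, so 36 shard copies in total. Spread evenly over four nodes that would be 9 per node, which is under the total_shards_per_node limit of 12.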

Does anyone have any ideas? Is this a better question for the Lucene
mailing list? Thanks for any help.

Best,
Josh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · February 12, 2013, 6:41pm

Hm... looks like Elasticsearch TrackingConcurrentMergeScheduler is
affected by https://issues.apache.org/jira/browse/LUCENE-3051

Which JVM are you running?

Jörg

Josh_Bronson_2 · February 12, 2013, 7:10pm

That sure looks like a similar issue, but Elasticsearch 0.19.8 depends on
Lucene 3.6.0, and I see that the issue you mention is fixed in 3.2. We're
using Oracle Java, version "1.6.0_37".

jprante · February 12, 2013, 7:54pm

The bug was fixed by ensuring IndexWriter synchronization on each
SegmentInfos usage, but that does not necessarily cover the Elasticsearch
code of TrackingConcurrentMergeScheduler. There is a long history of
problems with accessing SegmentInfos; see also
https://issues.apache.org/jira/browse/LUCENE-1175

I'm not sure what is happening. It looks like a rare thread
synchronization issue. Maybe induced by certain JVMs under heavy load,
maybe not. The exception was reported a few times to the list in the last
year.

It is a good idea to run CheckIndex on the affected node to find out if
the index is broken.

index.shard.check_on_startup: true

Updating Java 6 to Java 7 is a shot in the dark, but it may improve the
situation.

Best regards,

Jörg

Michael_McCandless · February 12, 2013, 8:37pm

I put a response on LUCENE-3051 ("don't call SegmentInfo.sizeInBytes for
the merging segments"), and opened a new issue (LUCENE-4775) to try to
reduce this trap.

Mike

http://blog.mikemccandless.com

Josh_Bronson_2 · February 12, 2013, 8:59pm

Thanks to both of you for the help! We'll do the best we can with current
versions of Elasticsearch, and I'll look into CheckIndex. I suspect we'll
have to make our current version work, but in case it's helpful, I've
opened an issue on the Elasticsearch github tracker here:

github.com/elastic/elasticsearch

possibly incorrect use of Lucene OneMerge.totalBytesSize

opened 08:58PM - 12 Feb 13 UTC
closed 09:09PM - 12 Feb 13 UTC
joshbronson
>bug v0.90.0.Beta1 v0.20.5

... in TrackingConcurrentMergeScheduler. These stack traces in the logs
always precede extremely slow index times and the need to forcibly (with
SIGKILL) restart Elasticsearch:

[19:43:57,851][WARN ][index.merge.scheduler ] [Droom, Doctor Anthony] [analytics][11] failed to merge
java.io.FileNotFoundException: _guwj_1.del
    at org.elasticsearch.index.store.Store$StoreDirectory.fileLength(Store.java:448)
    at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:303)
    at org.apache.lucene.index.MergePolicy$OneMerge.totalBytesSize(MergePolicy.java:174)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:81)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

Here are some who've been bitten by this bug:

https://groups.google.com/d/topic/elasticsearch/NCLWvGEz6dk/discussion
https://groups.google.com/d/topic/elasticsearch/7a0FKmqtbnM/discussion
http://elasticsearch-users.115913.n3.nabble.com/quot-Failed-to-merge-quot-java-io-FileNotFoundException-td3654491.html
http://elasticsearch-users.115913.n3.nabble.com/failed-to-mege-exception-td4021139.html

It looks like Lucene is being used inappropriately by Elasticsearch. See
the response here:

https://issues.apache.org/jira/browse/LUCENE-3051?focusedCommentId=13576972&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13576972

We'll have to make this work with an older version of Elasticsearch, but I
thought I'd raise a flag in case Lucene's suggested workaround (calling
OneMerge.estimatedMergeBytes) is helpful.

Best,
Josh

Best,
Josh
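
The difference between re-reading file sizes during a merge and using an estimate recorded up front can be illustrated with a toy model. This is a Python sketch and not Lucene or Elasticsearch code: ToyDirectory, ToyMerge and every name in it are hypothetical, and the scenario (a file going away between merge registration and the size check) is an assumption about the kind of race discussed in this thread.

```python
class ToyDirectory:
    """In-memory stand-in for an index directory: file name -> size in bytes."""

    def __init__(self, files):
        self._files = dict(files)

    def file_length(self, name):
        if name not in self._files:
            raise FileNotFoundError(name)
        return self._files[name]

    def delete(self, name):
        self._files.pop(name, None)


class ToyMerge:
    """A merge over a fixed list of files, registered at construction time."""

    def __init__(self, directory, file_names):
        self.directory = directory
        self.file_names = list(file_names)
        # Recorded once, while the file list is known to be current.
        self.estimated_bytes = sum(
            directory.file_length(name) for name in self.file_names
        )

    def total_bytes_from_disk(self):
        # Re-reads every file, so it fails if any file is gone by now.
        return sum(self.directory.file_length(name) for name in self.file_names)


def demo():
    directory = ToyDirectory({"_a.fdt": 100, "_a_1.del": 8})
    merge = ToyMerge(directory, ["_a.fdt", "_a_1.del"])
    # Assumed concurrent change: the old deletes file is removed elsewhere.
    directory.delete("_a_1.del")
    try:
        merge.total_bytes_from_disk()
        stale_read = "ok"
    except FileNotFoundError as exc:
        stale_read = "missing: %s" % exc
    return stale_read, merge.estimated_bytes


if __name__ == "__main__":
    print(demo())
```

In this model the late size check fails once the file is gone, while the value recorded at registration stays usable. Whether that matches what actually happens inside TrackingConcurrentMergeScheduler is for the issue above to settle.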
