Elasticsearch merging woes

Howdy, we're having some trouble with Elasticsearch bulk indexing, which
slows to a virtual stand-still about once a week or more often. We're
using a cached data structure that is backed by Elasticsearch, so I don't
know exactly how often we're hitting it. The cached data structure is
updated thousands of times per second, and whenever there is a cache miss,
Elasticsearch is queried and updated.

We're running a four-node cluster using Elasticsearch 0.19.8.

Right before things slow to a stand-still, we see these warnings in the
logs:

[19:43:57,851][WARN ][index.merge.scheduler ] [Droom, Doctor Anthony] [analytics][11] failed to merge
java.io.FileNotFoundException: _guwj_1.del
    at org.elasticsearch.index.store.Store$StoreDirectory.fileLength(Store.java:448)
    at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:303)
    at org.apache.lucene.index.MergePolicy$OneMerge.totalBytesSize(MergePolicy.java:174)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:81)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

The cluster then slows, although it continues to report "green" status for
all indices and respond to searches.

It's always a ".del" file, and it's always a different one. I have no idea
what that means, but I should note that we don't delete data on an ongoing
basis on this cluster. The shard number, 11 in this case, always changes.
Without exception, nodes that have these messages at the end of their logs
do not respond to SIGTERM and must be forcibly killed, while other nodes
are just fine. After a restart, the cluster is always happy and responsive
again.
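
To confirm that the failures really do move between shards and always
involve ".del" files, the warnings can be tallied from each node's log. What
follows is a minimal Python sketch under stated assumptions: the function
names are made up for illustration, and the regular expression assumes the
log layout shown above (an "[index][shard] failed to merge" line followed by
the exception class and the missing file name).

```python
import re
from collections import Counter

# Matches "[index][shard] failed to merge" plus the exception line after it.
# \s+ is used between the parts so a wrapped log header still matches.
MERGE_FAILURE = re.compile(
    r"\[(?P<index>[^\]\s]+)\]\[(?P<shard>\d+)\]\s+failed to merge\s+"
    r"(?P<exception>[\w.$]+):\s+(?P<file>\S+)"
)


def merge_failures(log_text):
    """Return one (index, shard, exception, file) tuple per failed merge."""
    return [
        (m["index"], int(m["shard"]), m["exception"], m["file"])
        for m in MERGE_FAILURE.finditer(log_text)
    ]


def summarize(failures):
    """Count failures per (index, shard) and per missing-file extension."""
    by_shard = Counter((index, shard) for index, shard, _, _ in failures)
    by_ext = Counter(name.rsplit(".", 1)[-1] for _, _, _, name in failures)
    return by_shard, by_ext


if __name__ == "__main__":
    sample = (
        "[19:43:57,851][WARN ][index.merge.scheduler ] [Droom, Doctor Anthony]\n"
        "[analytics][11] failed to merge\n"
        "java.io.FileNotFoundException: _guwj_1.del\n"
    )
    by_shard, by_ext = summarize(merge_failures(sample))
    print(by_shard)
    print(by_ext)
```

Running it over the logs of every node would show whether a particular
shard or node is over-represented.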

Here's the result of curling localhost:9200/_settings. As you can see, we
have two replicas on the analytics index.

{
  "ad_activity_test" : {
    "settings" : {
      "index.number_of_shards" : "12",
      "index.number_of_replicas" : "1",
      "index.version.created" : "190899"
    }
  },
  "ad_activity" : {
    "settings" : {
      "index.number_of_shards" : "12",
      "index.number_of_replicas" : "1",
      "index.version.created" : "190899"
    }
  },
  "analytics_20130207" : {
    "settings" : {
      "index.number_of_shards" : "12",
      "index.number_of_replicas" : "2",
      "index.version.created" : "190899",
      "index.routing.allocation.total_shards_per_node" : "12"
    }
  },
  "analytics" : {
    "settings" : {
      "index.merge.policy.segments_per_tier" : "5",
      "index.number_of_replicas" : "2",
      "index.version.created" : "190899",
      "index.number_of_shards" : "12",
      "index.routing.allocation.total_shards_per_node" : "12"
    }
  },
  "activity" : {
    "settings" : {
      "index.number_of_replicas" : "2",
      "index.version.created" : "190899",
      "index.number_of_shards" : "12",
      "index.routing.allocation.total_shards_per_node" : "12"
    }
  }
}
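
The shard and replica counts can be read out of that response mechanically.
This is a rough Python sketch, assuming the flat "index.number_of_shards"
and "index.number_of_replicas" keys shown above; the function name is
hypothetical and the sample is a trimmed copy of the output.

```python
import json


def replica_summary(settings_json):
    """Map each index name to (shards, replicas, total shard copies)."""
    summary = {}
    for index, body in json.loads(settings_json).items():
        settings = body["settings"]
        shards = int(settings["index.number_of_shards"])
        replicas = int(settings["index.number_of_replicas"])
        # Each primary shard exists once plus once per replica.
        summary[index] = (shards, replicas, shards * (1 + replicas))
    return summary


if __name__ == "__main__":
    sample = """
    {
      "ad_activity": {"settings": {"index.number_of_shards": "12",
                                   "index.number_of_replicas": "1"}},
      "analytics": {"settings": {"index.number_of_shards": "12",
                                 "index.number_of_replicas": "2"}}
    }
    """
    for name, (shards, replicas, copies) in sorted(replica_summary(sample).items()):
        print(name, shards, replicas, copies)
```

With 12 shards and 2 replicas, the analytics index has 12 x 3 = 36 shard
copies spread over the four nodes.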

Does anyone have any ideas? Is this a better question for the Lucene
mailing list? Thanks for any help.

Best,
Josh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hm... looks like Elasticsearch TrackingConcurrentMergeScheduler is
affected by https://issues.apache.org/jira/browse/LUCENE-3051

Which Java JVM are you running?

Jörg

--

That sure looks like a similar issue, but Elasticsearch 0.19.8 depends on
Lucene 3.6.0, and I see that the issue you mention is fixed in 3.2. We're
using Oracle Java, version "1.6.0_37".

On Tuesday, February 12, 2013 12:41:08 PM UTC-6, Jörg Prante wrote:

Hm... looks like Elasticsearch TrackingConcurrentMergeScheduler is
affected by https://issues.apache.org/jira/browse/LUCENE-3051

Which Java JVM are you running?

Jörg

--

The bug was fixed by ensuring an IndexWriter synchronization on each
SegmentInfos usage, but this does not necessarily cover the Elasticsearch
code in TrackingConcurrentMergeScheduler. There is a long history behind
accessing SegmentInfos; see also
https://issues.apache.org/jira/browse/LUCENE-1175

I'm not sure what is happening. It looks like a rare thread
synchronization issue. Maybe it is induced by certain JVMs under heavy
load, maybe not. The exception was reported a few times to the list in the
last year.

It is a good idea to run CheckIndex on the affected node to find out if
the index is broken.

index.shard.check_on_startup: true
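
CheckIndex can also be run by hand against a single shard's Lucene
directory while the node is stopped. The sketch below is only an
illustration in Python: it assembles the command line and does not execute
it. The helper name is made up, and the jar path and shard directory in the
example are placeholders that have to be replaced with the real locations
on the affected node. org.apache.lucene.index.CheckIndex is the class
shipped in lucene-core.

```python
import shlex

CHECK_INDEX_CLASS = "org.apache.lucene.index.CheckIndex"


def check_index_command(lucene_core_jar, shard_index_dir, fix=False):
    """Build the argv list for a CheckIndex run (read-only unless fix=True)."""
    cmd = [
        "java",
        # Literal JVM flag: the trailing "..." enables assertions for the
        # package and all of its subpackages.
        "-ea:org.apache.lucene...",
        "-cp", lucene_core_jar,
        CHECK_INDEX_CLASS,
        shard_index_dir,
    ]
    if fix:
        # -fix drops segments that cannot be read, losing their documents.
        cmd.append("-fix")
    return cmd


if __name__ == "__main__":
    # Placeholder paths, not real locations.
    cmd = check_index_command("/path/to/lucene-core-3.6.0.jar",
                              "/path/to/shard/index")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running without -fix first is the safer choice, since it only reports
problems and leaves the index untouched.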

Updating from Java 6 to Java 7 is a shot in the dark, but maybe it improves
the situation.

Best regards,

Jörg

On 12.02.13 20:10, Josh Bronson wrote:

That sure looks like a similar issue, but Elasticsearch 0.19.8 depends
on Lucene 3.6.0, and I see that the issue you mention is fixed in 3.2.
We're using Oracle Java, version "1.6.0_37".

--

I put a response on LUCENE-3051 ("don't call SegmentInfo.sizeInBytes for
the merging segments"), and opened a new issue (LUCENE-4775) to try to
reduce this trap.

Mike

http://blog.mikemccandless.com

On Tue, Feb 12, 2013 at 2:54 PM, Jörg Prante joergprante@gmail.com wrote:

The bug was fixed by ensuring an IndexWriter synchronization on each
SegmentInfos usage, but this does not necessarily cover the Elasticsearch
code in TrackingConcurrentMergeScheduler. There is a long history behind
accessing SegmentInfos; see also
https://issues.apache.org/jira/browse/LUCENE-1175

I'm not sure what is happening. It looks like a rare thread
synchronization issue. Maybe it is induced by certain JVMs under heavy
load, maybe not. The exception was reported a few times to the list in the
last year.

It is a good idea to run CheckIndex on the affected node to find out if
the index is broken.

index.shard.check_on_startup: true

Updating from Java 6 to Java 7 is a shot in the dark, but maybe it improves
the situation.

Best regards,

Jörg

--

Thanks to both of you for the help! We'll do the best we can with current
versions of Elasticsearch, and I'll look into CheckIndex. I suspect we'll
have to make our current version work, but in case it's helpful, I've
opened an issue on the Elasticsearch GitHub tracker here:

Best,
Josh

On Tuesday, February 12, 2013 2:37:56 PM UTC-6, Michael McCandless wrote:

I put a response on LUCENE-3051 ("don't call SegmentInfo.sizeInBytes for
the merging segments"), and opened a new issue (LUCENE-4775) to try to
reduce this trap.

Mike

http://blog.mikemccandless.com
