Our server crashed every two weeks or so, and I thought it was a hardware failure or something similar. Now it gets even worse: the server crashes nearly every two days - I think it got worse after I started triggering a query every 20 seconds to keep the index "warmed".
There were no specific messages in the ES logs or elsewhere. The memory consumption of ES was also always within the limit, and I cannot imagine that a Java program can kill the OS ...
We've investigated a lot: RAM check (no errors found), no hard-disk error logs found, enough RAM (7.5 GB available, but it crashed at 4.5 GB used), updated to the latest master of ES from April 17th, and it still keeps crashing. I didn't find an hs_err_pid log. That would be in the folder where I started ES, right?
Now I create a node stats snapshot every minute (see https://gist.github.com/926242), and the server crashed today at ~19:16. That was 10 minutes before the next optimize (I marked the first optimize with "optimize at 07:29").
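
For illustration, a minimal sketch of such a per-minute snapshot loop, assuming a single node on localhost:9200; the stats path below is the 0.x-era /_cluster/nodes/stats (newer versions expose /_nodes/stats instead), and the output file naming is made up:

import json
import time
import urllib.request

# Assumed setup: single local node; 0.x-era node stats endpoint.
STATS_URL = "http://localhost:9200/_cluster/nodes/stats"

while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    try:
        with urllib.request.urlopen(STATS_URL, timeout=10) as resp:
            stats = json.load(resp)
        # Write each snapshot to its own timestamped file (file name is an assumption).
        with open("nodestats-%s.json" % stamp, "w") as out:
            json.dump(stats, out, indent=2)
    except OSError as err:
        # Keep looping even if the node is briefly unreachable.
        print("%s: snapshot failed: %s" % (stamp, err))
    time.sleep(60)
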
Can someone see a problem with ES there? Or point me in a direction where I can try to fix the problem? Should I try a complete reindex? I'm kind of stuck now.

I've now reindexed the 6 million tweets (via scan). We'll see if this improves anything (e.g. due to a different Lucene version, different hard-disk areas, whatever).
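
A rough sketch of what such a scan-based reindex can look like over HTTP against a 0.x-era node; the host, the index names "tweets" and "tweets_v2", and the batch size are assumptions, and newer ES versions have since dropped search_type=scan:

import json
import urllib.request

ES = "http://localhost:9200"
SRC, DST = "tweets", "tweets_v2"   # hypothetical source and target index names

def post(path, body):
    # POST a string body and decode the JSON response.
    req = urllib.request.Request(ES + path, data=body.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# 1. Open a scan-type search over the source index and remember the scroll id.
scan = post("/%s/_search?search_type=scan&scroll=5m&size=500" % SRC,
            json.dumps({"query": {"match_all": {}}}))
scroll_id = scan["_scroll_id"]

# 2. Pull batches via the scroll API (scroll id sent as the raw request body)
#    and push each batch into the target index with the bulk API.
while True:
    page = post("/_search/scroll?scroll=5m", scroll_id)
    hits = page["hits"]["hits"]
    if not hits:
        break
    scroll_id = page["_scroll_id"]
    lines = []
    for hit in hits:
        lines.append(json.dumps({"index": {"_index": DST,
                                           "_type": hit["_type"],
                                           "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    post("/_bulk", "\n".join(lines) + "\n")
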

I'm out of luck now. Reindexing didn't help.
Replacing the SSD with a normal disk didn't help either: although the system didn't crash today, ES became unusable due to massive CPU usage (garbage collection, I guess, as it was at max RAM).

The interesting part is at 10:32:01, where it indeed shows that GC is very active:
collection time ~1 min, collection threads: "threads":{"count":279,"peak_count":279}
Normal was: "threads":{"count":67,"peak_count":91}
For me the system now has enough memory:
"mem":{"heap_used":"5.4gb","heap_used_in_bytes":5852028200,"heap_committed":"7gb"}
BUT GC activity seems to increase nevertheless! In the logs you'll find:
"threads":{"count":361,"peak_count":361}

Thanks for your response! Removing all the explicit JVM settings didn't help; it crashed (or was unresponsive) again.
I'll definitely try to avoid swapping, although I'm not sure whether this is memory-related at all, because I have now created a thread dump (while the node was unresponsive) in which over 450 threads had been created.
But the important point is: over 300 of them were in mvel methods! So I have now disabled the usage of mvel in my app ... we'll see tomorrow.
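
For context, the stack trace appended below goes through TermsStatsStringFacetCollector$ScriptAggregator, i.e. a terms_stats facet whose value comes from an mvel value_script. "Disabling mvel" here roughly means switching such a facet from value_script to a plain value_field. A hedged sketch of the two request bodies; the index and the field names "tag" and "retweets" are made up for illustration:

import json
import urllib.request

SEARCH_URL = "http://localhost:9200/tweets/_search"   # hypothetical index

# Facet driven by an mvel value_script - the variant that shows up in the dump.
with_script = {
    "query": {"match_all": {}},
    "facets": {
        "tag_stats": {
            "terms_stats": {
                "key_field": "tag",
                "value_script": "doc['retweets'].value"
            }
        }
    }
}

# The same facet without scripting: read the value straight from a field.
without_script = {
    "query": {"match_all": {}},
    "facets": {
        "tag_stats": {
            "terms_stats": {
                "key_field": "tag",
                "value_field": "retweets"
            }
        }
    }
}

def search(body):
    req = urllib.request.Request(SEARCH_URL, data=json.dumps(body).encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(json.dumps(search(without_script).get("facets"), indent=2))
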
What is "locking memory"? Does it mean the OS won't touch this memory and move it to swap? Or how is that different from swapoff -a?
The indication that something could be wrong with mvel comes from the thread dump mentioned before, where 70% of the >300 threads were in the state appended below (**), sometimes with the first two lines (put + addEntry), sometimes not.
One of the two dumps contains two threads which were blocked:
java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.HashMap.resize(HashMap.java:462)
        at java.util.HashMap.addEntry(HashMap.java:755)
        at java.util.HashMap.put(HashMap.java:385)
        at org.elasticsearch.common.mvel2.integration.impl.ClassImportResolverFactory.<init>(ClassImportResolverFactory.java:49)
        at org.elasticsearch.common.mvel2.compiler.CompiledExpression.getValue(CompiledExpression.java:125)

java.lang.Thread.State: BLOCKED (on object monitor)
        at org.elasticsearch.common.mvel2.integration.impl.ClassImportResolverFactory.<init>(ClassImportResolverFactory.java:49)
        at org.elasticsearch.common.mvel2.compiler.CompiledExpression.getValue(CompiledExpression.java:125)
I can also post the full dumps if you like, or we can just wait until Monday ... if it crashes again, something different is the problem.
Regards,
Peter.
**

"elasticsearch[search]-pool-3-thread-328" daemon prio=10 tid=0x00007f0d6c82a800 nid=0x4e6a runnable [0x00007f0d5a7e7000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.addEntry(HashMap.java:753)
        at java.util.HashMap.put(HashMap.java:385)
        at org.elasticsearch.common.mvel2.integration.impl.ClassImportResolverFactory.<init>(ClassImportResolverFactory.java:49)
        at org.elasticsearch.common.mvel2.compiler.CompiledExpression.getValue(CompiledExpression.java:125)
        at org.elasticsearch.script.mvel.MvelScriptEngineService$MvelSearchScript.run(MvelScriptEngineService.java:168)
        at org.elasticsearch.script.mvel.MvelScriptEngineService$MvelSearchScript.runAsDouble(MvelScriptEngineService.java:180)
        at org.elasticsearch.search.facet.termsstats.strings.TermsStatsStringFacetCollector$ScriptAggregator.onValue(TermsStatsStringFacetCollector.java:213)
        at org.elasticsearch.index.field.data.strings.MultiValueStringFieldData.forEachValueInDoc(MultiValueStringFieldData.java:79)
        at org.elasticsearch.search.facet.termsstats.strings.TermsStatsStringFacetCollector.doCollect(TermsStatsStringFacetCollector.java:126)
        at org.elasticsearch.search.facet.AbstractFacetCollector.collect(AbstractFacetCollector.java:74)
        at org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:57)
        at org.apache.lucene.search.Scorer.score(Scorer.java:90)
        at org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.score(ConstantScoreQuery.java:241)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:519)
        at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:169)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:384)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:291)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:279)
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:211)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:222)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:134)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:192)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.access$000(TransportSearchTypeAction.java:75)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:169)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Lemme ping mike from mvel about this, had a brief look at the code, might be a thread safety issue.
On Thursday, April 21, 2011 at 11:27 PM, Karussell wrote: