ElasticSearch crashes OS?


(Karussell) #1

Hi there,

our server crashed every two weeks or so and I thought it was a
hardware failure or something similar. Now it gets even worse: the
servers crashes nearly every two days - I think this gets worse after
I'm triggering a query every 20 seconds to get the index "warmed".

There were no specific messages in ES logs or elsewhere. Also memory
consumption of ES was always within the limit. I also cannot imagine
that a Java program can kill the OS ...

We've investigated a lot. RAM check (no errors found), no hard-disc
error logs found, enough RAM (7.5 available but crashed with 4.5g),
updated to latest master of ES from April, 17th and it keeps still
crashing. I didn't found a hs err pid log. This would be in the folder
where I started ES, right?

Now I have created every minute a node stats snapshot (see
https://gist.github.com/926242) and the server crashed today at ~
19.16 It was 10 minutes before the next optimize (I marked the first
optimize with optimize at 07:29)

Can someone see a problem with ES there? Or point me into a direction
where I can try to fix the problem? Should I try a complete reindex?
I'm kind of stuck now.

Regards,
Peter.


(Karussell) #2

I've now reindexed the 6mio tweets (via scan). We'll see if this
improves anything.
(E.g. due to different lucene version, different hard disc areas
whatever)

Regards,
Peter.

On 19 Apr., 00:00, Karussell tableyourt...@googlemail.com wrote:

Hi there,

our server crashed every two weeks or so and I thought it was a
hardware failure or something similar. Now it gets even worse: the
servers crashes nearly every two days - I think this gets worse after
I'm triggering a query every 20 seconds to get the index "warmed".

There were no specific messages in ES logs or elsewhere. Also memory
consumption of ES was always within the limit. I also cannot imagine
that a Java program can kill the OS ...

We've investigated a lot. RAM check (no errors found), no hard-disc
error logs found, enough RAM (7.5 available but crashed with 4.5g),
updated to latest master of ES from April, 17th and it keeps still
crashing. I didn't found a hs err pid log. This would be in the folder
where I started ES, right?

Now I have created every minute a node stats snapshot (seehttps://gist.github.com/926242) and the server crashed today at ~
19.16 It was 10 minutes before the next optimize (I marked the first
optimize with optimize at 07:29)

Can someone see a problem with ES there? Or point me into a direction
where I can try to fix the problem? Should I try a complete reindex?
I'm kind of stuck now.

Regards,
Peter.


(Karussell) #3

I'm out of luck now. Reindexing didn't help.
Also replacing the SSD disc with a normal disc didn't help: although
today the system didn't crash
ES became unuseable due to massive CPU usage (garbage collection I
guess as it had max RAM)

All in all really strange.

Regards,
Peter.

On Apr 19, 1:28 am, Karussell tableyourt...@googlemail.com wrote:

I've now reindexed the 6mio tweets (via scan). We'll see if this
improves anything.
(E.g. due to different lucene version, different hard disc areas
whatever)

Regards,
Peter.

On 19 Apr., 00:00, Karussell tableyourt...@googlemail.com wrote:

Hi there,

our server crashed every two weeks or so and I thought it was a
hardware failure or something similar. Now it gets even worse: the
servers crashes nearly every two days - I think this gets worse after
I'm triggering a query every 20 seconds to get the index "warmed".

There were no specific messages in ES logs or elsewhere. Also memory
consumption of ES was always within the limit. I also cannot imagine
that a Java program can kill the OS ...

We've investigated a lot. RAM check (no errors found), no hard-disc
error logs found, enough RAM (7.5 available but crashed with 4.5g),
updated to latest master of ES from April, 17th and it keeps still
crashing. I didn't found a hs err pid log. This would be in the folder
where I started ES, right?

Now I have created every minute a node stats snapshot (seehttps://gist.github.com/926242) and the server crashed today at ~
19.16 It was 10 minutes before the next optimize (I marked the first
optimize with optimize at 07:29)

Can someone see a problem with ES there? Or point me into a direction
where I can try to fix the problem? Should I try a complete reindex?
I'm kind of stuck now.

Regards,
Peter.


(Karussell) #4

Forgot to mention the stats log: https://gist.github.com/930842

The interesting part is at 10:32:01 where it indeed shows that gc is
very active:
collection time ~1min, collection threads: "threads":{"count":
279,"peak_count":279}
normal was: "threads":{"count":67,"peak_count":91}

For me the system now has enough memory:
"mem":{"heap_used":"5.4gb","heap_used_in_bytes":
5852028200,"heap_committed":"7gb"

BUT gc activity seems to increase nevertheless! In the logs you'll
find:
"threads":{"count":361,"peak_count":361}

Now, I will try to tune jvm GC settings ...

Regards,
Peter.

On Apr 20, 11:19 am, Karussell tableyourt...@googlemail.com wrote:

I'm out of luck now. Reindexing didn't help.
Also replacing the SSD disc with a normal disc didn't help: although
today the system didn't crash
ES became unuseable due to massive CPU usage (garbage collection I
guess as it had max RAM)

All in all really strange.

Regards,
Peter.

On Apr 19, 1:28 am, Karussell tableyourt...@googlemail.com wrote:

I've now reindexed the 6mio tweets (via scan). We'll see if this
improves anything.
(E.g. due to different lucene version, different hard disc areas
whatever)

Regards,
Peter.

On 19 Apr., 00:00, Karussell tableyourt...@googlemail.com wrote:

Hi there,

our server crashed every two weeks or so and I thought it was a
hardware failure or something similar. Now it gets even worse: the
servers crashes nearly every two days - I think this gets worse after
I'm triggering a query every 20 seconds to get the index "warmed".

There were no specific messages in ES logs or elsewhere. Also memory
consumption of ES was always within the limit. I also cannot imagine
that a Java program can kill the OS ...

We've investigated a lot. RAM check (no errors found), no hard-disc
error logs found, enough RAM (7.5 available but crashed with 4.5g),
updated to latest master of ES from April, 17th and it keeps still
crashing. I didn't found a hs err pid log. This would be in the folder
where I started ES, right?

Now I have created every minute a node stats snapshot (seehttps://gist.github.com/926242) and the server crashed today at ~
19.16 It was 10 minutes before the next optimize (I marked the first
optimize with optimize at 07:29)

Can someone see a problem with ES there? Or point me into a direction
where I can try to fix the problem? Should I try a complete reindex?
I'm kind of stuck now.

Regards,
Peter.


(Clinton Gormley) #5

Hi Peter

Now, I will try to tune jvm GC settings ...

Before you do that, look at this post:

http://groups.google.com/a/elasticsearch.com/group/users/msg/b81532f844f7619d

clint


(Karussell) #6

Hi Clinton,

thanks for your response! Removing all the explicit jvm settings
didn't help. Crashed (or was unresponsive) again.

I'll definitely try to avoid swapping although I'm not sure if this is
memory related at all.

Because I have now created a thread dump (while unresponsiveness)
where over 450 threads were created.
But the important point was: over 300 of them were in mvel methods!
Now I disabled the usage of mvel in my app ... we'll see tomorrow :slight_smile:

What is locking memory? Does it mean the OS won't touch this memory
and move it to swap? Or how is this different to swapoff -a?

Regards,
Peter

On Apr 20, 11:54 am, Clinton Gormley clin...@iannounce.co.uk wrote:

Hi Peter

Now, I will try to tune jvm GC settings ...

Before you do that, look at this post:

http://groups.google.com/a/elasticsearch.com/group/users/msg/b81532f8...

clint


(Clinton Gormley) #7

Hi Peter

What is locking memory? Does it mean the OS won't touch this memory
and move it to swap? Or how is this different to swapoff -a?

swapoff -a turns off swap for your whole system.

To guarantee that no swap is used for just one process, you can "lock"
that memory.

clint


(Shay Banon) #8

Peter, which JVM version are you using? Which vendor?
On Thursday, April 21, 2011 at 3:04 PM, Clinton Gormley wrote:

Hi Peter

What is locking memory? Does it mean the OS won't touch this memory
and move it to swap? Or how is this different to swapoff -a?

swapoff -a turns off swap for your whole system.

To guarantee that no swap is used for just one process, you can "lock"
that memory.

clint


(Karussell) #9

Hi Clinton, hi Shay,

swapoff -a turns off swap for your whole system.

ok, that would be ok.

I'm using 64bit from 'sun':

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)


(Shay Banon) #10

Looks good. Anything in the logs to indicate why the OS is crashing?
On Thursday, April 21, 2011 at 5:53 PM, Karussell wrote:

Hi Clinton, hi Shay,

swapoff -a turns off swap for your whole system.

ok, that would be ok.

I'm using 64bit from 'sun':

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)


(Karussell) #11

Anything in the logs to indicate why the OS is crashing?

Not really, but today it didn't crash and runs faster, because I did a
minor change.

I used mvel indirectly here:

facetBuilder =
FacetBuilders.termsStatsFacet(ff).keyField(ff).valueScript("doc.score").order(ComparatorType.TOTAL).size(limit);

now I'm using only:

FacetBuilders.termsFacet(ff).field(ff).size(limit);

The indication that something could be wrong with mvel comes from the
thread dump mentioned before.
Where 70% of the >300 threads where in the appended state ** sometimes
with the first two lines (put + addEntry). sometimes not.

One of the two dumps contains two threads which were blocked:

java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.HashMap.resize(HashMap.java:462)
at java.util.HashMap.addEntry(HashMap.java:755)
at java.util.HashMap.put(HashMap.java:385)
at
org.elasticsearch.common.mvel2.integration.impl.ClassImportResolverFactory.(ClassImportResolverFactory.java:
49)
at
org.elasticsearch.common.mvel2.compiler.CompiledExpression.getValue(CompiledExpression.java:
125)

java.lang.Thread.State: BLOCKED (on object monitor)
at
org.elasticsearch.common.mvel2.integration.impl.ClassImportResolverFactory.(ClassImportResolverFactory.java:
49)
at
org.elasticsearch.common.mvel2.compiler.CompiledExpression.getValue(CompiledExpression.java:
125)

I can also post the full dumps if you like or we should just wait
until monday :slight_smile: ... if it crashes again something different is the
problem.

Regards,
Peter.

**

"elasticsearch[search]-pool-3-thread-328" daemon prio=10
tid=0x00007f0d6c82a800 nid=0x4e6a runnable [0x00007f0d5a7e7000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.addEntry(HashMap.java:753)
at java.util.HashMap.put(HashMap.java:385)
at
org.elasticsearch.common.mvel2.integration.impl.ClassImportResolverFactory.(ClassImportResolverFactory.java:
49)
at
org.elasticsearch.common.mvel2.compiler.CompiledExpression.getValue(CompiledExpression.java:
125)
at org.elasticsearch.script.mvel.MvelScriptEngineService
$MvelSearchScript.run(MvelScriptEngineService.java:168)
at org.elasticsearch.script.mvel.MvelScriptEngineService
$MvelSearchScript.runAsDouble(MvelScriptEngineService.java:180)
at
org.elasticsearch.search.facet.termsstats.strings.TermsStatsStringFacetCollector
$ScriptAggregator.onValue(TermsStatsStringFacetCollector.java:213)
at
org.elasticsearch.index.field.data.strings.MultiValueStringFieldData.forEachValueInDoc(MultiValueStringFieldData.java:
79)
at
org.elasticsearch.search.facet.termsstats.strings.TermsStatsStringFacetCollector.doCollect(TermsStatsStringFacetCollector.java:
126)
at
org.elasticsearch.search.facet.AbstractFacetCollector.collect(AbstractFacetCollector.java:
74)
at
org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:
57)
at org.apache.lucene.search.Scorer.score(Scorer.java:90)
at org.apache.lucene.search.ConstantScoreQuery
$ConstantScorer.score(ConstantScoreQuery.java:241)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:519)
at
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:
169)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:384)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:291)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:279)
at
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:211)
at
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:
222)
at
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:
134)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:
at
org.elasticsearch.action.search.type.TransportSearchTypeAction
$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:192)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction
$BaseAsyncAction.access$000(TransportSearchTypeAction.java:75)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction
$BaseAsyncAction$2.run(TransportSearchTypeAction.java:169)
at java.util.concurrent.ThreadPoolExecutor
$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor
$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


(Shay Banon) #12

Lemme ping mike from mvel about this, had a brief look at the code, might be a thread safety issue.
On Thursday, April 21, 2011 at 11:27 PM, Karussell wrote:

ClassImportResolverFactory


(Karussell) #13

I didn't viewed a thread-count peak again (thread count was normal
with ~60-70) but the system still crashes regularly without any
reason :frowning:

(A lot times I saw that CPU usage increases continuously until >700%
but today it crashed at ~130%)

Regards,
Peter.

On Apr 22, 1:03 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Lemme ping mike from mvel about this, had a brief look at the code, might be a thread safety issue.

On Thursday, April 21, 2011 at 11:27 PM, Karussell wrote:

ClassImportResolverFactory


(system) #14