Java failure in the middle of index restore process

After rebooting elasticsearch, I am now unable to get it back into a working state. Below are my startup logs, mostly to show the versions and JVM arguments.

What I'm seeing is the following:

  • as I restart elasticsearch,

  • elasticsearch shows all indexes as red and begins to recover,

    curl 'http://localhost:9200/_cat/indices?v'
    health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    red    open   logstash-2017.12.07 ISsPMF-BS_-1XmbWblF6rQ   5   1    2582599            0      1.3gb          1.3gb
    red    open   logstash-2017.12.05 B6LD_FX2QmOjF2u32Hp-vg   5   1
    red    open   logstash-2017.12.06 dTwfnFjmQY2OFwlLo-ADuw   5   1     906000            0    374.5mb        374.5mb
    red    open   .kibana             YSrJaSkXQFifLErsBsq66Q   1   1
    
  • a few minutes into the recovery process, a Java error appears:

    [2017-12-07T14:26:40,233][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elasticsearch-
    logging-0] fatal error in thread [Thread-42], exiting
    java.lang.OutOfMemoryError: Java heap space
    ... huge stacktrace here, omitted for brevity ...
    

Then elasticsearch restarts and everything repeats all over again (recovery begins, makes some progress, then hits the Java exception).

How can I fix this?


Random thoughts:

  • my elasticsearch setup is purely for storing logs,
  • I'm not interested in full-text search of my logs. I'm using elasticsearch/kibana to store json documents and to perform deep key/value searches inside them ("my_json.key.another_key.another_deep_key=some_value").
  • shortly before the first elasticsearch crash, there was a spike in traffic to elasticsearch: a misconfigured log agent went haywire and sent tons of unnecessary logs.
  • in particular, the logstash-2017.12.07 index grew to several gigabytes, while a daily index is normally roughly 800 megabytes.

Naive ideas so far:

  1. try to increase the RAM cap for elasticsearch – is there any evidence this will help?
  2. lower the amount of memory elasticsearch can use at a time (?) – this would probably increase the time to restore, but at least it shouldn't raise the Java exception?
  3. try to clean up the unusually big index by removing the spammy logs from it – is this feasible? If yes, how?
  4. try to tune whatever Java or Elasticsearch knobs might help – this is new territory for me, so any advice is hugely appreciated.

P. S. For what it's worth, I'm running elasticsearch in Kubernetes.

Here's the startup log:

+ export NODE_NAME=elasticsearch-logging-0
+ NODE_NAME=elasticsearch-logging-0
+ export NODE_MASTER=true
+ NODE_MASTER=true
+ export NODE_DATA=true
+ NODE_DATA=true
+ export HTTP_PORT=9200
+ HTTP_PORT=9200
+ export TRANSPORT_PORT=9300
+ TRANSPORT_PORT=9300
+ export MINIMUM_MASTER_NODES=1
+ MINIMUM_MASTER_NODES=1
+ chown -R elasticsearch:elasticsearch /data
+ ./bin/elasticsearch_logging_discovery
+ exec su elasticsearch -c ./bin/es-docker
[2017-12-07T14:20:24,415][INFO ][o.e.n.Node               ] [elasticsearch-logging-0] initializing ...
[2017-12-07T14:20:24,681][INFO ][o.e.e.NodeEnvironment    ] [elasticsearch-logging-0] using [1] data paths, mounts [[/data (/dev/sdj)]], net usable_space [36.6gb], net total_space [48.9gb], spins? [possibly], types [ext4]
[2017-12-07T14:20:24,681][INFO ][o.e.e.NodeEnvironment    ] [elasticsearch-logging-0] heap size [1.9gb], compressed ordinary object pointers [true]
[2017-12-07T14:20:24,700][INFO ][o.e.n.Node               ] [elasticsearch-logging-0] node name [elasticsearch-logging-0], node ID [EtmvkcJ3SA-Toa8P0qcAjA]
[2017-12-07T14:20:24,700][INFO ][o.e.n.Node               ] [elasticsearch-logging-0] version[5.6.2], pid[16], build[57e20f3/2017-09-23T13:16:45.703Z], OS[Linux/4.4.64+/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_141/25.141-b16]
[2017-12-07T14:20:24,700][INFO ][o.e.n.Node               ] [elasticsearch-logging-0] JVM arguments [-Xms2g, -Xmx2g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.cgroups.hierarchy.override=/, -Des.path.home=/usr/share/elasticsearch]
[2017-12-07T14:20:29,110][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [aggs-matrix-stats]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [ingest-common]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [lang-expression]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [lang-groovy]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [lang-mustache]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [lang-painless]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [parent-join]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [percolator]
[2017-12-07T14:20:29,111][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [reindex]
[2017-12-07T14:20:29,112][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [transport-netty3]
[2017-12-07T14:20:29,112][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded module [transport-netty4]
[2017-12-07T14:20:29,112][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded plugin [ingest-geoip]
[2017-12-07T14:20:29,112][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded plugin [ingest-user-agent]
[2017-12-07T14:20:29,112][INFO ][o.e.p.PluginsService     ] [elasticsearch-logging-0] loaded plugin [x-pack]
[2017-12-07T14:20:36,218][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/65] [Main.cc@128] controller (64 bit): Version 5.6.2 (Build 228329870d1c63) Copyright (c) 2017 Elasticsearch BV
[2017-12-07T14:20:36,225][INFO ][o.e.d.DiscoveryModule    ] [elasticsearch-logging-0] using discovery type [zen]
[2017-12-07T14:20:38,405][INFO ][o.e.n.Node               ] [elasticsearch-logging-0] initialized
[2017-12-07T14:20:38,405][INFO ][o.e.n.Node               ] [elasticsearch-logging-0] starting ...
[2017-12-07T14:20:38,814][INFO ][o.e.t.TransportService   ] [elasticsearch-logging-0] publish_address {10.0.2.25:9300}, bound_addresses {[::]:9300}

What are your current memory settings?

try to increase the RAM cap for elasticsearch – is there any evidence this will help?

Yes, according to the "java.lang.OutOfMemoryError: Java heap space" message.

try to clean up the unusually big index by removing the spammy logs from it

Might be better to remove the index entirely.
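
For example, assuming logstash-2017.12.07 is the index you want to get rid of (the oversized one in your listing above), dropping it is a single call; note that this cannot be undone:

    # Drop the oversized daily index entirely (irreversible):
    curl -XDELETE 'http://localhost:9200/logstash-2017.12.07'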

@dadoonet thank you, David!

What are your current memory settings?

As far as I can tell from the startup logs, the heap size is 1.9gb. Is there a place where I can get a definitive answer to that? Also, which parameter do I tune, and where, to increase the cap?
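
(I guess I could also ask the cluster itself – something like the call below, assuming the heap.max column of _cat/nodes is the authoritative number:)

    # Compare the configured heap cap (heap.max) with current heap and OS RAM usage:
    curl 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.max,ram.current,ram.max'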

Might be better to remove the index entirely.

A couple of questions before I throw away a day of logs.

  • would it be possible to pause the recovery of a particular index? If I wait until all other indices are recovered, do you think resuming recovery of the big index might finish successfully?

  • how hard would it be to copy the indices to a machine with more RAM (my laptop) and clean them up there? Is there a common way to do this?

I see. You can change that in the config/jvm.options file.
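
For example, something along these lines – the 4g value is only an illustration; -Xms and -Xmx should be set to the same value and kept well below the RAM available to the node:

    # Option 1: edit config/jvm.options so that both heap settings match, e.g.
    #   -Xms4g
    #   -Xmx4g
    # Option 2: docker-based setups usually let you pass the same flags via the environment:
    ES_JAVA_OPTS="-Xms4g -Xmx4g" ./bin/elasticsearch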

how hard would it be to copy the indices to a machine with more RAM (my laptop) and clean them up there? Is there a common way to do this?

Cleaning up an index partially is not really ideal, because a DELETE does not actually remove anything in elasticsearch; it just marks the document as no longer available. Only later, when segments are merged, is the "marked as deleted" data actually removed.
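
To illustrate, a partial cleanup would look roughly like the sketch below (the field name and value are placeholders for whatever identifies the spammy documents), and the disk space only comes back after the deleted documents are expunged from the segments:

    # Mark the spammy documents as deleted (field and value are placeholders):
    curl -XPOST 'http://localhost:9200/logstash-2017.12.07/_delete_by_query' \
      -H 'Content-Type: application/json' -d '
    {
      "query": { "term": { "some_spammy_field": "some_value" } }
    }'

    # Reclaim the space by expunging the deleted documents:
    curl -XPOST 'http://localhost:9200/logstash-2017.12.07/_forcemerge?only_expunge_deletes=true'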

I've finally solved the problem.

Initially I forgot to mention that I had:

  • two masters,

  • agents still up & running and writing logs to elasticsearch.

At the time I was writing the question, I had turned the second master off to simplify the equation (sort of) and minimize the number of moving parts.

Here's my poor man's algorithm for resolving this:

  1. stopped the agents – to ensure there was no more traffic to elasticsearch,

  2. restarted elasticsearch,

  3. waited for the recovery to do all it could, i.e. until no more documents were being recovered into the indexes. I just watched the numbers from:

    curl 'http://localhost:9200/_cat/indices?v'
    

    at this point all indexes were still red.

  4. attached the second master again (I suspected some documents had been left only on it, which would explain why the /_cat/indices call above kept showing "red"),

  5. waited until my initial master got all the documents and the status for all indices became yellow,

  6. waited until the second master had copied all the documents and the cluster status became green (a couple of extra commands for watching this kind of recovery are sketched after the list).
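
For reference, besides the /_cat/indices call above, this kind of recovery can also be watched with:

    # Overall cluster status (red/yellow/green) plus counts of initializing/unassigned shards:
    curl 'http://localhost:9200/_cluster/health?pretty'

    # Per-shard recovery progress; active_only hides shards that have already finished:
    curl 'http://localhost:9200/_cat/recovery?v&active_only=true'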

By the time elasticsearch was green, RAM usage on the machine was around 17Gb, so the 1.9Gb Java heap is clearly not an overall memory cap; the heap seems to cover only part of the memory being used, though I'm not sure exactly what the rest is.
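
(If I ever want to dig into that, I assume the node stats API is the place to see the split between the JVM heap and the rest of the memory in use:)

    # Per-node breakdown of JVM heap usage vs. OS-level memory:
    curl 'http://localhost:9200/_nodes/stats/jvm,os?pretty'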

TL;DR: in the end, I was able to restore the elasticsearch cluster by putting less load on it (turning the agents off while it recovered).
