Out of memory? Exception is thrown and service shuts down

Hello,

I ran into issues with elasticsearch and I’m looking for advice or a solution.
My Ruby on Rails application uses elasticsearch through the elasticsearch-rails gem.

Production server info:
Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz # 8 cores
16 GB RAM
200 GB free disk space

# curl localhost:9200
{
  "name" : "qWfkDHn",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "bO29LJBOQKmOIM8CBsYDBQ",
  "version" : {
    "number" : "5.5.0",
    "build_hash" : "260387d",
    "build_date" : "2017-06-30T23:16:05.735Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}
# curl -XGET localhost:9200/_cat/indices
green  open people           Dq5QQ4rvR2yo6vKWPPCI3Q 1 0 436988   101  10.7gb  9.7gb
yellow open file_attachments p2bh604YSDy3yK1CioT9uA 5 1     55     4  923kb  923kb
green  open companies        QZ-O449uT4ypPVfcsq4Neg 1 0 658840 20640  720mb  720mb
yellow open external_links   881Q6MQVScOwMJndB4pI3g 5 1      6     0 50.2kb 50.2kb
yellow open personal_emails  5E5u9_KkSpSjy4dD-Fi7Xg 5 1     13     0 96.5kb 96.5kb
# curl -XGET localhost:9200/_cat/shards
personal_emails  4 p STARTED         4  29.4kb 127.0.0.1 qWfkDHn
personal_emails  4 r UNASSIGNED                          
personal_emails  2 p STARTED         2  14.8kb 127.0.0.1 qWfkDHn
personal_emails  2 r UNASSIGNED                          
personal_emails  3 p STARTED         1   7.8kb 127.0.0.1 qWfkDHn
personal_emails  3 r UNASSIGNED                          
personal_emails  1 p STARTED         2  14.9kb 127.0.0.1 qWfkDHn
personal_emails  1 r UNASSIGNED                          
personal_emails  0 p STARTED         4  29.5kb 127.0.0.1 qWfkDHn
personal_emails  0 r UNASSIGNED                          
external_links   4 p STARTED         0    194b 127.0.0.1 qWfkDHn
external_links   4 r UNASSIGNED                          
external_links   2 p STARTED         1   8.3kb 127.0.0.1 qWfkDHn
external_links   2 r UNASSIGNED                          
external_links   3 p STARTED         2  16.6kb 127.0.0.1 qWfkDHn
external_links   3 r UNASSIGNED                          
external_links   1 p STARTED         2  16.5kb 127.0.0.1 qWfkDHn
external_links   1 r UNASSIGNED                          
external_links   0 p STARTED         1   8.4kb 127.0.0.1 qWfkDHn
external_links   0 r UNASSIGNED                          
companies        0 p STARTED    658840   720mb 127.0.0.1 qWfkDHn
file_attachments 4 p STARTED         8 204.1kb 127.0.0.1 qWfkDHn
file_attachments 4 r UNASSIGNED                          
file_attachments 3 p STARTED        11 188.3kb 127.0.0.1 qWfkDHn
file_attachments 3 r UNASSIGNED                          
file_attachments 2 p STARTED        12 175.8kb 127.0.0.1 qWfkDHn
file_attachments 2 r UNASSIGNED                          
file_attachments 1 p STARTED        16 253.9kb 127.0.0.1 qWfkDHn
file_attachments 1 r UNASSIGNED                          
file_attachments 0 p STARTED         8 100.6kb 127.0.0.1 qWfkDHn
file_attachments 0 r UNASSIGNED                          
people           0 p STARTED    436342  10.7gb 127.0.0.1 qWfkDHn

The above output was captured during a full reindex (about 50% of the way in).

The people dataset consists of around 110K records and it's the most heavily used one.
A single person document may look like this:

https://pastebin.com/PimnrHb1

Elastic was running for weeks with no problems at all. Suddenly the search API stopped working because ES was unavailable. The elasticsearch service had stopped and the latest logs contained:

java.lang.OutOfMemoryError: Java heap space 

I bumped the assigned heap up to 6 GB in /etc/default/elasticsearch:
ES_JAVA_OPTS="-Xms6g -Xmx6g"

Total server memory usage went from 8.5 GB to 12.5 GB.

The elasticsearch service itself seems to use around 10 GB according to service elasticsearch status.
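
The actual JVM heap usage (as opposed to the total process memory) can be checked with the nodes stats API. A minimal call, assuming the default localhost binding:

# curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty&filter_path=nodes.*.jvm.mem'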

Unfortunately, elasticsearch still shuts down from time to time with the same error, and after starting it up again the people shard health shows up as RED.

Search is then unavailable:

[2018-05-15T14:15:58,678][WARN ][r.suppressed             ] path: /people/person/_search, params: {index=people, type=person}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed 

Other exceptions I saw during that time

java.util.concurrent.RejectedExecutionException: event executor terminated

org.elasticsearch.node.NodeClosedException: node closed {qWfkDHn}{qWfkDHnQR7mookwia6CnVA}{PQXCk-8sRFi-mlSlLrAQAQ}{127.0.0.1}{127.0.0.1:9300}

[2018-05-15T14:16:52,409][WARN ][o.e.i.e.Engine           ] [qWfkDHn] [people][0] tried to fail engine but engine is already failed. ignoring. [failed to recover from translog]
org.elasticsearch.index.engine.EngineException: failed to recover from translog

org.elasticsearch.action.UnavailableShardsException: [people][0] primary shard is not active Timeout

[2018-05-15T14:16:52,411][WARN ][o.e.i.c.IndicesClusterStateService] [qWfkDHn] [[people][0]] marking and sending shard failed due to [shard failure, reason [index]]
java.lang.ArrayIndexOutOfBoundsException: -65536

It seems like it's sometimes able to recover from that state on its own; reindexing also helps.
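
For context, the RED state and the ongoing recovery can be watched with the standard health and recovery endpoints, e.g.:

# curl -XGET 'localhost:9200/_cluster/health?pretty&level=indices'
# curl -XGET 'localhost:9200/_cat/recovery/people?v'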

How should I approach this problem? Is there any way to optimize my indices, or should I just get more memory? I truly lack any intuition in this case and I don't understand why it would run out of memory.

Thank you very much! :slight_smile:

PS. I just noticed that the people index has only 1 shard. Would switching to 5 shards help?

I realized that there is no point in having any replicas when elastic runs on only one node. I will change my replica count to 0.
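
Something like this should apply the setting to all existing indices at once (untested sketch; newly created indices would still pick up their own defaults):

# curl -XPUT 'localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'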

# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "personal_emails",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2018-05-16T05:45:56.593Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "qWfkDHnQR7mookwia6CnVA",
      "node_name" : "qWfkDHn",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[personal_emails][4], node[qWfkDHnQR7mookwia6CnVA], [P], s[STARTED], a[id=FUXMqAwfRVWfAKvg3lJwDw]]"
        }
      ]
    }
  ]
}

I don't think that will solve the java.lang.OutOfMemoryError: Java heap space issue though.

Can you post the entire OOM error section?

[2018-05-11T19:41:28,278][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[qWfkDHn][refresh][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.util.ByteBlockPool$DirectTrackingAllocator.getByteBlock(ByteBlockPool.java:103) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:203) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.util.ByteBlockPool.allocSlice(ByteBlockPool.java:258) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:201) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:219) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:180) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:183) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:273) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:433) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1384) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.elasticsearch.index.engine.InternalEngine.update(InternalEngine.java:736) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:601) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:505) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:556) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:545) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:484) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:143) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:113) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:69) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:939) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:908) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:322) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:264) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:888) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:885) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147) ~[elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1657) ~[elasticsearch-5.5.0.jar:5.5.0]
[2018-05-11T19:42:12,148][INFO ][o.e.n.Node               ] [] initializing ...
[2018-05-11T19:41:28,262][ERROR][o.e.i.e.Engine           ] [qWfkDHn] [people][0] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.incRefDeleter(IndexWriter.java:5014) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:117) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:467) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:906) [elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632) [elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690) [elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.IndexService.access$400(IndexService.java:92) [elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832) [elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743) [elasticsearch-5.5.0.jar:5.5.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.0.jar:5.5.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: Java heap space

The latest logs of the day contained many of these; I guess it shut down after that.

Sorry to bother you. Do you have any clue about what could be wrong?

Is there anything before that as well?

Here is almost the entire log of the day.

Two things jump out at me:

[2018-05-12T07:12:12,271][WARN ][o.e.m.j.JvmGcMonitorService] [qWfkDHn] [gc][57] overhead, spent [1.5s] collecting in the last [1.6s]
[2018-05-12T07:12:12,274][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer

You have very high GC there, which is a problem.
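
You can keep an eye on heap pressure while that reindex runs with something like:

# curl -XGET 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'

If heap.percent sits near 100% while you index, the node is running right at its limit.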

Are you able to upgrade to 5.6.9?

I'll do it then.

Does 5.6.9 do a better job in terms of GC?
Just asking because elasticsearch is a pretty neat creature and I want to learn more about it.
I don't see anything about it in https://www.elastic.co/guide/en/elasticsearch//reference/5.6/release-notes-5.6.9.html

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.