ES OOMing and not triggering cache circuit breakers, using LocalManualCache

Wilfred_Hughes · February 11, 2015, 4:29pm

Hi all

I have an ES 1.2.4 cluster which is occasionally running out of heap. I
have ES_HEAP_SIZE=31G and according to the heap dump generated, my biggest
memory users were:

org.elasticsearch.common.cache.LocalCache$LocalManualCache 55%
org.elasticsearch.indices.cache.filter.IndicesFilterCache 11%

and nothing else used more than 1%.

It's not clear to me what this cache is. I can't find any references to
ManualCache in the elasticsearch source code, and the docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/index-modules-fielddata.html
suggest to me that the circuit breakers should stop requests or reduce
cache usage rather than OOMing.

At the moment my cache was filled up, the node was actually trying to index
some data:

[2015-02-11 08:14:29,775][WARN ][index.translog ] [data-node-2] [logstash-2015.02.11][0] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [logstash-2015.02.11][0] Flush failed
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
    at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
    at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
    ... 5 more
[2015-02-11 08:14:29,812][DEBUG][action.bulk ] [data-node-2] [logstash-2015.02.11][0] failed to execute bulk item (index) index {[logstash-2015.02.11][syslog_slurm][1
org.elasticsearch.index.engine.CreateFailedEngineException: [logstash-2015.02.11][0] Create failed for [syslog_slurm#12UUWk5mR_2A1FGP5W3_1g]
    at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:393)
    at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:384)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:430)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:433)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.fst.BytesStore.writeByte(BytesStore.java:83)
    at org.apache.lucene.util.fst.FST.<init>(FST.java:286)
    at org.apache.lucene.util.fst.Builder.<init>(Builder.java:163)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.compileIndex(BlockTreeTermsWriter.java:422)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:572)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter$FindBlocks.freeze(BlockTreeTermsWriter.java:547)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:214)
    at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1039)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:548)
    at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
    at org.apache.lucene.index.TermsHash.flush(TermsHash.java:116)
    at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:465)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:518)
    at org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:368)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1537)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1207)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:459)
    at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:386)
    ... 8 more

Could anyone clarify as to what this cache is, or point me towards some
docs?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f69f0bf1-fe55-4832-9a81-e641851cb4bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Wilfred_Hughes · February 11, 2015, 5:50pm

After examining some other nodes that were using a lot of their heap, I
think this is actually the field data cache:

$ curl "http://localhost:9200/_cluster/stats?human&pretty"
...
    "fielddata": {
      "memory_size": "21.3gb",
      "memory_size_in_bytes": 22888612852,
      "evictions": 0
    },
    "filter_cache": {
      "memory_size": "6.1gb",
      "memory_size_in_bytes": 6650700423,
      "evictions": 12214551
    },

Since this is storing logstash data, I'm going to add the following lines
to my elasticsearch.yml and see if I observe a difference once deployed to
production.

# Don't hold field data caches for more than a day, since data is
# grouped by day and we quickly lose interest in historical data.
indices.fielddata.cache.expire: "1d"

polyfractal · February 11, 2015, 6:57pm

LocalManualCache is a component of Guava's LRU cache
https://code.google.com/p/guava-libraries/source/browse/guava-gwt/src-super/com/google/common/cache/super/com/google/common/cache/CacheBuilder.java,
which Elasticsearch uses for both the filter cache and the field data cache.
Based on the stats you posted, I agree that field data usage is what is
causing your OOMs. The circuit breaker helps prevent OOMs, but it works on a
per-request basis. Individual requests can pass the breaker because each
one uses only a small subset of fields, but over time the set of fields
loaded into field data keeps growing and you'll OOM anyway.

I would set a field data size limit rather than an expiration. A hard
limit prevents OOMs because the cache cannot grow past it. An expiration
gives no such guarantee: a burst of activity can still fill the heap and
cause an OOM before the expiration takes effect.

-Z
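
The distinction drawn above can be illustrated with a toy model. This is only a sketch of the argument in this post, not Elasticsearch's or Guava's actual implementation: the memory units, the thresholds, and the function names are all made up for illustration. It compares a check applied to each request on its own with a hard cap that evicts the least recently used entries.

```python
from collections import OrderedDict

HEAP_BUDGET = 100        # hypothetical units of memory we can afford to cache
PER_REQUEST_LIMIT = 60   # hypothetical threshold for a single request


def load_with_breaker_only(requests):
    """Admit each request if its own size is under the per-request limit.

    Nothing is ever evicted, so the running total is unbounded.
    """
    cache = {}
    for field, size in requests:
        if size > PER_REQUEST_LIMIT:
            continue  # the breaker trips and this request is rejected
        cache[field] = size
    return sum(cache.values())


def load_with_size_limit(requests, max_size):
    """Admit requests into an LRU cache capped at max_size units."""
    cache = OrderedDict()
    total = 0
    for field, size in requests:
        if field in cache:
            cache.move_to_end(field)  # mark as most recently used
            continue
        if size > max_size:
            continue  # an entry larger than the whole cache cannot be held
        # Evict least recently used entries until the new one fits.
        while total + size > max_size:
            _, evicted_size = cache.popitem(last=False)
            total -= evicted_size
        cache[field] = size
        total += size
    return total


# Five requests, each loading a different field of 30 units.
requests = [(f"field_{i}", 30) for i in range(5)]

breaker_only_total = load_with_breaker_only(requests)
limited_total = load_with_size_limit(requests, max_size=90)

print(breaker_only_total)  # 150: every request passed, yet the total exceeds HEAP_BUDGET
print(limited_total)       # 90: older entries were evicted to stay under the cap
```

In this model every request looks harmless in isolation, which is why only the capped cache stays within budget.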

Wilfred_Hughes · February 12, 2015, 3:15pm

Oh, is field data per-node or total across the cluster? I grabbed a test
cluster with two data nodes, and I deliberately set fielddata really low:

indices.fielddata.cache.size: "100mb"

However, after a few queries, I'm seeing more than 100MiB in use:

$ curl "http://localhost:9200/_cluster/stats?human&pretty"
...
    "fielddata": {
      "memory_size": "119.7mb",
      "memory_size_in_bytes": 125543995,
      "evictions": 0
    },

Is this expected?
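
One way to tell a per-node limit from a cluster-wide total is to compare per-node figures against the configured size. The sketch below assumes a response shaped like the node stats output (nodes, then node ID, then indices.fielddata.memory_size_in_bytes); the node IDs, node names, and the split between the two nodes are invented for illustration, and only the 125543995 total comes from the output above. It also assumes "100mb" means 100 * 1024 * 1024 bytes.

```python
# Hypothetical per-node data: the IDs, names, and the split between nodes are
# made up; only the sum (125543995) matches the figure quoted in this thread.
sample_nodes_stats = {
    "nodes": {
        "hypothetical-id-1": {
            "name": "data-node-1",
            "indices": {"fielddata": {"memory_size_in_bytes": 62771998, "evictions": 0}},
        },
        "hypothetical-id-2": {
            "name": "data-node-2",
            "indices": {"fielddata": {"memory_size_in_bytes": 62771997, "evictions": 0}},
        },
    }
}

# "100mb", assuming 1 mb = 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def fielddata_bytes_by_node(stats):
    """Map each node name to its field data memory usage in bytes."""
    return {
        node["name"]: node["indices"]["fielddata"]["memory_size_in_bytes"]
        for node in stats["nodes"].values()
    }


per_node = fielddata_bytes_by_node(sample_nodes_stats)
cluster_total = sum(per_node.values())
nodes_over_limit = [name for name, used in per_node.items() if used > LIMIT_BYTES]

print(per_node)
print(cluster_total)      # 125543995, above LIMIT_BYTES (104857600)
print(nodes_over_limit)   # []: no single node exceeds the limit
```

With numbers like these, a cluster-wide total above 100mb is still consistent with a limit that is enforced on each node separately.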

Wilfred_Hughes · February 19, 2015, 9:30am

After some experimentation, I believe _cluster/stats shows the total field
data across the whole cluster. I managed to push my test cluster to 198MiB
of field data cache usage.

As a result, based on Zachary's feedback, I've set the following values in
my elasticsearch.yml:

indices.fielddata.cache.size: "15gb"
indices.fielddata.cache.expire: "7d"
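
As a rough sanity check on these values, the arithmetic below relates the 15gb cap to the 31G heap from the first post. It assumes that both "gb" and "G" mean GiB and that the cache size applies to each node's own heap; neither assumption is confirmed in this thread.

```python
GIB = 1024 ** 3

heap_bytes = 31 * GIB           # ES_HEAP_SIZE=31G, assumed to mean GiB
fielddata_cap_bytes = 15 * GIB  # indices.fielddata.cache.size: "15gb", assumed GiB

# Share of the heap the field data cache may occupy, and what is left over
# for everything else (filter cache, indexing buffers, request handling).
share_of_heap = fielddata_cap_bytes / heap_bytes
remaining_gib = (heap_bytes - fielddata_cap_bytes) / GIB

print(f"{share_of_heap:.1%}")  # 48.4%
print(remaining_gib)           # 16.0
```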

On Thursday, 12 February 2015 15:15:32 UTC, Wilfred Hughes wrote:

Oh, is field data per-node or total across the cluster? I grabbed a test
cluster with two data nodes, and I deliberately set fielddata really low:

indices.fielddata.cache.size: "100mb"

However, after a few queries, I'm seeing more than 100MiB in use:

$ curl "http://localhost:9200/_cluster/stats?human&pretty"
...
"fielddata": {
"memory_size": "119.7mb",
"memory_size_in_bytes": 125543995,
"evictions": 0
},

Is this expected?

On Wednesday, 11 February 2015 18:57:28 UTC, Zachary Tong wrote:

LocalManualCache is a component of Guava's LRU cache
https://code.google.com/p/guava-libraries/source/browse/guava-gwt/src-super/com/google/common/cache/super/com/google/common/cache/CacheBuilder.java,
which is used by Elasticsearch for both the filter and field data cache.
Based on your node stats, I'd agree it is the field data usage which is
causing your OOMs. CircuitBreaker helps prevent OOM, but it works on a
per-request basis. It's possible for individual requests to pass the CB
because they use small subsets of fields, but over-time the set of fields
loaded into Field Data continues to grow and you'll OOM anyway.

I would prefer to set a field data limit, rather than an expiration. A
hard limit prevents OOM because you don't allow the cache to grow anymore.
An expiration does not guarantee that, since you could get a burst of
activity that still fills up the heap and OOMs before the expiration can
work.

-Z

On Wednesday, February 11, 2015 at 12:50:45 PM UTC-5, Wilfred Hughes
wrote:

After examining some other nodes that were using a lot of their heap, I
think this is actually field data cache:

$ curl "http://localhost:9200/_cluster/stats?human&pretty"
...
"fielddata": {
"memory_size": "21.3gb",
"memory_size_in_bytes": 22888612852,
"evictions": 0
},
"filter_cache": {
"memory_size": "6.1gb",
"memory_size_in_bytes": 6650700423,
"evictions": 12214551
},

Since this is storing logstash data, I'm going to add the following
lines to my elasticsearch.yml and see if I observe a difference once
deployed to production.

Don't hold field data caches for more than a day, since data is

grouped by day and we quickly lose interest in historical data.

indices.fielddata.cache.expire: "1d"

On Wednesday, 11 February 2015 16:29:22 UTC, Wilfred Hughes wrote:

Hi all

I have an ES 1.2.4 cluster which is occasionally running out of heap. I
have ES_HEAP_SIZE=31G and according to the heap dump generated, my biggest
memory users were:

org.elasticsearch.common.cache.LocalCache$LocalManualCache 55%
org.elasticsearch.indices.cache.filter.IndicesFilterCache 11%

and nothing else used more than 1%.

It's not clear to me what this cache is. I can't find any references to
ManualCache in the elasticsearch source code, and the docs:
Elasticsearch Platform — Find real-time answers at scale | Elastic
suggest to me that the circuit breakers should stop requests or reduce
cache usage rather that OOMing.

At the moment my cache was filled up, the node was actually trying to
index some data:

[2015-02-11 08:14:29,775][WARN ][index.translog ] [data-node-2] [logstash-2015.02.11][0] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [logstash-2015.02.11][0] Flush failed
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
... 5 more
[2015-02-11 08:14:29,812][DEBUG][action.bulk ] [data-node-2] [logstash-2015.02.11][0] failed to execute bulk item (index) index {[logstash-2015.02.11][syslog_slurm][1
org.elasticsearch.index.engine.CreateFailedEngineException: [logstash-2015.02.11][0] Create failed for [syslog_slurm#12UUWk5mR_2A1FGP5W3_1g]
at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:393)
at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:384)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:430)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:433)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.fst.BytesStore.writeByte(BytesStore.java:83)
at org.apache.lucene.util.fst.FST.<init>(FST.java:286)
at org.apache.lucene.util.fst.Builder.<init>(Builder.java:163)
at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.compileIndex(BlockTreeTermsWriter.java:422)
at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:572)
at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter$FindBlocks.freeze(BlockTreeTermsWriter.java:547)
at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:214)
at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1039)
at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:548)
at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:116)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:465)
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:518)
at org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:368)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1537)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1207)
at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:459)
at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:386)
... 8 more

Could anyone clarify as to what this cache is, or point me towards some
docs?


lisak · June 24, 2015, 3:44pm

If 35% of the heap is allocated to Lucene instances, then a circuit breaker total.limit of 70% will lead to an OOME, so you'd need to decrease it to 60%. Is that correct?
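
For what it's worth, the arithmetic behind that question can be sketched as below. This is only a rough budget check, not a description of how Elasticsearch accounts for memory: the helper name `fits_in_heap` is made up for illustration, the 31 GB figure is the ES_HEAP_SIZE from the original post, and it assumes the worst case where breaker-tracked memory grows all the way to its limit while the other 35% stays resident.

```python
def fits_in_heap(heap_gb, other_fraction, breaker_limit_fraction):
    """Hypothetical helper: check whether breaker-tracked memory at its
    limit, plus a fixed share of other heap usage, fits in the heap.

    Returns (fits, headroom_gb). Negative headroom means the two shares
    together exceed the heap.
    """
    other_gb = heap_gb * other_fraction            # e.g. memory held by Lucene instances
    breaker_gb = heap_gb * breaker_limit_fraction  # worst case: breaker limit fully used
    headroom_gb = heap_gb - other_gb - breaker_gb
    return headroom_gb >= 0, headroom_gb


HEAP_GB = 31.0          # ES_HEAP_SIZE=31G from the original post
OTHER_FRACTION = 0.35   # share assumed in the question above

for limit in (0.70, 0.60):
    fits, headroom = fits_in_heap(HEAP_GB, OTHER_FRACTION, limit)
    print(f"limit={limit:.0%} fits={fits} headroom={headroom:.2f} GB")
```

Under these assumptions 35% + 70% overshoots the heap while 35% + 60% leaves about 5% spare, which is the reasoning in the question. Whether the 35% figure holds on a given node is a separate matter that a heap dump or node stats would have to confirm.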
