ES 7.11 heap out-of-memory error

Hi!

We upgraded from 2.4 to 7.11, and we are now seeing the issues below across our ES cluster.

  1. ES data nodes keep going down, one after another, due to heap out-of-memory errors.
  2. This morning we were re-indexing at around 100k docs/s and it was fine at first, but at some point we could no longer index at all. All our queue messages are being redelivered because of ES timeout exceptions.

We have 50 data nodes.

Any thoughts on what may be causing this would be appreciated!
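For context, per-node heap pressure in the run-up to the crashes can be watched with the cat nodes API (the column selection here is just one convenient set):

GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent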

Log from a data node just before it went down:

[2021-06-30T13:54:22,381][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [xxxx] fatal error in thread [elasticsearch[xxx][search][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.packed.Packed8ThreeBlocks.<init>(Packed8ThreeBlocks.java:41) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:965) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:941) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.apache.lucene.util.packed.GrowableWriter.ensureCapacity(GrowableWriter.java:80) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.apache.lucene.util.packed.GrowableWriter.set(GrowableWriter.java:88) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.elasticsearch.index.fielddata.ordinals.OrdinalsBuilder$OrdinalsStore.firstLevel(OrdinalsBuilder.java:176) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.index.fielddata.ordinals.OrdinalsBuilder$OrdinalsStore.addOrdinal(OrdinalsBuilder.java:167) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.index.fielddata.ordinals.OrdinalsBuilder.addDoc(OrdinalsBuilder.java:312) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData.loadDirect(PagedBytesIndexFieldData.java:136) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData.loadDirect(PagedBytesIndexFieldData.java:47) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.lambda$load$0(IndicesFieldDataCache.java:135) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$Lambda$6309/0x0000000801c75ec8.load(Unknown Source) ~[?:?]
at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:423) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:132) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.load(AbstractIndexOrdinalsFieldData.java:82) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.load(AbstractIndexOrdinalsFieldData.java:33) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals$FieldData.globalOrdinalsValues(ValuesSource.java:208) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals.globalMaxOrd(ValuesSource.java:180) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.getMaxOrd(TermsAggregatorFactory.java:281) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.access$000(TermsAggregatorFactory.java:40) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory$1.build(TermsAggregatorFactory.java:85) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.doCreateInternal(TermsAggregatorFactory.java:230) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:36) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:63) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:187) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.AggregatorBase.<init>(AggregatorBase.java:64) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.<init>(BucketsAggregator.java:47) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.DeferableBucketAggregator.<init>(DeferableBucketAggregator.java:35) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.sampler.SamplerAggregator.<init>(SamplerAggregator.java:164) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.bucket.sampler.SamplerAggregatorFactory.createInternal(SamplerAggregatorFactory.java:33) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:63) ~[elasticsearch-7.11.2.jar:7.11.2]
at org.elasticsearch.search.aggregations.AggregatorFactories.createSubAggregators(AggregatorFactories.java:187) ~[elasticsearch-7.11.2.jar:7.11.2]
Error 2: The remote server returned an error: (429) Too Many Requests.. Call: Status code 429 from: POST /feed_XXXX/_update/56246%7C1%7C1409625889747439616?if_primary_term=1&if_seq_no=3102960. ServerError: Type: circuit_breaking_exception Reason: "[parent] Data too large, data for [indices:data/write/update[s]] would be [17041414328/15.8gb], which is larger than the limit of [16320875724/15.1gb], real usage: [17041410408/15.8gb], new bytes reserved: [3920/3.8kb], usages [request=0/0b, fielddata=2388112750/2.2gb, in_flight_requests=5126/5kb, model_inference=0/0b, accounting=4148192/3.9mb]"
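The stack trace above shows fielddata (global ordinals) being built on a text field for a terms aggregation, and the circuit_breaking_exception shows the parent breaker tripping on real heap usage. Both can be inspected with standard APIs, for example:

GET _nodes/stats/breaker?pretty
GET _cat/fielddata?v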

Configuration:

ES version: 7.11.2

**elasticsearch.yml**

bootstrap.memory_lock: true
cloud.node.auto_attributes: true

cluster:
  name: XXXXXXX
  routing.allocation.awareness.attributes: aws_availability_zone

discovery:
  seed_providers: ec2
  ec2.groups: XXXXX

network.host: XX.XXX.X.XXX

node:
  name: ${HOSTNAME}
  roles: [ data ]

http.max_content_length: 200mb
siren.memory.root.limit: 2147483647
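Relatedly, the fielddata circuit breaker and cache limits also live in this file, and capping them is a common guard against fielddata-driven OOMs. A minimal sketch with illustrative values, not this cluster's actual settings:

# illustrative values only
indices.breaker.fielddata.limit: 20%
indices.fielddata.cache.size: 10%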

**jvm.options**

# Xmx represents the maximum size of total heap space
-Xms16g
-Xmx16g

-Dsiren.io.netty.maxDirectMemory=2147483648

## GC configuration
-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

-XX:-HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

9-:-Djava.locale.providers=COMPAT
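One thing worth noting in this file: -XX:-HeapDumpOnOutOfMemoryError (note the minus sign) disables heap dumps even though HeapDumpPath is set. To capture a dump for analysis on the next OOM, the stock Elasticsearch default can be restored:

# Elasticsearch's shipped default: write a heap dump on OOM
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch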

Thanks!

What is the output from the _cluster/stats?pretty&human API?
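For example, against any node:

GET _cluster/stats?human&pretty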

Can you elaborate more on your upgrade procedure?

I would suggest removing Siren and seeing if that helps, mostly because it's an unsupported plugin and may have unintended consequences.
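A minimal sketch of checking what is installed and removing a plugin (the plugin name below is a placeholder; use whatever list actually prints):

bin/elasticsearch-plugin list
bin/elasticsearch-plugin remove siren-federate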

Thanks - we figured out the issue. All good.


It'd be good to share your solution; it might help someone in the future.

Our bulk index requests timed out because ES didn't respond when the targeted index alias wasn't present. Once we added the missing alias, indexing worked again.
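For anyone hitting the same thing, an alias can be added after the fact with the aliases API (index and alias names here are placeholders):

POST /_aliases
{
  "actions": [
    { "add": { "index": "feed_xxxx_v2", "alias": "feed_xxxx" } }
  ]
}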


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.