Heap dump - OutOfMemoryError: Java heap space


(Akash Narone) #1

Greetings

We are running a 3-node cluster (version 1.4.4, all nodes master-eligible) with a 30 GB heap per node, on JRE 1.7.0_80, with replication set to 1.
Live data: 2.5 TB
Total data: 25 TB
Shards per index: 3
Daily data volume: around 70-80 GB
New indices are created every day.

We are having frequent GC runs, sometimes lasting for hours, which make the cluster unstable every week or so.

Output of jmap -histo from one of the servers: https://pastebin.com/8gcQySx2

Can someone point me in the right direction?

Any help is appreciated.
Thanks in advance.


(David Pilato) #2

How many shards in total do you have?

You definitely need to upgrade anyway.


(Akash Narone) #3

The number of primary shards is around 120, so roughly 240 shards in total with replication set to 1.
Sorry to say, but upgrading is not an option in the current situation.


(Akash Narone) #4

Can someone explain what the top 5-6 items using large chunks of memory in the jmap -histo output mean in terms of Elasticsearch?
Link: https://pastebin.com/8gcQySx2


(Christian Dahlqvist) #5

Doc values were introduced early in the Elasticsearch 1.x series, but were initially quite slow. Performance improved so much over the 1.x releases that, by version 1.7, they were good enough to be enabled by default in Elasticsearch 2.0.

If you have fields that you need to aggregate on but do not perform free-text search on, you can save a lot of heap space by mapping them with doc_values.
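
For example, something along these lines in a 1.x index template would enable doc values for a not_analyzed string field (a sketch only; the template, index pattern, type, and field names here are made up, so substitute your own):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Sketch: a 1.x index template that maps a not_analyzed string field with
# doc values, so aggregating/sorting on it reads from disk instead of
# building heap-resident fielddata. All names below are hypothetical.
es.indices.put_template(
    name="logs-template",
    body={
        "template": "logs-*",  # applied to each new daily index
        "mappings": {
            "logs": {  # the single document type
                "properties": {
                    "src_ip": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True,
                    }
                }
            }
        },
    },
)
```

Existing indices keep their old mapping, so with daily indices the change takes effect from the next day's index onwards.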

What do your data and mappings look like? Are you using doc values?


(Akash Narone) #6

No, we are not using doc values.

A daily index typically looks like:
https://pastebin.com/EJUFpxqK

There is only one type in the daily indices, and we don't aggregate on analyzed fields.
The data we capture consists of logs generated by devices such as firewalls and Windows and Linux machines.


(Christian Dahlqvist) #7

not_analyzed fields are still loaded onto the heap as fielddata unless you use doc values, so I suspect that is one thing driving your heap usage.

I recall heap pressure dropping significantly when users switched to doc_values, allowing much larger data volumes to be stored per node. If I remember correctly though (this was a long time ago, as this is a very old version), doc values require the use of aggregations rather than facets, so Kibana 3 may not work with doc values.


(Akash Narone) #8

Currently, on one of the nodes, heap usage is at 65 percent and fielddata is 1.4 GB, and this holds true for all of the nodes.
We have even added scripts to clear the cache once heap usage reaches 80-85 percent.
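
A minimal sketch of what such a script can look like (the node address and threshold are assumptions; it checks the nodes stats API and clears the fielddata cache over REST):

```python
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster
THRESHOLD = 80                # clear the cache at 80 percent heap usage

# Meant to be run periodically (e.g. from cron). The nodes stats API
# reports heap_used_percent per node in 1.x.
nodes = requests.get(ES + "/_nodes/stats/jvm").json()["nodes"]

if any(n["jvm"]["mem"]["heap_used_percent"] >= THRESHOLD for n in nodes.values()):
    # Drop the fielddata cache across all indices to relieve heap pressure.
    # This is a stopgap: the cache is simply rebuilt by the next
    # aggregation or sort unless the fields use doc values.
    requests.post(ES + "/_cache/clear", params={"fielddata": "true"})
```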


(Christian Dahlqvist) #9

I would, like David, recommend upgrading. I do not really remember enough to have any other suggestions at this point.


(David Pilato) #10

If you don't want to upgrade, another short-term solution is to start new nodes.


(Akash Narone) #11

Ok, I will keep that in mind.
Meanwhile, is there anything we can do to make the heap issue occur less frequently, other than adding nodes?


(David Pilato) #12

Some ideas:

  • Don't run aggregations
  • Don't sort
  • Remove old indices (see the sketch below)
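
For the last item, a minimal sketch of a daily cleanup job (the index name pattern, retention period, and node address are assumptions; Curator was the usual tool for this at the time):

```python
from datetime import datetime, timedelta
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster
PREFIX = "logs-"              # assumption: daily indices named logs-YYYY.MM.DD
RETENTION_DAYS = 30           # assumption: how much history to keep

cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

# The cat API lists matching index names as plain text, one per line.
listing = requests.get(ES + "/_cat/indices/" + PREFIX + "*", params={"h": "index"})

for name in listing.text.split():
    try:
        day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d")
    except ValueError:
        continue  # not a daily index, leave it alone
    if day < cutoff:
        requests.delete(ES + "/" + name)  # deleting a whole index is cheap
```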

(Akash Narone) #13

Ok, one last question: do you recommend enabling doc values on ES 1.4.4 to resolve, or at least delay, the heap issue?


(Christian Dahlqvist) #14

How are you querying your data? Kibana 3? Kibana 4? Aggregations? Facets? Just searches?


(Akash Narone) #15

Aggregations and searches with sorts, using Python.
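
Roughly like this (a sketch; the index and field names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# A terms aggregation on a not_analyzed field (hypothetical names). Without
# doc values, this loads the whole field into heap-resident fielddata.
es.search(
    index="logs-2015.06.01",
    body={
        "size": 0,
        "aggs": {"top_sources": {"terms": {"field": "src_ip", "size": 10}}},
    },
)

# A search sorted on a timestamp field; sorting also builds fielddata
# unless the field is mapped with doc values.
es.search(
    index="logs-2015.06.01",
    body={
        "query": {"match": {"message": "error"}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 50,
    },
)
```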


(Akash Narone) #16

Automated queries never run against data older than 24 hours, so at most two daily indices will be queried.
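
So each run only ever needs the current and previous day's index names, e.g. (the naming pattern is an assumption):

```python
from datetime import datetime, timedelta

# Assumption: daily indices are named logs-YYYY.MM.DD.
now = datetime.utcnow()
indices = ",".join(
    "logs-" + day.strftime("%Y.%m.%d")
    for day in (now, now - timedelta(days=1))
)
# e.g. "logs-2015.06.02,logs-2015.06.01", passed as the index
# parameter of the search request
```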


(Christian Dahlqvist) #17

If you are not using facets, you should (as far as I can recall) be able to switch to doc values for not_analyzed fields and save on heap space. If this causes performance problems, it could be worthwhile upgrading to version 1.7.6, as a number of performance improvements were introduced throughout the 1.x series.


(Akash Narone) #18

Ok, thank you for the thumbs up. We are not using facets, so I will enable doc values and keep monitoring. If it only takes more time to return results, that is not an issue for me.
Going by your and David's suggestions, we are planning to upgrade to the 2.x series.


(Christian Dahlqvist) #19

I would recommend testing it properly first though, as my memory stretching this far back is a bit hazy...


(Akash Narone) #20

Ok, I will test it and only then apply it in production.