Please suggest performance settings for my ElasticSearch cluster


(Nikolay) #1

Hi, I'm having trouble with my Elasticsearch cluster and don't know what to do. Here is my setup:

Server:
64GB of RAM
8 cores
500GB SSD

Data:
10 different indices, each with 1M rows
10M rows in total
11GB of data

Current ES cluster:
3 nodes on the same server
1 master node and 2 slave nodes
8g Heap size for each node
bootstrap.mlockall: true

Search requests:
Search Rate: 150 /s
Search Latency: 1ms
Each search request has a lot of aggregations and filters

Problem:
When the JVM heap goes over 90%, the nodes stop responding. When I restart them, everything works fine for about 3 days, until the heap reaches 90% again and the cluster stops responding.

Here is the graph when the heap is > 90%

What I need:

  • Can someone suggest settings for my elasticsearch.yml so that the caches are handled well for the setup above?
  • What can I do about the >90% JVM heap problem?

Thanks
Nik


(Christian Dahlqvist) #2

Why are you running multiple nodes on the server when you can simply run a single node with 30GB of heap?
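To make the arithmetic behind this suggestion concrete, here is a minimal sketch. It assumes the commonly cited guidance of giving Elasticsearch at most half of the machine's RAM while staying below roughly 32GB, so the JVM can keep using compressed object pointers. The 30GB ceiling is the figure from the post above; the function name is made up for illustration.

```python
def suggested_heap_gb(ram_gb, ceiling_gb=30):
    """Return a heap size in GB: half of RAM, capped at ceiling_gb."""
    return min(ram_gb // 2, ceiling_gb)


# The server in this thread has 64GB of RAM.
current_total_heap_gb = 3 * 8               # three nodes with 8g of heap each
single_node_heap_gb = suggested_heap_gb(64)

print(current_total_heap_gb)   # 24
print(single_node_heap_gb)     # 30
```

With these numbers a single node gets more heap than the three nodes combined, and the remaining RAM stays available to the operating system's file cache.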


(Nikolay) #3

I tried that, but garbage collection took longer and there were timeouts whenever the GC started to clean up.


(Christian Dahlqvist) #4

What does your node configuration look like? Do you have any custom settings or are you running with the defaults?


(Nikolay) #5

These are my settings:

cluster.name: XXXX
node.name: XXX-master
node.data: true
node.master: true

bootstrap.mlockall: true
index.merge.policy.merge_factor: 5

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 1000
threadpool.bulk.queue_size: 30000

# Index pool
threadpool.index.type: fixed
threadpool.index.size: 1000
threadpool.index.queue_size: 10000

#search pool
threadpool.search.queue_size: 4000

index.cache.query.enable: true
index.requests.cache.enable: true
indices.cache.query.size: 25%

indices.fielddata.cache.size: 25%
indices.cache.filter.size: 25%
transport.tcp.compress: true;

#index
index.store.type: mmapfs

network.bind_host: xxx.xxx.xxx.xxx
network.publish_host: xxx.xxx.xxx.xxx
network.host: xxx.xxx.xxx.xxx
discovery.zen.ping.unicast.hosts: ["xxx.xxx.xxx.xxx"]
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.timeout: 10s
transport.tcp.port: 9300
http.port: 9200
http.max_content_length: 500mb
index.routing.allocation.disable_allocation: false

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

script.engine.groovy.inline.aggs: on
script.inline: on
script.indexed: on
index.max_result_window: 10000

When the JVM heap is high, I start getting messages like this in the log files:

[2016-09-11 09:03:54,738][WARN ][transport ] [XXX-master-1] Transport response handler not found of id [16262715]


(Mark Walkom) #6

No no no no no.
All of this is in memory and having such ridiculously large settings is going to add more heap pressure.
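A rough way to see the heap pressure is to add up what the posted settings allow. The sketch below is only illustrative: it assumes each cache limit is honored as written and that the limits apply independently, which the thread does not confirm for this Elasticsearch version. The 8g heap comes from post #1; the function name is hypothetical.

```python
HEAP_GB = 8  # heap per node, from post #1

# Cache limits from the posted elasticsearch.yml, as fractions of the heap.
cache_limits = {
    "indices.cache.query.size": 0.25,
    "indices.fielddata.cache.size": 0.25,
    "indices.cache.filter.size": 0.25,
}


def cache_budget_gb(heap_gb, limits):
    """Upper bound on the heap the listed caches may occupy together."""
    return heap_gb * sum(limits.values())


budget = cache_budget_gb(HEAP_GB, cache_limits)
print(f"caches may use up to {budget:.1f}GB of a {HEAP_GB}GB heap")

# Queue slots configured for the bulk, index and search thread pools.
queued_requests = 30000 + 10000 + 4000
print(f"up to {queued_requests} requests may wait in queues, all held in memory")
```

Under these assumptions the caches alone are allowed to claim three quarters of the heap before any queued request or aggregation is counted.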


(Nikolay) #7

To explain the process:

As I said, I have 10 indices, each with 1M rows. The indices hold job ads for 10 different countries, one index per country. Once a day, at different times, I rebuild each index: I create a new index for the selected country, fill it with data, and remove the old index.

Should I remove the settings above?


(Christian Dahlqvist) #8

Yes. Start with the default settings and then start tweaking from that point if necessary. More is not always better when it comes to these settings and the defaults are generally very good in my experience.


(Nikolay) #9

Ok I will remove them. What about the cache settings? Are they ok? What do you suggest?


(Christian Dahlqvist) #10

Unless you have reached these through systematic testing and evaluation, I would recommend starting with all default settings. Whether those cache settings are right or not for your use case is impossible for me to tell.


(Nikolay) #11

That is the question. I don't know. Should the JVM heap reach 90% on a well-optimized node or not? What is a good average % for the heap?
Why does my cluster die when the JVM heap is > 90%? I cannot understand the problem.


(Christian Dahlqvist) #12

Does it reach 90% after you have removed your custom settings?


(Nikolay) #13

Yes! :frowning: Every 2 days the JVM heap goes over 90% and the nodes die.

Any suggestions on what to do? If you need some stats, I can provide them.


(Christian Dahlqvist) #14

Can you please post your full current configuration? What does Marvel show with respect to heap usage over these 2 days before it reaches 90%?
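For readers without Marvel, the same heap figure can be read from the node stats API. The sketch below only parses a response; the sample dictionary is made up and merely shaped like the output of `GET _nodes/stats/jvm`, and the node names and IDs in it are placeholders.

```python
def heap_percent_by_node(stats):
    """Map node name -> heap_used_percent from a node stats response."""
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in stats["nodes"].values()
    }


# Made-up sample, shaped like a `GET _nodes/stats/jvm` response.
sample = {
    "nodes": {
        "abc123": {"name": "node-1", "jvm": {"mem": {"heap_used_percent": 91}}},
        "def456": {"name": "node-2", "jvm": {"mem": {"heap_used_percent": 64}}},
    }
}

for name, pct in sorted(heap_percent_by_node(sample).items()):
    print(name, pct)
```

Recording these values at a fixed interval gives the heap trend over the two days that the question asks about.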


(Kim Kruse Hansen) #15

I have suffered from the exact same problem. I have tried various heap sizes, going from 8 to 12 to 16 to 20 GB, and none was sufficient. Each node has to be restarted every 2 days or so. Old-generation GC runs are extremely long, in the range of 2 minutes or more.

My latest experiment is to maximize the heap at 30 GB, and the cluster is now on day 5. So this has definitely helped, but I am still seeing a small growth in heap usage. So sooner or later I will probably have to restart the nodes.

I am also monitoring heap usage; if it is consistently over 90%, a script restarts the node automatically.
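The script itself is not shown in the thread. One way the "consistently over 90%" check could look is sketched below; the 90% threshold comes from the post, while the window of five consecutive readings and the function name are assumptions made for illustration.

```python
def should_restart(samples, threshold=90, min_consecutive=5):
    """True if the last min_consecutive heap readings all exceed threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(pct > threshold for pct in samples[-min_consecutive:])


print(should_restart([70, 92, 93, 95, 94, 96]))   # True
print(should_restart([92, 93, 75, 94, 96]))       # False
```

Requiring several readings in a row keeps a single spike just before a garbage collection from triggering a restart.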


(Nikolay) #16

Hi Kim, this is a very bad solution :slight_smile: I don't want to have this stress every day. This is impossible! It must be some setting, but we haven't found it yet.

For example, I have another project where the index is only 2GB, there are 30-40 requests per second, and the server and heap are working fine. There I have only one index.

But in this problematic project I have 10 indices, each with 1GB of data. The heap goes over 90% very quickly and the nodes stop responding.

I hope someone from the Elasticsearch team can help us solve this problem!


(Nikolay) #17

Hi Christian, here is the data:


The settings of the node:

cluster.name: xxxx
node.name: xxxx-master
node.data: true
node.master: true

bootstrap.mlockall: true
index.merge.policy.merge_factor: 5

threadpool.index.queue_size: 10000
index.cache.query.enable: true

transport.tcp.compress: true;
index.store.type: mmapfs

network.bind_host: xxx.xxx.xxx.xxx
network.publish_host: xxx.xxx.xxx.xxx
network.host: xxx.xxx.xxx.xxx
discovery.zen.ping.unicast.hosts: ["xxx.xxx.xxx.xxx"]
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.timeout: 10s
transport.tcp.port: 9300
http.port: 9200
http.max_content_length: 500mb
index.routing.allocation.disable_allocation: false

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

script.engine.groovy.inline.aggs: on
script.inline: on
script.indexed: on
index.max_result_window: 10000

As you suggested, I have removed all the additional nodes and now run a single node with 30GB of heap. This doesn't help! After one day the heap is still 90% full and the node dies and stops responding.

I now have 3x more heap space than I have index data.

Any suggestions?

Thanks
Nik


(Mark Walkom) #18

You look to have too many shards; nearly 200 for only 6GB of data is going to waste a lot of resources.

What is in your slow log?

You should remove all of those custom settings; they are either pointless or dangerous.


(Nikolay) #19

Hi, where have you seen this 200? I actually have 8 shards for each index.

Is index.store.type: mmapfs not the best choice for an Elasticsearch index?

Thanks
Nik
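The two shard figures in this exchange can be put side by side with a little arithmetic. The sketch below assumes one replica per primary shard, which is the Elasticsearch default but is not stated anywhere in the thread; the function names are made up for illustration.

```python
def total_shards(indices, primaries_per_index, replicas=1):
    """Primary plus replica shard copies across all indices."""
    return indices * primaries_per_index * (1 + replicas)


def avg_shard_mb(data_gb, shard_count):
    """Average data per shard in MB (1GB = 1024MB)."""
    return data_gb * 1024 / shard_count


print(total_shards(10, 8))            # 160
print(round(avg_shard_mb(6, 200)))    # 31
```

Under that assumption, 10 indices with 8 primaries each already give 160 shard copies; the thread does not explain the gap to the nearly 200 shown in Marvel. Either way, each shard holds only a few tens of MB.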


(Mark Walkom) #20

It's in the first picture you posted (in Marvel/Monitoring).

Have a read of https://www.elastic.co/guide/en/elasticsearch/reference/2.4/index-modules-store.html#file-system