Elasticsearch heap issues

(Suresh Itha) #1

Hi Team,

We are facing heap issues on all the data nodes in Elasticsearch version 1.7.3. Please find below the steps we have implemented in our cluster.

Step-1: We have 16 data nodes in total, and each node runs 3 instances (data1, data2 and data3), giving 48 data instances overall, plus 3 masters and 16 separate ingest (search) nodes. All the data nodes are bare metal, and each node has a 7.1TB disk.
Filesystem                                         Size  Used Avail Use% Mounted on
/dev/sdi2                                          132G   16G  110G  13% /
devtmpfs                                           252G     0  252G   0% /dev
tmpfs                                              252G     0  252G   0% /dev/shm
tmpfs                                              252G   26M  252G   1% /run
tmpfs                                              252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/Source--ES--eph-volume--367978823--14  7.0T  642G  6.4T   9% /app
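For clarity, the arithmetic behind this layout can be sketched as follows (a rough back-of-the-envelope check; the instance and heap figures are taken from the steps in this post, nothing here is measured from the cluster itself):

```python
# Back-of-the-envelope arithmetic for the cluster layout described above.
data_nodes = 16
instances_per_node = 3        # data1, data2, data3
heap_per_instance_gb = 30     # -Xms30g / -Xmx30g, see Step-2

# Total data instances across the cluster
data_instances = data_nodes * instances_per_node
print(data_instances)         # -> 48

# Heap committed on each bare-metal machine; with -Xms equal to -Xmx
# (and bootstrap.mlockall in Step-3) this is pinned from startup.
heap_per_node_gb = instances_per_node * heap_per_instance_gb
print(heap_per_node_gb)       # -> 90 (GB, out of ~503 GB RAM per node)
```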

Step-2: Please find the ES process configuration below.

elastic+ 13580 1 48 Dec05 ? 11:42:26 /bin/java -Xms30g -Xmx30g -Djava.awt.headless=true -XX:+UseG1GC -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:+UseCompressedOops -XX:MaxGCPauseMillis=200 -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=20 -XX:+UseStringDeduplication -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=3335 -Des.max-open-files=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/share/elasticsearch/logs/heapdump.hprof -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Delasticsearch -Des.foreground=yes -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-1.7.3.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/ -Des.pidfile=/var/run/elasticsearch/ -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/usr/local/var/log/elasticsearch/ -Des.default.path.data=/app/data/elasticsearch/ -Des.default.path.conf=/etc/elasticsearch/data1 org.elasticsearch.bootstrap.Elasticsearch


Additional Java OPTS

es_java_opts: "$ES_JAVA_OPTS -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port= -Des.max-open-files=true",

ES_GC_OPTS="-XX:+UseG1GC -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:+UseCompressedOops -XX:MaxGCPauseMillis=200 -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=20 -XX:+UseStringDeduplication -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=3335 -Des.max-open-files=true"
export ES_GC_OPTS

Step-3: Please find the settings.

action.auto_create_index: true
action.destructive_requires_name: true
action.disable_delete_all_indices: true
bootstrap.mlockall: true
cluster.name: Cluster_name
cluster.routing.allocation.same_shard.host: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: master_node1:9301,master_node2:9301,master_node3:9301
http.port: 9202
index.mapper.dynamic: true
index.merge.policy.use_compound_file: false
index.number_of_replicas: 0
index.number_of_shards: 96
index.query.bool.max_clause_count: 10000
index.refresh_interval: 1000s
indices.fielddata.cache.size: 10%
indices.recovery.max_bytes_per_sec: 60mb
node.data: true
node.master: false
script.inline: false
script.stored: false
script.file: false
script.groovy.sandbox.enabled: false
threadpool.bulk.queue_size: 300
threadpool.index.queue_size: 300
transport.tcp.port: 9302
threadpool.bulk.size: 60
threadpool.bulk.type: fixed
threadpool.index.size: 60
threadpool.index.type: fixed
threadpool.search.queue_size: 400
threadpool.search.size: 60
threadpool.search.type: fixed
discovery.zen.fd.ping_timeout: 180s
discovery.zen.fd.ping_interval: 60s
discovery.zen.fd.ping_retries: 3
indices.cluster.send_refresh_mapping: false
index.merge.policy.max_merge_at_once: 10
index.merge.policy.reclaim_deletes_weight: 2.0
index.merge.policy.max_merged_segment: 5GB
index.merge.policy.expunge_deletes_allowed: 10
index.merge.policy.segments_per_tier: 10
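To illustrate what these per-index defaults mean per instance (a simple sketch, assuming every index is created with the 96-shard / 0-replica defaults above; actual per-instance counts depend on allocation):

```python
# What the index settings above translate to per data instance.
shards_per_index = 96         # index.number_of_shards
replicas = 0                  # index.number_of_replicas
data_instances = 48

total_shards = shards_per_index * (1 + replicas)
shards_per_instance = total_shards / data_instances
print(shards_per_instance)    # -> 2.0 shards of each index per instance

# indices.fielddata.cache.size: 10% of each 30 GB heap
heap_gb = 30
fielddata_cap_gb = 0.10 * heap_gb
print(fielddata_cap_gb)       # -> 3.0 GB fielddata cap per instance
```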

Step-4: Limits configuration setting under /etc/security/limits.conf

End of file


*      soft     nproc          65535
*      hard     nproc          65535
*      soft     nofile         65535
*      hard     nofile         65535

elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
elasticsearch soft nproc 65535
elasticsearch hard nproc 65535
elasticsearch soft nofile 65535
elasticsearch hard nofile 65535
app soft nofile 16384
app hard nofile 16384

We have checked sestatus; SELinux is already disabled on all the nodes (we are using CentOS 7).
SELinux status: disabled


free -g
              total        used        free      shared  buff/cache   available
Mem:            503          91         410           0           1         410
Swap:             0           0           0
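The "used" figure here lines up with the JVM settings (a rough reconciliation, assuming each instance commits its full heap at startup because -Xms equals -Xmx and bootstrap.mlockall is enabled):

```python
# Reconciling `free -g` with the JVM settings: -Xms30g plus mlockall
# means each instance commits its full heap up front.
instances_per_node = 3
heap_per_instance_gb = 30

committed_heap_gb = instances_per_node * heap_per_instance_gb
print(committed_heap_gb)   # -> 90 (GB), close to the ~91 GB shown as "used"
</parameter>```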

We are also dropping the caches every 5 minutes via cron:
#Drop the page cache
*/5 * * * * sync; echo 1 > /proc/sys/vm/drop_caches

We implemented all the steps above, but the data nodes are still using 24 to 25GB out of the 30GB heap (over 80%) at all times; GC is not releasing the memory, the cluster becomes red, and nodes go down.

Please suggest anything we have missed in the settings and configuration to fix this heap issue.

(Christian Dahlqvist) #2

Even though that is a very old version, I believe a lot of what is covered in this webinar should still be applicable.