2.3.3: heap usage too high, JVM old gen at 100%, no crash


#1

Elasticsearch version: 2.3.3

Plugins installed: []

JVM version: 1.8.0_91

OS version: Linux, 3.10.101 kernel

Description of the problem including expected versus actual behavior:
We deployed 12 ES nodes across 3 machines (4 per machine). We inserted 1 billion records into the cluster while also sending search queries to it. After 3 days, the nodes' JVM heaps started full-GCing frequently. The `jstat -gcutil` picture looks like this: 3.8 GB total ES_HEAP, of which 2.6 GB is the JVM old generation, and old gen usage is 100%.
We stopped all write and read requests, but old gen usage did not decrease; it stayed at 100%.
We cleared the ES caches, but old gen usage only dropped to 96%.
Steps to reproduce:
1. Use the config below.
2. Run inserts and queries concurrently.
3. After the data grows to 1 billion records, heap usage reaches 100%.

Provide logs (if relevant):
[2017-02-05 11:58:38,221][INFO ][monitor.jvm ] [xxx] [gc][old][1332853][268813] duration [5s], collections [1]/[5.1s], total [5s]/[17.2h], memory [3.8gb]->[2.8gb]/[3.8gb], all_pools {[young] [1gb]->[166.8mb]/[1gb]}{[survivor] [130.7mb]->[0b]/[136.5mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}
[2017-02-05 12:04:39,450][INFO ][monitor.jvm ] [xxx] [gc][old][1333207][268917] duration [6.2s], collections [1]/[6.5s], total [6.2s]/[17.2h], memory [3.8gb]->[2.8gb]/[3.8gb], all_pools {[young] [1gb]->[165.2mb]/[1gb]}{[survivor] [111mb]->[0b]/[136.5mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}
[2017-02-05 12:10:10,528][INFO ][monitor.jvm ] [xxx] [gc][old][1333531][269014] duration [6.1s], collections [1]/[6.7s], total [6.1s]/[17.2h], memory [3.8gb]->[2.8gb]/[3.8gb], all_pools {[young] [1gb]->[167.1mb]/[1gb]}{[survivor] [97.3mb]->[0b]/[136.5mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}
[2017-02-05 12:33:23,673][INFO ][monitor.jvm ] [xxx] [gc][old][1334903][269423] duration [5.1s], collections [1]/[5.3s], total [5.1s]/[17.3h], memory [3.8gb]->[2.8gb]/[3.8gb], all_pools {[young] [1gb]->[168.3mb]/[1gb]}{[survivor] [119.5mb]->[0b]/[136.5mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}
[2017-02-05 14:12:11,023][INFO ][monitor.jvm ] [xxx] [gc][old][1340737][271155] duration [5s], collections [1]/[5.8s], total [5s]/[17.4h], memory [3.8gb]->[2.8gb]/[3.8gb], all_pools {[young] [1gb]->[166.2mb]/[1gb]}{[survivor] [128.7mb]->[0b]/[136.5mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}
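Every one of these `[gc][old]` lines tells the same story; a small sketch that pulls the old-gen numbers out of one of the quoted lines (plain log parsing, no ES involved):

```python
import re

# One of the [monitor.jvm] lines quoted above, trimmed to the GC part.
line = ("[gc][old][1332853][268813] duration [5s], collections [1]/[5.1s], "
        "total [5s]/[17.2h], memory [3.8gb]->[2.8gb]/[3.8gb], "
        "all_pools {[young] [1gb]->[166.8mb]/[1gb]}"
        "{[survivor] [130.7mb]->[0b]/[136.5mb]}"
        "{[old] [2.6gb]->[2.6gb]/[2.6gb]}")

# Capture old-gen occupancy before the collection, after it, and its capacity.
m = re.search(r"\{\[old\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]\}", line)
before, after, capacity = m.groups()
print(before, after, capacity)  # 2.6gb 2.6gb 2.6gb
```

Young and survivor empty out on every collection, but the old generation stays pinned at its 2.6 GB capacity: a full GC is reclaiming essentially nothing there.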

Our elasticsearch.yml:
node.master: false
 node.data: true
 node.zone: findbugs_grp9
#
# Add custom attributes to the node:
#
# node.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
 path.data: /data1/es_Data_2017-01-20_164005/es_0/data
#
# Path to log files:
#
 path.logs: /data1/es_Data_2017-01-20_164005/es_0/logs
#
# ---------------------------------- indices -----------------------------------
 index.number_of_shards: 12
 index.number_of_replicas: 1
# index.refresh_interval: "10s"

# ----------------------------------- Threadpool -----------------------------------
 threadpool.bulk.queue_size: 400

# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
 bootstrap.mlockall: true
#
# Make sure that the `ES_HEAP_SIZE` environment variable is set to about half the memory
# available on the system and that the owner of the process is allowed to use this limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
 network.host: xxx
#
# Set a custom port for HTTP:
#
 http.port: 9200
 transport.tcp.port: 9300
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html>
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
 discovery.zen.ping.multicast.enabled: false
 discovery.zen.ping.unicast.hosts: ["xxx:9309","100.65.8.219:9309","10.231.135.40:9309"]
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
 discovery.zen.minimum_master_nodes: 2
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery.html>
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
 gateway.recover_after_data_nodes: 4
 #gateway.recover_after_nodes: 4
 gateway.expected_data_nodes: 8
 gateway.recover_after_time: 5m
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-gateway.html>
#
# ---------------------------------- Various -----------------------------------
#
# Disable starting multiple nodes on a single system:
#
# node.max_local_storage_nodes: 4
#
# Require explicit names when deleting indices:
#
# action.destructive_requires_name: true
#
# ---------------------------------- Cluster -----------------------------------
 cluster.routing.allocation.awareness.attributes: zone
 cluster.routing.allocation.awareness.force.zone.values: findbugs_grp9,findbugs_grp219,findbugs_grp40

(David Pilato) #2

A few questions:

  • How much heap is assigned to each node?
  • How much RAM on the physical machine?
  • Why 4 ES nodes per machine?

#3
  1. 3.8 GB heap for each node.
  2. 32 GB RAM on the physical machine.
  3. We ran some performance tests and settled on this layout.

I think one reason for this is that ES's various modules each consume some share of the heap, like the query cache, segment memory, etc. But when I tried to analyze that usage, I could not find where the rest of the heap had gone beyond the parts I can account for. Here are the ES stats:

  "indices": {
    "docs": {
      "count": 249754188,
      "deleted": 0
    },
    "store": {
      "size": "90.8gb",
      "size_in_bytes": 97541992687,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    },
    "indexing": {
      "index_total": 249754188,
      "index_time": "2.5d",
      "index_time_in_millis": 219188897,
      "index_current": 97,
      "index_failed": 0,
      "delete_total": 0,
      "delete_time": "0s",
      "delete_time_in_millis": 0,
      "delete_current": 0,
      "noop_update_total": 0,
      "is_throttled": false,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    },
    "get": {
      "total": 0,
      "get_time": "0s",
      "time_in_millis": 0,
      "exists_total": 0,
      "exists_time": "0s",
      "exists_time_in_millis": 0,
      "missing_total": 0,
      "missing_time": "0s",
      "missing_time_in_millis": 0,
      "current": 0
    },
    "search": {
      "open_contexts": 0,
      "query_total": 4496535,
      "query_time": "4.5d",
      "query_time_in_millis": 396361166,
      "query_current": 0,
      "fetch_total": 395280,
      "fetch_time": "2.4m",
      "fetch_time_in_millis": 145452,
      "fetch_current": 0,
      "scroll_total": 0,
      "scroll_time": "0s",
      "scroll_time_in_millis": 0,
      "scroll_current": 0
    },
    "merges": {
      "current": 0,
      "current_docs": 0,
      "current_size": "0b",
      "current_size_in_bytes": 0,
      "total": 1209185,
      "total_time": "7.9d",
      "total_time_in_millis": 686168988,
      "total_docs": 5901205860,
      "total_size": "2.2tb",
      "total_size_in_bytes": 2508011433969,
      "total_stopped_time": "0s",
      "total_stopped_time_in_millis": 0,
      "total_throttled_time": "1.1h",
      "total_throttled_time_in_millis": 4024204,
      "total_auto_throttle": "3.1gb",
      "total_auto_throttle_in_bytes": 3337141895
    },
    "refresh": {
      "total": 8746196,
      "total_time": "3.1d",
      "total_time_in_millis": 268609997
    },
    "flush": {
      "total": 18074,
      "total_time": "20m",
      "total_time_in_millis": 1203435
    },
    "warmer": {
      "current": 0,
      "total": 18696826,
      "total_time": "45.5m",
      "total_time_in_millis": 2733193
    },
    "query_cache": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "total_count": 666869563,
      "hit_count": 48269663,
      "miss_count": 618599900,
      "cache_size": 0,
      "cache_count": 1313121,
      "evictions": 1313121
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "percolate": {
      "total": 0,
      "time": "0s",
      "time_in_millis": 0,
      "current": 0,
      "memory_size_in_bytes": -1,
      "memory_size": "-1b",
      "queries": 0
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 3148,
      "memory": "640.1mb",
      "memory_in_bytes": 671294503,
      "terms_memory": "619mb",
      "terms_memory_in_bytes": 649123303,
      "stored_fields_memory": "16.3mb",
      "stored_fields_memory_in_bytes": 17176376,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "195kb",
      "norms_memory_in_bytes": 199680,
      "doc_values_memory": "4.5mb",
      "doc_values_memory_in_bytes": 4795144,
      "index_writer_memory": "0b",
      "index_writer_memory_in_bytes": 0,
      "index_writer_max_memory": "105.4mb",
      "index_writer_max_memory_in_bytes": 110592000,
      "version_map_memory": "0b",
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set": "0b",
      "fixed_bit_set_memory_in_bytes": 0
    },
    "translog": {
      "operations": 0,
      "size": "9kb",
      "size_in_bytes": 9288
    },
    "suggest": {
      "total": 0,
      "time": "0s",
      "time_in_millis": 0,
      "current": 0
    },
    "request_cache": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0,
      "hit_count": 0,
      "miss_count": 0
    },
    "recovery": {
      "current_as_source": 0,
      "current_as_target": 0,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    }
  },
  "os": {
    "timestamp": 1486293856899,
    "cpu_percent": 12,
    "load_average": 2.58,
    "mem": {
      "total": "62.2gb",
      "total_in_bytes": 66878009344,
      "free": "362mb",
      "free_in_bytes": 379645952,
      "used": "61.9gb",
      "used_in_bytes": 66498363392,
      "free_percent": 1,
      "used_percent": 99
    },
    "swap": {
      "total": "1.9gb",
      "total_in_bytes": 2139090944,
      "free": "1.4gb",
      "free_in_bytes": 1553444864,
      "used": "558.5mb",
      "used_in_bytes": 585646080
    }
  },
  "process": {
    "timestamp": 1486293856900,
    "open_file_descriptors": 5900,
    "max_file_descriptors": 204800,
    "cpu": {
      "percent": 2,
      "total": "27.5d",
      "total_in_millis": 2376716310
    },
    "mem": {
      "total_virtual": "69.3gb",
      "total_virtual_in_bytes": 74429259776
    }
  },
  "jvm": {
    "timestamp": 1486293856903,
    "uptime": "16.1d",
    "uptime_in_millis": 1392251394,
    "mem": {
      "heap_used": "3.4gb",
      "heap_used_in_bytes": 3712935160,
      "heap_used_percent": 89,
      "heap_committed": "3.8gb",
      "heap_committed_in_bytes": 4151836672,
      "heap_max": "3.8gb",
      "heap_max_in_bytes": 4151836672,
      "non_heap_used": "99mb",
      "non_heap_used_in_bytes": 103906640,
      "non_heap_committed": "100.8mb",
      "non_heap_committed_in_bytes": 105717760,
      "pools": {
        "young": {
          "used": "941.1mb",
          "used_in_bytes": 986898632,
          "max": "1gb",
          "max_in_bytes": 1145372672,
          "peak_used": "1gb",
          "peak_used_in_bytes": 1145372672,
          "peak_max": "1gb",
          "peak_max_in_bytes": 1145372672
        },
        "survivor": {
          "used": "612.9kb",
          "used_in_bytes": 627696,
          "max": "136.5mb",
          "max_in_bytes": 143130624,
          "peak_used": "136.5mb",
          "peak_used_in_bytes": 143130624,
          "peak_max": "136.5mb",
          "peak_max_in_bytes": 143130624
        },
        "old": {
          "used": "2.5gb",
          "used_in_bytes": 2725408832,
          "max": "2.6gb",
          "max_in_bytes": 2863333376,
          "peak_used": "2.6gb",
          "peak_used_in_bytes": 2863333376,
          "peak_max": "2.6gb",
          "peak_max_in_bytes": 2863333376
        }
      }
    },

#4
  "gc": {
      "collectors": {
        "young": {
          "collection_count": 135207,
          "collection_time": "2.4h",
          "collection_time_in_millis": 8988628
        },
        "old": {
          "collection_count": 275311,
          "collection_time": "17.7h",
          "collection_time_in_millis": 63848204
        }
      }
    },
    "buffer_pools": {
      "direct": {
        "count": 361,
        "used": "82.4mb",
        "used_in_bytes": 86474344,
        "total_capacity": "82.4mb",
        "total_capacity_in_bytes": 86474344
      },
      "mapped": {
        "count": 2352,
        "used": "51.5gb",
        "used_in_bytes": 55397961324,
        "total_capacity": "51.5gb",
        "total_capacity_in_bytes": 55397961324
      }
    }
  },
  
  
  "thread_pool": {
    "bulk": {
      "threads": 24,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 24,
      "completed": 249754481
    },
    "fetch_shard_started": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "fetch_shard_store": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "flush": {
      "threads": 1,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 5,
      "completed": 127619
    },
    "force_merge": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "generic": {
      "threads": 1,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 6,
      "completed": 1495587
    },
    "get": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "index": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "listener": {
      "threads": 10,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 10,
      "completed": 626
    },
    "management": {
      "threads": 5,
      "queue": 0,
      "active": 1,
      "rejected": 0,
      "largest": 5,
      "completed": 294173
    },
    "percolate": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "refresh": {
      "threads": 1,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 10,
      "completed": 8746323
    },
    "search": {
      "threads": 37,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 37,
      "completed": 5045401
    },
    "snapshot": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "suggest": {
      "threads": 0,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 0,
      "completed": 0
    },
    "warmer": {
      "threads": 1,
      "queue": 0,
      "active": 0,
      "rejected": 0,
      "largest": 5,
      "completed": 9933076
    }
  },
  "fs": {
    "timestamp": 1486293856903,
    "total": {
      "total": "6.4tb",
      "total_in_bytes": 7086847778816,
      "free": "3.5tb",
      "free_in_bytes": 3941738397696,
      "available": "3.2tb",
      "available_in_bytes": 3581721677824,
      "spins": "true"
    },
    "data": [
      {
        "path": "/data1/es_Data_2017-01-20_164005/es_0/data/xxx_findbugs/nodes/0",
        "mount": "/data1 (/dev/md0)",
        "type": "ext4",
        "total": "6.4tb",
        "total_in_bytes": 7086847778816,
        "free": "3.5tb",
        "free_in_bytes": 3941738397696,
        "available": "3.2tb",
        "available_in_bytes": 3581721677824,
        "spins": "true"
      }
    ]
  },
  "transport": {
    "server_open": 117,
    "rx_count": 448023685,
    "rx_size": "109gb",
    "rx_size_in_bytes": 117042946543,
    "tx_count": 445498089,
    "tx_size": "96.4gb",
    "tx_size_in_bytes": 103534170823
  },
  "http": {
    "current_open": 12,
    "total_opened": 106093
  },
  "breakers": {
    "request": {
      "limit_size_in_bytes": 1660734668,
      "limit_size": "1.5gb",
      "estimated_size_in_bytes": 0,
      "estimated_size": "0b",
      "overhead": 1,
      "tripped": 0
    },
    "fielddata": {
      "limit_size_in_bytes": 2491102003,
      "limit_size": "2.3gb",
      "estimated_size_in_bytes": 0,
      "estimated_size": "0b",
      "overhead": 1.03,
      "tripped": 0
    },
    "parent": {
      "limit_size_in_bytes": 2906285670,
      "limit_size": "2.7gb",
      "estimated_size_in_bytes": 0,
      "estimated_size": "0b",
      "overhead": 1,
      "tripped": 0
    }
  },
  "script": {
    "compilations": 0,
    "cache_evictions": 0
  }
}
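A quick way to see the poster's point is to total the on-heap consumers these stats actually report and compare against `heap_used`. A minimal sketch using the byte values copied from the dump above:

```python
# Byte values copied from the node stats above.
heap_used       = 3_712_935_160  # jvm.mem.heap_used_in_bytes
segments_memory =   671_294_503  # indices.segments.memory_in_bytes
index_writer    = 0              # segments.index_writer_memory_in_bytes
version_map     = 0              # segments.version_map_memory_in_bytes
fixed_bit_set   = 0              # segments.fixed_bit_set_memory_in_bytes
query_cache     = 0              # query_cache.memory_size_in_bytes
fielddata       = 0              # fielddata.memory_size_in_bytes
request_cache   = 0              # request_cache.memory_size_in_bytes

accounted = (segments_memory + index_writer + version_map + fixed_bit_set
             + query_cache + fielddata + request_cache)
unaccounted = heap_used - accounted
print(f"accounted:   {accounted / 2**30:.2f} GiB")    # ~0.63 GiB
print(f"unaccounted: {unaccounted / 2**30:.2f} GiB")  # ~2.83 GiB
```

So only about 0.6 GiB of the 3.4 GiB heap is explained by segment memory and the caches; the remaining ~2.8 GiB is simply not itemized by these stats (transient indexing/search objects, cluster state, and whatever the collector has not reclaimed), which is exactly the gap the poster could not attribute.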

(David Pilato) #5

Just create one node per machine, with 16 GB of heap.


#6

Hi dadoonet, thanks for your answer. :slight_smile:
Could you tell me why? I tried looking into the ES implementation and the reference manual, but I could not find the answer there. Some people tell me to check ES's cache usage, fielddata usage, segment memory, and so on.
I have some questions:
Which parts of ES consume heap? Can you point me to information about this? Is there an ES API (like /_cluster/stats?) that reports all the heap consumers?
After I "just create one node per machine with 16 GB of heap", can this case happen again?
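On the API question: there is no single endpoint that itemizes every heap consumer, but on 2.x a few stats calls together cover the measurable ones. A sketch listing them (the base URL is a placeholder; fetch each path with curl or any HTTP client against a live node):

```python
# Hypothetical base URL; point it at one of your nodes.
es = "http://localhost:9200"

# 2.x endpoints that together cover the measurable heap consumers:
endpoints = [
    # segment memory plus the three caches, per node
    "/_nodes/stats/indices/segments,query_cache,fielddata,request_cache?human",
    # heap used/max and pool breakdown, per node
    "/_nodes/stats/jvm?human",
    # per-segment heap use across the cluster
    "/_cat/segments?v&h=index,shard,segment,size.memory",
]
for path in endpoints:
    print(es + path)
```

Anything not reported by these (request buffers, in-flight bulk payloads, cluster state, etc.) shows up only as the difference between their totals and `heap_used`.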


(David Pilato) #7

Why?

Because it's not good practice unless you have more than 64 GB of RAM on your machine.

Elasticsearch will make better use of your resources if you let it do the job.

I can't tell whether it will definitely fix your issue, but I'd guess it can't be worse and will probably be better.
Give it a try.
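For reference, on 2.x the heap size comes from the `ES_HEAP_SIZE` environment variable, set before the process starts. An illustrative sketch (in practice you would export the variable in your shell or init script; 16g assumes a 32 GB machine, following the half-of-RAM rule):

```python
import os

# Half of the machine's 32 GB of RAM, and safely below ~30.5 GB so the
# JVM keeps using compressed object pointers.
# From a shell this would be:  export ES_HEAP_SIZE=16g && bin/elasticsearch -d
os.environ["ES_HEAP_SIZE"] = "16g"
print("ES_HEAP_SIZE =", os.environ["ES_HEAP_SIZE"])
```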


#8

Thanks very much!
I will give it a try.

Meanwhile, I would really like to understand why this solves the problem (why the JVM's old gen is at 100%).
Maybe I need to learn more about ES/Lucene.


(David Pilato) #9

How many shards/indices do you have?


#10

77 indices, 1848 shards, 10 billion records in total.
Thanks for the reply :slight_smile:
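A back-of-the-envelope sketch of what those counts mean per node (assuming the 1848 figure already includes replicas and that shards are spread evenly, as the allocation-awareness config suggests):

```python
total_shards = 1848   # from the reply above (assumed to include replicas)
nodes        = 12
heap_gb      = 3.8    # ES_HEAP per node, from earlier in the thread

per_node = total_shards / nodes
per_shard_mb = heap_gb * 1024 / per_node
print(f"{per_node:.0f} shards per node")           # 154 shards per node
print(f"{per_shard_mb:.1f} MB of heap per shard")  # ~25.3 MB
```

At roughly 25 MB of heap per shard there is very little headroom once segment terms, indexing buffers, and search state are added, which is why the advice below is fewer shards and a bigger heap.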


(David Pilato) #11

It's totally normal to run low on memory with so many shards and documents and so little heap.

Reduce the number of shards, increase your heap, and see how it goes.

If you can't reduce the number of shards, start more nodes on more machines.


#12

Thanks. To fix an error in my earlier reply: "77 indices, 1848 shards, total records (10 billion)" should be 1 billion.
The problem is:

  1. Our machines have 5 TB of disk each, so each machine will grow to 10 billion records in the future. If heap usage scales linearly with document count, does that mean we cannot store that many records, even though most of them are cold data that is rarely accessed?

  2. I tried closing 1/4 of the indices, and old gen usage decreased by roughly 25%. If I then run only queries, with no insert or update requests, old gen usage slowly climbs back to ~100% (over about one day at 10 req/s).

In my opinion, even though we have so many indices and documents on disk, shouldn't ES use less heap, and release the memory after I stop all requests?


(David Pilato) #13

What is the current status? I'm having a hard time following.

Did you:

  • Start only one node per machine?
  • Set the heap size to 16 GB?
  • Reduce the total number of shards?

If not, please do that before jumping to conclusions.

Having memory used by Elasticsearch is fine. Don't expect the heap to be released. I'd just make sure that you never go above 80%.


#14

Thanks!


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.