Elasticsearch memory consumption increasing too much in ELK 7

Hi,
I have upgraded ELK from version 6.6.1 to 7.0.1 in a Kubernetes environment. There are 11 nodes in the cluster - 3 master pods, 3 data pods & 5 client pods. The memory consumption of Elasticsearch keeps increasing indefinitely.
JVM heap configured for each type of pod:

master: -Xms1g -Xmx1g
client: -Xms16g -Xmx16g
data:   -Xms12g -Xmx12g
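
The heap each node actually picked up can be cross-checked against these settings with the nodes info API, along the lines of (same endpoint and credentials as the commands below):

$ curl -k 'https://elasticsearch.default.svc.cluster.local:9200/_nodes/jvm?pretty' -uadmin:admin | grep -E '"name"|heap_init|heap_max'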

Cluster health

$ curl -k 'https://elasticsearch.default.svc.cluster.local:9200/_cluster/health?pretty' -uadmin:admin
{
  "cluster_name" : "elk-efkc",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 307,
  "active_shards" : 556,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 58,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 90.55374592833876
}

Cluster Nodes

$ curl -k 'https://elasticsearch.default.svc.cluster.local:9200/_cat/nodes?v' -uadmin:admin
ip              heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.116.176           70          99  15    3.03    4.30     4.79 di        -      elk-efkc-elk-elasticsearch-data-2
192.168.126.80            72          87   9    1.12    1.07     1.19 i         -      elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-kklsp
192.168.30.247            14          93  21    4.56    5.36     5.56 mi        -      elk-efkc-elk-elasticsearch-master-1
192.168.30.253            79          93  21    4.28    5.29     5.54 di        -      elk-efkc-elk-elasticsearch-data-1
192.168.225.151           75          93  16    3.99    4.50     4.90 di        -      elk-efkc-elk-elasticsearch-data-0
192.168.147.85            17          23   6    1.57    1.55     1.28 mi        -      elk-efkc-elk-elasticsearch-master-0
192.168.27.218            39          68  10    2.12    2.15     2.04 mi        *      elk-efkc-elk-elasticsearch-master-2
192.168.27.217            73          68  10    2.12    2.15     2.04 i         -      elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-db9nd
192.168.147.51            67          56  15    2.28    2.43     2.48 i         -      elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-wz5kn
192.168.7.143             71          59  13    1.19    1.69     1.88 i         -      elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-7vkdf
192.168.250.158           70          67   8    1.04    1.05     1.16 i         -      elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-v445g

Memory utilization of pods

$ kubectl top pods
NAME                                                         CPU(cores)   MEMORY(bytes)
elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-7vkdf           1140m        21131Mi
elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-db9nd           759m         23241Mi
elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-kklsp           489m         24213Mi
elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-v445g           353m         22417Mi
elk-efkc-elk-elasticsearch-client-5d4c8b9f8f-wz5kn           1436m        18741Mi
elk-efkc-elk-elasticsearch-data-0                            1804m        28096Mi
elk-efkc-elk-elasticsearch-data-1                            1564m        28941Mi
elk-efkc-elk-elasticsearch-data-2                            1810m        29922Mi
elk-efkc-elk-elasticsearch-exporter-768fb678b9-tvtm9         2m           53Mi
elk-efkc-elk-elasticsearch-master-0                          4m           1328Mi
elk-efkc-elk-elasticsearch-master-1                          4m           1318Mi
elk-efkc-elk-elasticsearch-master-2                          15m          1371Mi

Cluster allocation

$ curl -k 'https://elasticsearch.paas.svc.cluster.local:9200/_cat/allocation?v' -uadmin:admin
    shards disk.indices disk.used disk.avail disk.total disk.percent host            ip              node
       192       45.5gb    70.5gb    129.3gb    199.9gb           35 192.168.225.151 192.168.225.151 elk-efkc-elk-elasticsearch-data-0
       187       48.1gb    64.3gb    135.5gb    199.9gb           32 192.168.30.253  192.168.30.253  elk-efkc-elk-elasticsearch-data-1
       183       21.4gb    43.6gb    156.2gb    199.9gb           21 192.168.116.176 192.168.116.176 elk-efkc-elk-elasticsearch-data-2
        58                                                                                           UNASSIGNED

I am facing this issue only after the upgrade to ELK 7.0.1; pod memory keeps growing, even up to 30 GB. I see CircuitBreakingException errors like the following in the data pods:

    log":"[[raghu-impact-log-2019.09.03][0]] failed to perform indices:data/write/bulk[s] on replica [raghu-impact-log-2019.09.03][0], node[8aeYmmTUSdKTt6FVY7_Lew], [R], s[STARTED], a[id=A8L0r5ugQWyDgK1kYQjgrw]"}
org.elasticsearch.transport.RemoteTransportException: [elk-efkc-elk-elasticsearch-data-2][192.168.116.176:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [16544981698/15.4gb], which is larger than the limit of [16304314777/15.1gb], real usage: [16544880096/15.4gb], new bytes reserved: [101602/99.2kb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1026) [elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:922) [elasticsearch-7.0.1.jar:7.0.1]

There are 58 unassigned shards - the allocation explanation for the unassigned shards also shows the same CircuitBreakingException.
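
For reference, the per-shard explanation can be pulled with the cluster allocation explain API, e.g.:

$ curl -k 'https://elasticsearch.default.svc.cluster.local:9200/_cluster/allocation/explain?pretty' -uadmin:admin

(With no request body this returns the explanation for an arbitrary unassigned shard.)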

This is my GC configuration:

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly 

What could be the reason for Elasticsearch using so much memory, and what would be the way to control it?

Thanks,
Shivani

Node stats for one Elasticsearch data pod:
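
(Collected with the node stats API, roughly as follows; the output below is trimmed to the relevant sections.)

$ curl -k 'https://elasticsearch.default.svc.cluster.local:9200/_nodes/elk-efkc-elk-elasticsearch-data-0/stats?pretty' -uadmin:admin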

      "5EkiLFfNQ4OSMXleagH3TA" : {
  "timestamp" : 1575121978314,
  "name" : "elk-efkc-elk-elasticsearch-data-0",
  "transport_address" : "192.168.225.151:9300",
  "host" : "192.168.225.151",
  "ip" : "192.168.225.151:9300",
  "roles" : [
    "data",
    "ingest"
  ],
  "indices" : {
    "docs" : {
      "count" : 493269950,
      "deleted" : 0
    },
    "store" : {
      "size_in_bytes" : 52703297185
    },
    "indexing" : {
      "index_total" : 1478966250,
      "index_time_in_millis" : 289743332,
      "index_current" : 8,
      "index_failed" : 0,
      "delete_total" : 0,
      "delete_time_in_millis" : 0,
      "delete_current" : 0,
      "noop_update_total" : 0,
      "is_throttled" : false,
      "throttle_time_in_millis" : 0
    },
    "segments" : {
      "count" : 1277,
      "memory_in_bytes" : 201728207,
      "terms_memory_in_bytes" : 159358723,
      "stored_fields_memory_in_bytes" : 35436464,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1394752,
      "points_memory_in_bytes" : 5077168,
      "doc_values_memory_in_bytes" : 461100,
      "index_writer_memory_in_bytes" : 8625512,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1575118013485,
      "file_sizes" : { }
    },
    "translog" : {
      "operations" : 40079980,
      "size_in_bytes" : 18363882056,
      "uncommitted_operations" : 10606046,
      "uncommitted_size_in_bytes" : 4893639256,
      "earliest_last_modified_age" : 0
    },
    "request_cache" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0,
      "hit_count" : 0,
      "miss_count" : 0
    },
    "recovery" : {
      "current_as_source" : 0,
      "current_as_target" : 0,
      "throttle_time_in_millis" : 1122504
    }
  },
  "os" : {
    "timestamp" : 1575121978785,
    "cpu" : {
      "percent" : 16,
      "load_average" : {
        "1m" : 5.67,
        "5m" : 5.2,
        "15m" : 4.83
      }
    },
    "mem" : {
      "total_in_bytes" : 50475667456,
      "free_in_bytes" : 394878976,
      "used_in_bytes" : 50080788480,
      "free_percent" : 1,
      "used_percent" : 99
    },
    "swap" : {
      "total_in_bytes" : 0,
      "free_in_bytes" : 0,
      "used_in_bytes" : 0
    },
    "cgroup" : {
      "cpuacct" : {
        "control_group" : "/",
        "usage_nanos" : 315746857027626
      },
      "cpu" : {
        "control_group" : "/",
        "cfs_period_micros" : 100000,
        "cfs_quota_micros" : 200000,
        "stat" : {
          "number_of_elapsed_periods" : 1750785,
          "number_of_times_throttled" : 1361711,
          "time_throttled_nanos" : 111627479383624
        }
      },
      "memory" : {
        "control_group" : "/",
        "limit_in_bytes" : "9223372036854771712",
        "usage_in_bytes" : "41915486208"
      }
    }
  },
  "process" : {
    "timestamp" : 1575121978785,
    "open_file_descriptors" : 3429,
    "max_file_descriptors" : 1048576,
    "cpu" : {
      "percent" : 12,
      "total_in_millis" : 315664090
    },
    "mem" : {
      "total_virtual_in_bytes" : 46222389248
    }
  },
  "jvm" : {
    "timestamp" : 1575121978787,
    "uptime_in_millis" : 175727892,
    "mem" : {
      "heap_used_in_bytes" : 11549223304,
      "heap_used_percent" : 67,
      "heap_committed_in_bytes" : 17162436608,
      "heap_max_in_bytes" : 17162436608,
      "non_heap_used_in_bytes" : 140710464,
      "non_heap_committed_in_bytes" : 165888000,
      "pools" : {
        "young" : {
          "used_in_bytes" : 8649600,
          "max_in_bytes" : 139591680,
          "peak_used_in_bytes" : 139591680,
          "peak_max_in_bytes" : 139591680
        },
        "survivor" : {
          "used_in_bytes" : 15328776,
          "max_in_bytes" : 17432576,
          "peak_used_in_bytes" : 17432576,
          "peak_max_in_bytes" : 17432576
        },
        "old" : {
          "used_in_bytes" : 11525266336,
          "max_in_bytes" : 17005412352,
          "peak_used_in_bytes" : 16995926272,
          "peak_max_in_bytes" : 17005412352
        }
      }
    },
    "threads" : {
      "count" : 56,
      "peak_count" : 124
    },
    "gc" : {
      "collectors" : {
        "young" : {
          "collection_count" : 384559,
          "collection_time_in_millis" : 44552874
        },
        "old" : {
          "collection_count" : 2805,
          "collection_time_in_millis" : 12028517
        }
      }
    },
    "buffer_pools" : {
      "mapped" : {
        "count" : 1800,
        "used_in_bytes" : 17668608938,
        "total_capacity_in_bytes" : 17668608938
      },
      "direct" : {
        "count" : 53,
        "used_in_bytes" : 252111096,
        "total_capacity_in_bytes" : 252111095
      }
    },
    "classes" : {
      "current_loaded_count" : 14633,
      "total_loaded_count" : 14790,
      "total_unloaded_count" : 157
    }
  },
  "thread_pool" : {
    "analyze" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },
    "fetch_shard_started" : {
      "threads" : 1,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 4,
      "completed" : 643
    },
    "fetch_shard_store" : {
      "threads" : 1,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 4,
      "completed" : 13956
    },
    "flush" : {
      "threads" : 1,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 1,
      "completed" : 12944
    },
    "force_merge" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },
    "generic" : {
      "threads" : 26,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 91,
      "completed" : 1204816
    },
    "get" : {
      "threads" : 2,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 2,
      "completed" : 3
    },
    "listener" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },
    "management" : {
      "threads" : 5,
      "queue" : 0,
      "active" : 1,
      "rejected" : 0,
      "largest" : 5,
      "completed" : 1971753
    },
    "refresh" : {
      "threads" : 1,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 1,
      "completed" : 21774006
    },
    "search" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },
    "search_throttled" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },
    "snapshot" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },
    "warmer" : {
      "threads" : 1,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 1,
      "completed" : 2
    },
    "write" : {
      "threads" : 2,
      "queue" : 202,
      "active" : 2,
      "rejected" : 45370,
      "largest" : 2,
      "completed" : 45903
    }
  },
  "fs" : {
    "timestamp" : 1575121978788,
    "total" : {
      "total_in_bytes" : 214643507200,
      "free_in_bytes" : 135347372032,
      "available_in_bytes" : 135347372032
    },
    "data" : [
      {
        "path" : "/data/data/nodes/0",
        "mount" : "/data (/dev/rbd2)",
        "type" : "xfs",
        "total_in_bytes" : 214643507200,
        "free_in_bytes" : 135347372032,
        "available_in_bytes" : 135347372032
      }
    ],
    "io_stats" : {
      "devices" : [
        {
          "device_name" : "rbd2",
          "operations" : 13780090,
          "read_operations" : 8190000,
          "write_operations" : 5590090,
          "read_kilobytes" : 115654702,
          "write_kilobytes" : 1420298697
        }
      ],
      "total" : {
        "operations" : 13780090,
        "read_operations" : 8190000,
        "write_operations" : 5590090,
        "read_kilobytes" : 115654702,
        "write_kilobytes" : 1420298697
      }
    }
  },
  "breakers" : {
    "request" : {
      "limit_size_in_bytes" : 10297461964,
      "limit_size" : "9.5gb",
      "estimated_size_in_bytes" : 0,
      "estimated_size" : "0b",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "fielddata" : {
      "limit_size_in_bytes" : 6864974643,
      "limit_size" : "6.3gb",
      "estimated_size_in_bytes" : 0,
      "estimated_size" : "0b",
      "overhead" : 1.03,
      "tripped" : 0
    },
    "in_flight_requests" : {
      "limit_size_in_bytes" : 17162436608,
      "limit_size" : "15.9gb",
      "estimated_size_in_bytes" : 2666534901,
      "estimated_size" : "2.4gb",
      "overhead" : 2.0,
      "tripped" : 0
    },
    "accounting" : {
      "limit_size_in_bytes" : 17162436608,
      "limit_size" : "15.9gb",
      "estimated_size_in_bytes" : 201738011,
      "estimated_size" : "192.3mb",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "parent" : {
      "limit_size_in_bytes" : 16304314777,
      "limit_size" : "15.1gb",
      "estimated_size_in_bytes" : 11550036984,
      "estimated_size" : "10.7gb",
      "overhead" : 1.0,
      "tripped" : 23032
    }
  }
},
