Elasticsearch - Garbage Collection Issues

Our Elasticsearch cluster is experiencing quite a few issues with garbage collection and "stop the world" events.

My questions:

  • Is there any benefit to using Oracle Java vs OpenJDK? I am currently using OpenJDK.
  • Can each node run a different type of garbage collector?
  • Do I need more nodes? (I think so, but...)
  • Should I be using more shards? I kept the count low due to the low number of nodes.

I have a 7-node cluster running Elasticsearch 5.6.

  • 3 Masters - 2 CPU with 10 GB RAM - Heap 4 GB
  • 2 Data - 16 CPU with 64 GB RAM - 45 TB ZFS data storage - Heap 28 GB
  • 1 Data - 16 CPU with 32 GB RAM - 45 TB ZFS data storage - Heap 20 GB
  • 1 API - 2 CPU with 10 GB RAM - Heap 4 GB

The cluster is running behind a Graylog cluster. The data is as follows:

  • 1,168 Indexes
  • 10.8 TB data
  • 15,633,087,989 documents
  • 14 shards per index
  • Indexes are rotated daily; each index is around 500 GB in size (sizes can be double-checked with the cat API call below).
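
For reference, per-index sizes can be double-checked straight from the cat API. The graylog_* pattern below assumes the default Graylog index prefix, so adjust it if yours differs:

GET _cat/indices/graylog_*?v&h=index,pri,rep,docs.count,pri.store.size,store.size&s=store.size:desc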

All nodes are using the CMS collector at this time except one data node, which is using G1GC.
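
For context, the collector is set per node in each node's jvm.options, which is how the mixed setup above is running. Roughly, the two variants look like the sketch below (heap sizes shown for the 28 GB data nodes; the G1 pause target is just a placeholder, not a tuned value):

# jvm.options on the CMS nodes (Elasticsearch 5.x defaults)
-Xms28g
-Xmx28g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

# jvm.options on the G1 node (CMS lines removed, G1 enabled instead)
-Xms28g
-Xmx28g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=400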

I am constantly seeing one or more data nodes start to garbage collect heavily. This causes long timeouts, and eventually the node runs out of memory.

[2019-01-29T17:30:04,764][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6355] overhead, spent [17.7s] collecting in the last [18.2s]
[2019-01-29T17:30:21,462][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6358][494] duration [13.6s], collections [1]/[14.4s], total [13.6s]/[1.4h], memory [25.7gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[592.7mb]/[1.4gb]}{[survivor] [102.9mb]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:30:21,463][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6358] overhead, spent [13.6s] collecting in the last [14.4s]
[2019-01-29T17:30:36,172][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6361][495] duration [11.7s], collections [1]/[12.4s], total [11.7s]/[1.4h], memory [25.7gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[623mb]/[1.4gb]}{[survivor] [116.4mb]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:30:36,174][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6361] overhead, spent [11.7s] collecting in the last [12.4s]
[2019-01-29T17:30:55,435][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6363][496] duration [17.4s], collections [1]/[18.2s], total [17.4s]/[1.4h], memory [25.7gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[611.3mb]/[1.4gb]}{[survivor] [75.6mb]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:30:55,436][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6363] overhead, spent [17.4s] collecting in the last [18.2s]
[2019-01-29T17:30:56,442][INFO ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6364] overhead, spent [276ms] collecting in the last [1s]
[2019-01-29T17:31:15,753][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6366][497] duration [17.6s], collections [1]/[18.3s], total [17.6s]/[1.4h], memory [25.6gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[634.3mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:31:15,754][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6366] overhead, spent [17.6s] collecting in the last [18.3s]
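
While this is happening, per-node heap can be watched from the cat API; something like the following (columns are just a suggestion) lines up with the GC log entries above, where old gen is sitting at 24.1gb out of 24.1gb:

GET _cat/nodes?v&h=name,node.role,heap.percent,heap.current,heap.max,ram.percent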

I have posted the cluster stats below in case they help.

Cluster Stats
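
(Pulled with the cluster stats API; the human flag is what adds the readable size fields next to the raw byte counts.)

GET _cluster/stats?human&pretty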

{
  "_nodes" : {
    "total" : 7,
    "successful" : 7,
    "failed" : 0
  },
  "cluster_name" : "graylog01",
  "timestamp" : 1548785990115,
  "status" : "yellow",
  "indices" : {
    "count" : 928,
    "shards" : {
      "total" : 6225,
      "primaries" : 5566,
      "replication" : 0.11839741286381603,
      "index" : {
        "shards" : {
          "min" : 5,
          "max" : 10,
          "avg" : 6.707974137931035
        },
        "primaries" : {
          "min" : 5,
          "max" : 6,
          "avg" : 5.997844827586207
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.6666666666666666,
          "avg" : 0.11835488505747124
        }
      }
    },
    "docs" : {
      "count" : 19268891934,
      "deleted" : 2003804
    },
    "store" : {
      "size" : "14.9tb",
      "size_in_bytes" : 16472553443386,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "29.1mb",
      "memory_size_in_bytes" : 30527341,
      "total_count" : 603934,
      "hit_count" : 746,
      "miss_count" : 603188,
      "cache_size" : 301,
      "cache_count" : 327,
      "evictions" : 26
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 11877,
      "memory" : "28.7gb",
      "memory_in_bytes" : 30902559251,
      "terms_memory" : "22.7gb",
      "terms_memory_in_bytes" : 24386771683,
      "stored_fields_memory" : "4.8gb",
      "stored_fields_memory_in_bytes" : 5189843120,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "2.8mb",
      "norms_memory_in_bytes" : 3028288,
      "points_memory" : "1gb",
      "points_memory_in_bytes" : 1100915820,
      "doc_values_memory" : "211.7mb",
      "doc_values_memory_in_bytes" : 222000340,
      "index_writer_memory" : "61.6mb",
      "index_writer_memory_in_bytes" : 64637920,
      "version_map_memory" : "2.4mb",
      "version_map_memory_in_bytes" : 2543112,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 9223372036854775807,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 7,
      "data" : 3,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 7
    },
    "versions" : [
      "5.6.14",
      "5.6.9",
      "5.6.13"
    ],
    "os" : {
      "available_processors" : 104,
      "allocated_processors" : 104,
      "names" : [
        {
          "name" : "Linux",
          "count" : 7
        }
      ],
      "mem" : {
        "total" : "222.1gb",
        "total_in_bytes" : 238494019584,
        "free" : "14.2gb",
        "free_in_bytes" : 15324508160,
        "used" : "207.8gb",
        "used_in_bytes" : 223169511424,
        "free_percent" : 6,
        "used_percent" : 94
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 11
      },
      "open_file_descriptors" : {
        "min" : 273,
        "max" : 8620,
        "avg" : 2864
      }
    },
    "jvm" : {
      "max_uptime" : "19h",
      "max_uptime_in_millis" : 68517319,
      "versions" : [
        {
          "version" : "1.8.0_191",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.191-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 7
        }
      ],
      "mem" : {
        "heap_used" : "54.8gb",
        "heap_used_in_bytes" : 58900967760,
        "heap_max" : "86.7gb",
        "heap_max_in_bytes" : 93145202688
      },
      "threads" : 1049
    },
    "fs" : {
      "total" : "126.6tb",
      "total_in_bytes" : 139268324581376,
      "free" : "101.6tb",
      "free_in_bytes" : 111729502023680,
      "available" : "101.6tb",
      "available_in_bytes" : 111715770298368,
      "spins" : "true"
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 7
      },
      "http_types" : {
        "netty4" : 7
      }
    }
  }
}

As far as I can tell from the statistics, you have a number of problems:

  1. You have 3 different versions of Elasticsearch in the cluster. This can cause problems, as shards cannot be relocated from a node on a newer version to a node on an older one (the _cat/nodes check below shows the versions at a glance).
  2. Your data nodes do not have the same specification or amount of heap. Elasticsearch will assume the nodes are equal, so this can lead to imbalance and added pressure on the weaker node.
  3. Given the number of data nodes and the amount of data in the cluster, you have far too many indices and shards. I do not consider this a low number of shards at all. Please read this blog post for practical guidance about shards and sharding.
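
Both of the first two points are quick to confirm with the cat nodes API, for example (pick whichever columns you like):

GET _cat/nodes?v&h=name,node.role,version,heap.max,ram.max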

Thanks for the help.

  1. Don't know how I missed this. This should be an easy fix. Looks like all my master nodes are on different versions.
  2. Yes, this is our fault. We found out the third data node did not have ECC memory installed when we were adding memory. This should be fixed as soon as the new memory we have on order arrives. I assumed this was going to cause issues, and I know that node may fail (and it does).
  3. Re-reading the blog post, I came up with new questions:
  • I remember setting the shard count to 14 to keep each shard around 35 GB in size (500 GB index / 35 GB target shard size ≈ 14.2 shards). I have also seen guidance stating you want to stay under roughly 600 shards per 30 GB of heap as a rule of thumb. If I target 450 shards per data node, that means I need around 23 data nodes of my current size (CPU and memory, not storage) to effectively run this system as is: (14 shards × 365 days) × 2 years of current data = 10,220 shards, and 10,220 / 450 per node ≈ 23 data nodes. Correct? (Current per-node counts can be pulled with the _cat/allocation call after this list.)
  • Does the 600 shards per 30 GB heap rule of thumb include replica shards? I have replication set to 1, which will double all my counts.
  • How big is too big for an index? I am currently rolling over daily at around 500 GB per index. I am thinking of changing the rollover from daily to every 2 to 4 days to get the index count down. I know this will not reduce the shard count, as I will still target 35 GB per shard.
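
For the per-node math above, the live shard count and disk usage per data node can be pulled with the allocation cat API, e.g.:

GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail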

When I look at the stats you provided I see you have 14.9 TB across 6,225 shards, which gives an average shard size of around 2.5 GB. This is a lot less than the 35 GB you mention, so I suspect you have different types of indices, some of which have much smaller shards.
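
An easy way to see that distribution is to list the shards sorted by size, for example:

GET _cat/shards?v&h=index,shard,prirep,docs,store&s=store:desc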

Yes, this includes replica shards, as they also take up resources and need to be tracked in the cluster state.

It varies from use case to use case and depends on the data being indexed, but when long retention is required I have seen shard sizes well beyond 50 GB. As you have a use case with a long retention period, I would also recommend you watch this webinar about optimising for storage in Elasticsearch.

Thanks again for all the help here.

After double-checking our tracking app I found we had a conversion issue: indexes are 50 GB on average, not 500 GB. Given that information, I am going to readjust the number of shards to 3 for now, one for each data node.

I am going to reread the article on merging shards, and I will read/watch the webinar that you posted. My biggest issue, as I see it, is that I have two conflicting requirements: long retention and fast access to all the data for external apps (Grafana and alerts in Graylog).
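
From what I have read, the shrink API looks like the way to merge shards on the existing read-only indices. A rough sketch of what I plan to test (the index and node names are placeholders, and the target shard count has to be a factor of the current 14):

PUT graylog_1234/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "graylog-data-01",
    "index.blocks.write": true
  }
}

POST graylog_1234/_shrink/graylog_1234_shrunk
{
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}

The first call parks a copy of every shard on one node and blocks writes, which shrink requires; the second creates a new 2-shard index from the 14-shard source.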

Again, thanks for a second set of eyes on this issue. I should be able to readjust the cluster for stability while I plan a new cluster setup.
