ES shard optimisation and improving cluster health

Hello,
On my cluster (version 7.1), I keep getting a yellow state from time to time, along with an increase in 4xx errors. Currently, the cluster has 3 master nodes and 4 data nodes with 640 GB of disk space per data node.

curl -X GET  'https://es_domain/_cluster/health?pretty'
{
  "cluster_name" : "elk",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 4,
  "discovered_master" : true,
  "active_primary_shards" : 12989,
  "active_shards" : 25978,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

The cluster has 114,520,959 searchable documents.

The OldCollectionTime metric has jumped from 300,000 to 600,000, and MaxMemoryUtilisation has jumped from an average of 80% to 95%.

I have one node with CPU at 80%, whereas the rest are between 5% and 10%.

How do I further investigate these spikes and improve the overall health of the ES cluster?

Any advice is much appreciated.

You probably have too many shards per node.
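To dig into the spikes themselves, a couple of standard APIs are worth a look first (a quick sketch; a managed domain may restrict some endpoints). _cat/allocation shows how shards and disk are spread across the data nodes, and hot_threads shows what the busy node is actually doing:

# shards and disk usage per data node
curl -X GET 'https://es_domain/_cat/allocation?v'

# dump the hottest threads on each node (look at the one sitting at 80% CPU)
curl -X GET 'https://es_domain/_nodes/hot_threads'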

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right


Using the maths from those resources, you would need something like 325 GB of heap per data node to correctly manage all those shards. :scream:


Hi,
Thanks for the reply.

Can you elaborate on the maths you used to derive the 325 GB heap figure?

Here are the nodes stats for my cluster:

 curl -X GET  'https://es_domain/_nodes/stats/os,process?pretty'                                    
{
  "_nodes" : {
    "total" : 7,
    "successful" : 7,
    "failed" : 0
  },
  "cluster_name" : "elk",
  "nodes" : {
    "85icxeyKSQCwXvLggOwzwQ" : {
      "timestamp" : 1579778170512,
      "name" : "994c05903dfd93be153cf0cac45fd19e",
      "roles" : [ "master" ],
      "os" : {
        "timestamp" : 1579778170513,
        "cpu" : {
          "percent" : 1,
          "load_average" : {
            "1m" : 0.06,
            "5m" : 0.03,
            "15m" : 0.06
          }
        },
        "mem" : {
          "total_in_bytes" : 3890528256,
          "free_in_bytes" : 187572224,
          "used_in_bytes" : 3702956032,
          "free_percent" : 5,
          "used_percent" : 95
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 2045775872,
          "used_in_bytes" : 101703680
        }
      },
      "process" : {
        "timestamp" : 1579778170513,
        "open_file_descriptors" : 917,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 0,
          "total_in_millis" : 20256870
        },
        "mem" : {
          "total_virtual_in_bytes" : 5089357824
        }
      }
    },
    "s5Bok49ZT0mFlinXYNoy7w" : {
      "timestamp" : 1579778170514,
      "name" : "1a706559c0edd218b7e6f191eb4e2b84",
      "roles" : [ "data", "ingest" ],
      "os" : {
        "timestamp" : 1579778170514,
        "cpu" : {
          "percent" : 2,
          "load_average" : {
            "1m" : 0.05,
            "5m" : 0.2,
            "15m" : 0.26
          }
        },
        "mem" : {
          "total_in_bytes" : 16823488512,
          "free_in_bytes" : 704200704,
          "used_in_bytes" : 16119287808,
          "free_percent" : 4,
          "used_percent" : 96
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 2135166976,
          "used_in_bytes" : 12312576
        }
      },
      "process" : {
        "timestamp" : 1579778170514,
        "open_file_descriptors" : 55279,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 3,
          "total_in_millis" : 206513740
        },
        "mem" : {
          "total_virtual_in_bytes" : 36546838528
        }
      }
    },
    "9v4ajjIaTh-q1aHCfrmCvQ" : {
      "timestamp" : 1579778170513,
      "name" : "0f2202f519ef220131ef421ed0d0dffe",
      "roles" : [ "master" ],
      "os" : {
        "timestamp" : 1579778169522,
        "cpu" : {
          "percent" : 2,
          "load_average" : {
            "1m" : 0.09,
            "5m" : 0.15,
            "15m" : 0.11
          }
        },
        "mem" : {
          "total_in_bytes" : 3919888384,
          "free_in_bytes" : 193150976,
          "used_in_bytes" : 3726737408,
          "free_percent" : 5,
          "used_percent" : 95
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 1995804672,
          "used_in_bytes" : 151674880
        }
      },
      "process" : {
        "timestamp" : 1579778170513,
        "open_file_descriptors" : 901,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 2,
          "total_in_millis" : 61056490
        },
        "mem" : {
          "total_virtual_in_bytes" : 5146169344
        }
      }
    },
    "7afZAeyTT_W0Gug5VtfH7Q" : {
      "timestamp" : 1579778170514,
      "name" : "3e861c6aaa48ef08306ae1cb1391a26e",
      "roles" : [ "data", "ingest" ],
      "os" : {
        "timestamp" : 1579778170514,
        "cpu" : {
          "percent" : 21,
          "load_average" : {
            "1m" : 0.16,
            "5m" : 0.22,
            "15m" : 0.31
          }
        },
        "mem" : {
          "total_in_bytes" : 16823488512,
          "free_in_bytes" : 549515264,
          "used_in_bytes" : 16273973248,
          "free_percent" : 3,
          "used_percent" : 97
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 2135351296,
          "used_in_bytes" : 12128256
        }
      },
      "process" : {
        "timestamp" : 1579778170514,
        "open_file_descriptors" : 55057,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 2,
          "total_in_millis" : 197978070
        },
        "mem" : {
          "total_virtual_in_bytes" : 36024020992
        }
      }
    },
    "aE_ZkaTVSxSGdLwWgmemlQ" : {
      "timestamp" : 1579778170513,
      "name" : "4511fd0e24e5159f510327a741ac6156",
      "roles" : [ "master" ],
      "os" : {
        "timestamp" : 1579778170514,
        "cpu" : {
          "percent" : 1,
          "load_average" : {
            "1m" : 0.0,
            "5m" : 0.0,
            "15m" : 0.0
          }
        },
        "mem" : {
          "total_in_bytes" : 3919888384,
          "free_in_bytes" : 194904064,
          "used_in_bytes" : 3724984320,
          "free_percent" : 5,
          "used_percent" : 95
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 2043744256,
          "used_in_bytes" : 103735296
        }
      },
      "process" : {
        "timestamp" : 1579778170514,
        "open_file_descriptors" : 918,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 0,
          "total_in_millis" : 20705020
        },
        "mem" : {
          "total_virtual_in_bytes" : 5091344384
        }
      }
    },
    "WfZKQIBBSQeOQHSiFMf7xg" : {
      "timestamp" : 1579778170513,
      "name" : "5839ef32d904f909683c1a2ab5f58322",
      "roles" : [ "data", "ingest" ],
      "os" : {
        "timestamp" : 1579778170514,
        "cpu" : {
          "percent" : 5,
          "load_average" : {
            "1m" : 0.02,
            "5m" : 0.14,
            "15m" : 0.22
          }
        },
        "mem" : {
          "total_in_bytes" : 16823488512,
          "free_in_bytes" : 971194368,
          "used_in_bytes" : 15852294144,
          "free_percent" : 6,
          "used_percent" : 94
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 2137833472,
          "used_in_bytes" : 9646080
        }
      },
      "process" : {
        "timestamp" : 1579778170514,
        "open_file_descriptors" : 59953,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 2,
          "total_in_millis" : 209899590
        },
        "mem" : {
          "total_virtual_in_bytes" : 36195016704
        }
      }
    },
    "0PhXU0hsQb6x7pEqHQ3ndw" : {
      "timestamp" : 1579778170514,
      "name" : "f78903d9873fa28deeae67b98923e319",
      "roles" : [ "data", "ingest" ],
      "os" : {
        "timestamp" : 1579778170514,
        "cpu" : {
          "percent" : 17,
          "load_average" : {
            "1m" : 0.17,
            "5m" : 0.21,
            "15m" : 0.25
          }
        },
        "mem" : {
          "total_in_bytes" : 16823488512,
          "free_in_bytes" : 376348672,
          "used_in_bytes" : 16447139840,
          "free_percent" : 2,
          "used_percent" : 98
        },
        "swap" : {
          "total_in_bytes" : 2147479552,
          "free_in_bytes" : 2137329664,
          "used_in_bytes" : 10149888
        }
      },
      "process" : {
        "timestamp" : 1579778170514,
        "open_file_descriptors" : 53475,
        "max_file_descriptors" : 128000,
        "cpu" : {
          "percent" : 2,
          "total_in_millis" : 207187660
        },
        "mem" : {
          "total_virtual_in_bytes" : 36339163136
        }
      }
    }
  }
}

Any advice is much appreciated.

The rule of thumb is no more than 20 shards per GB of heap.

But where did you get the 325 GB figure from?

25,978 shards across 4 data nodes ≈ 6,495 shards per node.
You can have around 20 shards per GB of heap.

6,495 / 20 ≈ 325 GB of heap per node.
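If you want to verify that per-node count yourself, one quick way (a sketch using the standard _cat API) is to count shard rows grouped by node:

curl -s 'https://es_domain/_cat/shards?h=node' | sort | uniq -c

Each line of _cat/shards is one shard, so the counts should come out at around 6,495 per data node.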

Thank you so much for this.
The cluster seems to be up and running without any issues, and each node has the following heap:

curl -X GET  'https://es_domain/_cat/nodes?h=heap*&v'
heap.current heap.percent heap.max
         6gb           75    7.9gb
       5.8gb           72    7.9gb
       5.8gb           73    7.9gb
     288.4mb           14    1.9gb
     359.7mb           17    1.9gb
         1gb           53    1.9gb
       5.9gb           75    7.9gb

So my options would be to first reduce the number of shards per node and then increase the number of instances?

Or is there anything else I can do?

Did you watch the resources I shared initially?

It would really help you IMO.

Yeah. In general, keeping the number of shards as low as possible is better.
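With roughly 8 GB of heap per data node, the 20-shards-per-GB rule of thumb suggests staying somewhere under ~160 shards per node, against the ~6,500 you have today. If many of these are small indices, one way to consolidate them is the _shrink API. A hedged sketch (my-index-000001 and the node name data-node-1 are placeholders; the source index must first be made read-only with a copy of every shard on one node):

# 1. Block writes and move a copy of every shard of the index onto one node
curl -X PUT 'https://es_domain/my-index-000001/_settings' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "data-node-1"
  }
}'

# 2. Shrink to a single primary shard and clear the temporary settings on the target
curl -X POST 'https://es_domain/my-index-000001/_shrink/my-index-000001-shrunk' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 1,
    "index.blocks.write": null,
    "index.routing.allocation.require._name": null
  }
}'

Reindexing many small daily indices into weekly or monthly ones (and then deleting the originals) achieves the same goal and is covered in the sizing resources above.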

Thank you for your help.
I have started to go through the resources.
