7.0 Upgrade Assistant open shards


(Vam Pikmin) #1

Number of open shards exceeds cluster soft limit

warning

Documentation

There are [9374] open shards in this cluster, but the cluster is limited to [1000] per data node, for [1000] maximum.

Hi everyone,
I'm using only one host and four indexes, ingesting winlogbeat, netflow and syslog data

Of the three, winlogbeat and netflow can each reach up to 1GB a day.

The server is running on a Linux VM under ESXi. If I had to choose between allocating more resources to the existing VM or creating new VMs as replicas, what would be best practice to make it more responsive, given that the replicas would still use the same storage?

Thanks


(Gordon Brown) #2

Hi!

First off, the documentation link is broken - that's a known issue for Elasticsearch 6.6, and will be fixed in an upcoming release. You can find the correct documentation here.

Are you certain that you're only using four indices? For example, by default, winlogbeat will create a new index each day, with the name pattern winlogbeat-<version>-yyyy.MM.dd. You can see all your indices by running GET _cat/indices?v.

Assuming your setup does indeed have 9374 shards on one node, I'm a bit surprised you haven't already had significant problems. Instead of allocating more resources to one node or adding more nodes, I'd suggest you invest some time in:

  1. Adjusting the index templates used by your Beats (and other ingest sources) so that new indices are created with fewer shards (see the sketch after this list),
  2. Deleting some old indices, if possible,
  3. Using the Shrink API to reduce the number of shards you have,
  4. Using Reindexing to collect the data from many small indices into a smaller number of larger indices.
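
For point 1, a rough sketch of what I mean - the template name, order value and index patterns here are just examples, and it would need to take precedence over (or you could instead edit) the templates your Beats and Logstash pipelines already load. Since you mentioned you only have one host, replicas can never be assigned anyway, so this also drops them:

curl -XPUT localhost:9200/_template/one-shard-example -H 'Content-Type: application/json' -d '{
  "index_patterns": ["winlogbeat-*", "netflow-*"],
  "order": 10,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'

For winlogbeat itself you can get the same effect by setting index.number_of_shards: 1 under setup.template.settings in winlogbeat.yml. Either way, this only applies to indices created after the change; existing indices keep their shard count, which is where points 3 and 4 come in.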

The 1000 shards per data node limit being introduced in Elasticsearch 7.0 is intended as a safety limit, because we typically see increased instability in clusters with very high shard counts. The amount of data you're ingesting could easily be handled with a much lower shard count, which would also make more efficient use of the resources you already have.
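
If you want to see exactly how close you are to that limit, cluster health gives you the shard totals - something like:

curl 'localhost:9200/_cluster/health?pretty'

and then look at the active_shards and unassigned_shards values in the response.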

You can find information about shard sizing and overhead in this blog post as well.


(Vam Pikmin) #3

Hi Gordon,
Thanks very much for your advice.
I have noticed the CPU usage is rarely below 50% and large queries in Kibana take a while to process.
I'm not sure about the exact terminology - when I said four indexes, I meant that I have four patterns:
logstash
netflow
cisco-asa
winlogbeat
Like you said, each of these does have multiple daily indices (shards?). For today it looks like this:

>     green  open .monitoring-es-6-2019.03.15       TptkVm32RgmNfkoFy9Hcmw 1 0 4266355 134071   2.3gb   2.3gb
>     yellow open cisco-asa-2019.03.15              9nP54981RGC2A7XrGhcIVg 5 1  157585      0  71.1mb  71.1mb
>     yellow open netflow-2019.03.15                O8Vw1X8HTYW8nVyjtHZmZQ 5 1 7276889      0   1.8gb   1.8gb
>     yellow open winlogbeat-6.6.1-2019.03.15       Cmi0nRHgQsu6OM7-NOMxwA 5 1  299917      0 302.2mb 302.2mb
>     yellow open logstash-2019.03.15               nXTQFOWtQoOE47Vvy61neA 5 1    1157      0 787.3kb 787.3kb
>     green  open .monitoring-logstash-6-2019.03.15 lNpRb_g7Qwa6XA8zv_1mSA 1 0  509288      0  94.8mb  94.8mb
>     green  open .monitoring-kibana-6-2019.03.15   rFBTfa9jRhWAEOXExK_GLQ 1 0    8640      0   2.2mb   2.2mb

When I run
curl -XGET localhost:9200/logstash*?pretty

"settings" : {
"index" : {
"refresh_interval" : "5s",
"number_of_shards" : "5",
"provided_name" : "logstash-2019.03.19",
"creation_date" : "1552954079105",
"number_of_replicas" : "1",
"uuid" : "BT7wVtIESB2U0XgErtDLVA",
"version" : {
"created" : "6060099"

Does that mean I have 5 shards per index?

I have been deleting netflow indices older than 2 months. I have been keeping the rest for now.

I will read the articles you linked; I'm having a bit of trouble grasping it all.

For anything older than a month that I want to keep, should I somehow close it so it's not using any resources, and then open it again if I need to go back to it?

Thank you


(Vam Pikmin) #4

Do you have any examples for me to test the reindex?

Can I specify a from and to date when I use the reindex command, or does it always have to be run on all of the indices (logstash-*)?
Will the new index be a single logstash-copy, or will this be a copy of the data per day as before?
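
I'm guessing it would be something along these lines, with a range query under "source" (assuming the timestamp field is @timestamp), but I don't know if that's the right way to do it:

curl -XPOST localhost:9200/_reindex -H 'Content-Type: application/json' -d '{
  "source": {
    "index": "logstash-*",
    "query": { "range": { "@timestamp": { "gte": "2018-06-01", "lt": "2018-07-01" } } }
  },
  "dest": { "index": "logstash-2018.06" }
}'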
Sorry for asking stupid questions

I've worked out the close command finally... This is after using logstash for over a year. Some of us are slow.
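
In case it helps anyone else finding this thread, it's just something like the below (the index name is only an example), with the matching _open call to bring it back:

curl -XPOST 'localhost:9200/cisco-asa-2018.06.01/_close'
curl -XPOST 'localhost:9200/cisco-asa-2018.06.01/_open'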

I've started with this command now as a test, combining the old daily indices into a monthly one:

curl -XPOST localhost:9200/_reindex -H 'Content-Type: application/json' -d '{
  "source": { "index": "cisco-asa-2018.06.*" },
  "dest": { "index": "cisco-asa-2018.06" }
}'