Problems in my Cluster

Hi,

We are experiencing some trouble with our cluster. When we come into the office on Monday, one or two of our nodes are gone, including the master.
I also get this message in the logs:
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
From what I have read here in the forum, that could be because I have too many shards, which is highly possible when I look at my cluster health.

{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 4,
"active_primary_shards" : 9100,
"active_shards" : 23225,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

Across all indices we have approximately 20 million documents.

I would really appreciate any advice on improving the stability of my cluster, because getting my cluster health back to green is a pain in the neck.

Kind regards,
Andy

You have far too many shards for a cluster that size. You need to revise your sharing strategy and bring that down by at least an order of magnitude. Aim for an average shard size between a few GB and a few tens of GB.

Where can I check the size of a shard?
And which configuration would you recommend for my use case, if I may ask?

You can check shard and index size through the _cat/indices and _cat/shards APIs; see the examples below. What type of data do you have in the cluster? What is your current sharding strategy? If you are using time-based indices, what is your retention period? Which version of Elasticsearch are you using?
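
For example, assuming the cluster is reachable on localhost:9200 (adjust the host and port to your setup):

# Per-index primary/replica counts, document count, and size on disk
curl 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'

# Size of every individual shard
curl 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store'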

Thank you.
OK, my biggest index is around 18 GB... and some of my shards are around 1.5 GB.
We are using it for Apache log files, some Windows service logs, and, for the last month or so, the output of our Docker containers.
We create a new index for every day, but we have 9 indices per day.
We are running Elasticsearch 5.0.1.
I am not quite sure what you meant by "sharing strategy", but if it is the shard and replica config, here it is:

{
  "logstash-2017.05.16" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "5s",
        "number_of_shards" : "5",
        "provided_name" : "logstash-2017.05.16",
        "creation_date" : "1494892819101",
        "number_of_replicas" : "1",
        "uuid" : "pPFz1d3EQEe6XY-dlw344w",
        "version" : {
          "created" : "5000199"
        }
      }
    }
  }
}

That was supposed to be sharding, not sharing. The biggest index seems OK, but it probably does not need 5 primary shards. Adjust the number of primary shards and do not use the default of 5 for very small indices. Also consider consolidating small indices and/or using weekly or even monthly indices instead of daily ones.
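
For example, you could install an index template so that new daily indices are created with a single primary shard right away. This is only a sketch: the template name logstash_shards is a placeholder, and order 1 is used so it overrides the default Logstash template (which has order 0):

curl -XPUT 'localhost:9200/_template/logstash_shards' -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'

Note that this only affects indices created after the template is installed; existing indices keep their current shard count.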

If I am not completely wrong, I can't change the number of shards of an existing index without deleting it?
But first of all, thank you for your help. You have already helped me a lot.

As you are on Elasticsearch 5.x, the shrink index API can help you get from 5 shards to 1 shard per index. You may also be able to reduce the number of replicas you have configured in order to bring the shard count down. Beyond that (and I think you will need to reduce the shard count further than that), you will need to reindex data. This can take time, so do change the settings for newly created indices right away so that you generate fewer new shards per day.
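
As a sketch of the shrink workflow for one of your daily indices (shrink_node_name stands for the name of one of your data nodes, and the target index name is just an example):

# Step 1: block writes and move one copy of every shard to a single node
curl -XPUT 'localhost:9200/logstash-2017.05.16/_settings' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }
}'

# Step 2: once relocation has finished, shrink into a new single-shard index
curl -XPOST 'localhost:9200/logstash-2017.05.16/_shrink/logstash-2017.05.16-shrunk' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 1
  }
}'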

Can you give me advice on reindexing as well? I have never done that before.

You should be able to use the reindex API to do this.
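
A minimal example, assuming you want to combine several daily indices into one monthly index (the target name logstash-2017.05 is just an illustration):

curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": ["logstash-2017.05.15", "logstash-2017.05.16"] },
  "dest": { "index": "logstash-2017.05" }
}'

Once you have verified the new index, you can delete the old daily indices to free up their shards.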

I think this looks a whole lot better.
Thank you again for your help.

{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 4,
"active_primary_shards" : 2866,
"active_shards" : 5977,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
