Upgrade ELK Stack from 5.x -> 6.5 - Now have stability issues / errors!

Hi all, I would really appreciate your help on the following. I have been running an ELK stack very successfully for over 2 years on version 5.x. There was a requirement to upgrade, which I have done for all components (Elasticsearch, Logstash and Kibana), and it was quite straightforward on CentOS 7.

Now I have errors in Kibana:

"Request to Elasticsearch failed: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Trying to query 2305 shards, which is over the limit of 2300. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time."}],"type":"illegal_argument_exception","reason":"Trying to query 2305 shards, which is over the limit of 2300. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time."},"status":400}"

I have to admit I'm not an expert on the elasticsearch setup and optimized configuration.

My configuration:

I only have 1 node currently.

Output of curl -XGET http://localhost:9200/_cluster/settings?pretty:
[root@sc-logs-prd-01 kibana]# curl -XGET http://localhost:9200/_cluster/settings?pretty
{
  "persistent" : {
    "action" : {
      "search" : {
        "shard_count" : {
          "limit" : "2300"
        }
      }
    }
  },
  "transient" : { }
}

Output of curl -s http://localhost:9200/_cluster/health?pretty:
{
  "cluster_name" : "sc-logger",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 2318,
  "active_shards" : 2318,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2046,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 53.11640696608616
}

Output of curl -XGET http://localhost:9200/logstash-prd-year-2019.10.14?pretty (truncated):
  "settings" : {
    "index" : {
      "refresh_interval" : "5s",
      "number_of_shards" : "5",
      "provided_name" : "logstash-prd-year-2019.10.14",
      "creation_date" : "1571011200824",
      "number_of_replicas" : "1",
      "uuid" : "qoHObP6uT_q26XQq0uGeoA",
      "version" : {
        "created" : "6050499"

Questions:

  1. I have no idea how to fix this and am guessing it's something to do with the upgrade? I don't want to just keep increasing the shard limit.
  2. It looks like I have 5 shards per index - I have no idea whether this is too many or not, and if so, how I reduce it for all historic and new indices.
  3. I only currently have 1 node in the cluster - perhaps I need to increase this?
  4. I have "number_of_replicas" set to 1 - does this refer to the number of nodes or something else?
  5. I notice I have an awful lot of unassigned shards (2046) and don't know if this is a problem or not, and if it is, how to fix it.

Basically I'm in a bit of a mess, so I would really appreciate any help!

Many thanks.

huowen

You have far too many shards for a one-node cluster. Newer versions of Elasticsearch have limits in place to protect you from the terrible effects of having too many shards. As the message says:

... querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards.

Here is an article with some more detailed recommendations on sharding:

You also have number_of_replicas: 1 on some of your indices, which does not work in a one-node cluster: a replica cannot be allocated on the same node as its primary, so only the primaries can be assigned. You should set number_of_replicas: 0 on all your indices to remove the unassigned shards.
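As a minimal sketch of that change with curl (assuming Elasticsearch is listening on localhost:9200 as in the outputs above), the _all target applies the setting to every existing index:

curl -XPUT "http://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'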

In fact the limit you are currently hitting, action.search.shard_count.limit, was rendered unnecessary in #24012. You can remove it like this:

PUT _cluster/settings
{"persistent":{"action.search.shard_count.limit":null}}

You must still address the enormous excess of shards in your cluster, or else you will hit other limits and issues.
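If it helps to see where those shards are coming from, a standard way (not specific to this thread) is the cat indices API, which lists each index with its primary and replica shard counts:

curl -s "http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size"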

Thank you so much @DavidTurner for taking the time on this. What you say makes sense. I actually noticed that the cron job running Curator was failing because it was an old version and therefore not cleaning up the old indices, which is why I hit this shard limit suddenly, rather than it being caused by the upgrade. That said, I know this is a temporary fix and I need to sort out the root of the issue as you have mentioned above.

Sorry another couple of questions please:

Some context:
The requirement to upgrade came from a need to migrate to another server, and I thought the easiest way was to upgrade the existing one to 6.5.4 and then migrate to the new one on the same version using the reindex API.

  1. In the new cluster, is "number_of_replicas: 0" set by default as there is only one node, or do I explicitly need to set this in the elasticsearch.yml file?

  2. When I run the reindex I hit the following problem with more than one mapping type, coming from the old Elasticsearch 5.x data before it got upgraded to version 6.x.

Example:

Just to recap: in the old Elasticsearch 5.x you can have more than one type per index, while in 6.x you can only have one type. On the old server, since we upgraded in place from 5.x to 6.x, indices with two types are still permitted. When reindexing from the upgraded ELK stack to a new server, two types are not permitted.

OLD ELK:
{
  "logstash-prd-year-2019.01.20": [
    "_default_",
    "cisco_app_logger",
    "logstash"
  ]
}

NEW ELK:
{
  "logstash-prd-year-2019.01.20": [
    "_default_",
    "cisco_app_logger"
  ]
}

On the old ELK, the index logstash-prd-year-2019.01.20 has two types: cisco_app_logger and logstash.

On the new ELK, the same index only has the cisco_app_logger type.

So the question is: how do I resolve this for all 480 indices I have, or is there a better way to migrate this data over to the new server?

THANK YOU so much.

Huowen.

This is an index setting, so you cannot set it in elasticsearch.yml. You should set it in your index templates so it applies to any newly-created indices, and then set it on each index that currently exists too.
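As a rough sketch (the template name and index pattern here are assumptions; adjust them to match your Logstash indices, and bear in mind any existing Logstash template you may already have), a 6.x index template that applies the setting to newly created indices might look like:

curl -XPUT "http://localhost:9200/_template/logstash-prd-replicas" -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["logstash-prd-*"],
  "settings": {
    "index": { "number_of_replicas": 0 }
  }
}'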

Regarding the removal of multiple mapping types in 6.x, the docs have some suggestions for how you can adjust your architecture to fit into the new one-type-per-index world.
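One approach described there is a custom type field: reindex each old index into a new single-type index and copy the original type into an ordinary field so it can still be filtered on. A hypothetical sketch for a single index follows; the target index name, the "doc" type, and the original_type field name are all assumptions, so pick a field name that does not clash with fields already in your documents, and note that documents sharing the same _id across types would also need their IDs adjusted. (When reindexing from the old server, the source block would additionally carry a remote host entry.)

curl -XPOST "http://localhost:9200/_reindex" -H 'Content-Type: application/json' -d '
{
  "source": { "index": "logstash-prd-year-2019.01.20" },
  "dest": { "index": "logstash-prd-year-2019.01.20-migrated", "type": "doc" },
  "script": { "source": "ctx._source.original_type = ctx._type" }
}'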

Thanks again @DavidTurner for your quick response.

Forgive me for my lack of knowledge; do I have to modify every single index to change the number of replicas, and then set it in the index template for any new indices being created? If so, is there an easy way of doing this?

Again, for the removal of the multiple mapping types, do I need to modify every single index? If so, is there an easy way of doing this?

For the migration, is it possible to shut down Elasticsearch and lift and shift the data directory from the old server to the new one?

Thanks again.

huowen

From the docs I shared above:

To update a setting for all indices, use _all or exclude this parameter.

I would update the templates first and then do the indices. If you do it in the other order then you might miss an index that is created between the two steps.

Yes, each reindex job will need to account for the change in mappings.

Yes, that's possible too.
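In very rough terms, a lift-and-shift sketch (paths and service names assume the CentOS RPM install defaults; both servers must run the same Elasticsearch version, and the copy must happen while Elasticsearch is stopped):

# on the old server: stop Elasticsearch so the data files are consistent
sudo systemctl stop elasticsearch

# copy the data directory (default path.data for RPM installs; adjust if yours differs)
rsync -a /var/lib/elasticsearch/ newserver:/var/lib/elasticsearch/

# on the new server: fix ownership, then start Elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo systemctl start elasticsearch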

Thanks @DavidTurner, this worked great for updating all indices so now I have zero unassigned shards :slight_smile:

If I want to change the number of shards per index, which is currently 5 and I feel is too many - can I do this in one API call, similar to the "number_of_replicas" update?

RE: Lifting and shifting the data for the server migration - will I still need to fix the multiple mapping issue I experienced with the reindex?

I'm so grateful for your help and thank you.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.