Queued tasks and snapshot operation not working

Hi,
I have an Elasticsearch cluster with 6 data nodes and 3 master nodes.
When I execute a snapshot I receive the error "process_cluster_event_timeout_exception".

Looking at "/_cat/pending_tasks" in my cluster, there are 69 tasks with priority HIGH and source put-mapping.

My cluster is for centralized logging, and these processes put data into it:

  • logstash - collects from Redis and writes to Elasticsearch
  • apm-server
  • filebeat
  • metricbeat

I have been removing Beats and some applications from apm-server.

Is there a way to change the priority of create_snapshot from NORMAL to HIGH?
Even if that is not a real solution, how do I check the correct size for my cluster?

*Normally I keep indices for 7 days in my cluster because of the backup.
But because of this error, I have disabled the process that deletes old data.

GET _cat/nodes?v&s=node.role:desc

ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.2.8 47 50 0 0.00 0.00 0.00 mi - prd-elasticsearch-i-020
10.0.0.7 14 50 0 0.00 0.00 0.00 mi - prd-elasticsearch-i-0ab
10.0.1.1 47 77 29 1.47 1.72 1.66 mi * prd-elasticsearch-i-0e2
10.0.2.7 58 95 19 8.04 8.62 8.79 d - prd-elasticsearch-i-0b4
10.0.2.4 59 97 20 8.22 8.71 8.76 d - prd-elasticsearch-i-00d
10.0.1.6 62 94 38 11.42 8.87 8.89 d - prd-elasticsearch-i-0ff
10.0.0.6 67 97 25 8.97 10.45 10.47 d - prd-elasticsearch-i-01a
10.0.0.9 57 98 32 11.63 9.64 9.17 d - prd-elasticsearch-i-005
10.0.1.0 62 96 19 10.45 9.53 9.31 d - prd-elasticsearch-i-088

My cluster definitions:

     {
      "_nodes": {
        "total": 9,
        "successful": 9,
        "failed": 0
      },
      "cluster_name": "prd-elasticsearch",
      "cluster_uuid": "xxxx",
      "timestamp": 1607609607018,
      "status": "green",
      "indices": {
        "count": 895,
        "shards": {
          "total": 14006,
          "primaries": 4700,
          "replication": 1.98,
          "index": {
            "shards": {
              "min": 2,
              "max": 18,
              "avg": 15.649162011173184
            },
            "primaries": {
              "min": 1,
              "max": 6,
              "avg": 5.251396648044692
            },
            "replication": {
              "min": 1,
              "max": 2,
              "avg": 1.9787709497206705
            }
          }
        },
        "docs": {
          "count": 14896803950,
          "deleted": 843126
        },
        "store": {
          "size_in_bytes": 16778620001453
        },
        "fielddata": {
          "memory_size_in_bytes": 4790672272,
          "evictions": 0
        },
        "query_cache": {
          "memory_size_in_bytes": 7689832903,
          "total_count": 2033762560,
          "hit_count": 53751516,
          "miss_count": 1980011044,
          "cache_size": 4087727,
          "cache_count": 11319866,
          "evictions": 7232139
        },
        "completion": {
          "size_in_bytes": 0
        },
        "segments": {
          "count": 155344,
          "memory_in_bytes": 39094918196,
          "terms_memory_in_bytes": 31533157295,
          "stored_fields_memory_in_bytes": 5574613712,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 449973760,
          "points_memory_in_bytes": 886771949,
          "doc_values_memory_in_bytes": 650401480,
          "index_writer_memory_in_bytes": 905283962,
          "version_map_memory_in_bytes": 1173400,
          "fixed_bit_set_memory_in_bytes": 12580800,
          "max_unsafe_auto_id_timestamp": 1607606224903,
          "file_sizes": {}
        }
      },
      "nodes": {
        "count": {
          "total": 9,
          "data": 6,
          "coordinating_only": 0,
          "master": 3,
          "ingest": 3
        },
        "versions": [
          "6.8.1"
        ],
        "os": {
          "available_processors": 108,
          "allocated_processors": 108,
          "names": [
            {
              "name": "Linux",
              "count": 9
            }
          ],
          "pretty_names": [
            {
              "pretty_name": "CentOS Linux 7 (Core)",
              "count": 9
            }
          ],
          "mem": {
            "total_in_bytes": 821975162880,
            "free_in_bytes": 50684043264,
            "used_in_bytes": 771291119616,
            "free_percent": 6,
            "used_percent": 94
          }
        },
        "process": {
          "cpu": {
            "percent": 349
          },
          "open_file_descriptors": {
            "min": 429,
            "max": 9996,
            "avg": 6607
          }
        },
        "jvm": {
          "max_uptime_in_millis": 43603531934,
          "versions": [
            {
              "version": "1.8.0_222",
              "vm_name": "OpenJDK 64-Bit Server VM",
              "vm_version": "25.222-b10",
              "vm_vendor": "Oracle Corporation",
              "count": 9
            }
          ],
          "mem": {
            "heap_used_in_bytes": 137629451248,
            "heap_max_in_bytes": 205373571072
          },
          "threads": 1941
        },
        "fs": {
          "total_in_bytes": 45245361229824,
          "free_in_bytes": 28231010959360,
          "available_in_bytes": 28231011147776
        },
        "plugins": [
          {
            "name": "repository-s3",
            "version": "6.8.1",
            "elasticsearch_version": "6.8.1",
            "java_version": "1.8",
            "description": "The S3 repository plugin adds S3 repositories",
            "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
            "extended_plugins": [],
            "has_native_controller": false
          }
        ],
        "network_types": {
          "transport_types": {
            "security4": 9
          },
          "http_types": {
            "security4": 9
          }
        }
      }
    }

This sounds like the problem: you're updating mappings too often. The first thing I'd suggest is to stop doing that.
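If the mapping updates come from dynamic mappings on the log indices, one way to cut them off is an index template that disables dynamic mapping, so new fields no longer trigger put-mapping tasks. A sketch for 6.x — the template name and index pattern here are made up, and the mapping type key (`_doc`) must match whatever type your existing templates use:

```
PUT _template/logs_no_dynamic
{
  "index_patterns": ["logstash-*"],
  "order": 10,
  "mappings": {
    "_doc": {
      "dynamic": false
    }
  }
}
```

With `"dynamic": false` new fields are still stored in `_source` but are not added to the mapping (use `"strict"` instead if you'd rather reject them outright). It only applies to indices created after the template is in place.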

Hi @DavidTurner, thanks for your reply.
How can I stop or minimize the put-mapping process?
I am using Logstash, Metricbeat, and Filebeat to put data.

How can I identify the process that causes the put-mapping?

For example:
GET /_tasks?detailed=true

 "1EWDKShaR4muQ2_6KCd2dg:1059431559" : {
          "node" : "1EWDKShaR4muQ2_6KCd2dg",
          "id" : 1059431559,
          "type" : "netty",
          "action" : "indices:admin/mapping/put",
          "description" : "",
          "start_time_in_millis" : 1607611398514,
          "running_time_in_nanos" : 608622968,
          "cancellable" : false,
          "parent_task_id" : "y0uMNye4Sc-y5lM0p59G6Q:513942027",
          "headers" : { }
        },
        "1EWDKShaR4muQ2_6KCd2dg:1059431558" : {
          "node" : "1EWDKShaR4muQ2_6KCd2dg",
          "id" : 1059431558,
          "type" : "netty",
          "action" : "indices:admin/mapping/put",
          "description" : "",
          "start_time_in_millis" : 1607611398498,
          "running_time_in_nanos" : 624527171,
          "cancellable" : false,
          "parent_task_id" : "Ny8ihNH2SO-qxXROkZj4-A:575706969",
          "headers" : { }
        },

Good question. Check GET /_cluster/pending_tasks?pretty&human — I think this identifies the index whose mapping is being updated.

Hi @DavidTurner, thanks for your reply.

For this command I receive this result:

{
  "tasks" : [
    {
      "insert_order" : 268970200,
      "priority" : "HIGH",
      "source" : "put-mapping",
      "executing" : true,
      "time_in_queue_millis" : 1453,
      "time_in_queue" : "1.4s"
    },
    {
      "insert_order" : 268970201,
      "priority" : "HIGH",
      "source" : "put-mapping",
      "executing" : false,
      "time_in_queue_millis" : 1451,
      "time_in_queue" : "1.4s"
    },
....

Oh, that's not helpful. What version is this?

Hi, @DavidTurner,
My version is 6.8.1

Ah, ok: we only added the index name to that output in https://github.com/elastic/elasticsearch/pull/52690 (i.e. in 7.7.0). I don't have any other great suggestions for a version as old as 6.8.1, sorry. It could well be any of the clients you listed.
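One possible workaround on 6.8, assuming the master logs mapping updates at DEBUG (the logger name below is the 6.x class name, `MetaDataMappingService` — verify it against your version), is to raise the log level for the mapping service and watch the elected master's log for lines naming the index being updated:

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.metadata.MetaDataMappingService": "DEBUG"
  }
}
```

This can be noisy on a busy cluster, so set the logger back to `null` once you've identified the offending indices.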

Understood, that is possible.
Thanks!