Failed to execute pipeline for a bulk request

Hello everyone. I have a cluster with 20 data nodes, each with 4 CPUs and 16 GB of RAM. I'm indexing on average 300 million documents per day (about 500 GB) into this cluster. The cluster works perfectly during most of the day, but every night between 12 and 2 AM I receive the following indexing failure and lose between 3k and 6k messages:

{"type":"es_rejected_execution_exception","reason":"rejected execution of processing of [400745086][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[mainstor_9102][15]] containing [16] requests, target allocation id: dV4A9_upRuG2ysTOUZ-l9Q, primary term: 1 on EsThreadPoolExecutor[name = D0PGnjE/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@fd3d2ed[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 396013837]]"}

Please note that between 11 PM and 4 AM I'm sending the lowest volume of documents to the cluster. Any help will be greatly appreciated.
Thank you.

Welcome to our community! :smiley:

Why are you losing them? Does your client not retry?

If you can share more about your use case and your client, we might be able to provide some advice. However, at this stage the first thing that comes to mind is to increase the size of your nodes.
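To narrow it down, it may also help to watch the write thread pool on each node around the failure window; assuming the 6.x _cat API, something like:

GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed

Nodes whose rejected count climbs during that window are the ones hitting the queue capacity of 200 shown in the error.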

Hello, thank you for replying. My current setup is the following:
-Packetbeat installed on a group of servers
-Packetbeat data is sent to a Kafka queue
-Elasticsearch running with Graylog, and Graylog is set up as the consumer for the Kafka queue
Graylog is not retrying the messages, which is why I'm losing them.

You will need to speak to the Graylog devs about that then, sorry; it's outside our stack.

Hello, thank you for the help. The issue I'm worried about is not that we aren't retrying the messages (I will check that with the Graylog team), but that we are unable to process that bulk of messages even though we are not processing many requests at the time the issue happens. I added 5 more nodes to the cluster to see if that made the errors go away, but I'm still getting the same issue. Any thoughts on why Elasticsearch could not index them?

What is the output from the _cluster/stats?pretty&human API?

Hello @warkolm this is what I get:

{
  "_nodes" : {
    "total" : 32,
    "successful" : 32,
    "failed" : 0
  },
  "cluster_name" : "graylog",
  "cluster_uuid" : "Nz1qys96TJC03GRDgC5Ekw",
  "timestamp" : 1619707741083,
  "status" : "green",
  "indices" : {
    "count" : 149,
    "shards" : {
      "total" : 1358,
      "primaries" : 1289,
      "replication" : 0.05352986811481769,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 25,
          "avg" : 9.114093959731544
        },
        "primaries" : {
          "min" : 1,
          "max" : 25,
          "avg" : 8.651006711409396
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.46308724832214765
        }
      }
    },
    "docs" : {
      "count" : 8434812284,
      "deleted" : 231075
    },
    "store" : {
      "size" : "13.3tb",
      "size_in_bytes" : 14650759159148
    },
    "fielddata" : {
      "memory_size" : "44.8mb",
      "memory_size_in_bytes" : 47063968,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "2.4gb",
      "memory_size_in_bytes" : 2596493422,
      "total_count" : 94168407,
      "hit_count" : 12818361,
      "miss_count" : 81350046,
      "cache_size" : 17495,
      "cache_count" : 81752,
      "evictions" : 64257
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 13523,
      "memory" : "17.7gb",
      "memory_in_bytes" : 19083626555,
      "terms_memory" : "13.3gb",
      "terms_memory_in_bytes" : 14307013655,
      "stored_fields_memory" : "3.6gb",
      "stored_fields_memory_in_bytes" : 3925116256,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "135.7mb",
      "norms_memory_in_bytes" : 142341504,
      "points_memory" : "512.5mb",
      "points_memory_in_bytes" : 537494720,
      "doc_values_memory" : "163.7mb",
      "doc_values_memory_in_bytes" : 171660420,
      "index_writer_memory" : "1.2gb",
      "index_writer_memory_in_bytes" : 1366313848,
      "version_map_memory" : "11.2mb",
      "version_map_memory_in_bytes" : 11763710,
      "fixed_bit_set" : "2.7mb",
      "fixed_bit_set_memory_in_bytes" : 2848648,
      "max_unsafe_auto_id_timestamp" : 1619654402073,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 32,
      "data" : 25,
      "coordinating_only" : 0,
      "master" : 7,
      "ingest" : 25
    },
    "versions" : [
      "6.8.15",
      "6.8.10",
      "6.8.14",
      "6.8.5"
    ],
    "os" : {
      "available_processors" : 114,
      "allocated_processors" : 114,
      "names" : [
        {
          "name" : "Linux",
          "count" : 32
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 16.04.5 LTS",
          "count" : 29
        },
        {
          "pretty_name" : "Ubuntu 16.04.6 LTS",
          "count" : 3
        }
      ],
      "mem" : {
        "total" : "881.8gb",
        "total_in_bytes" : 946920513536,
        "free" : "85.9gb",
        "free_in_bytes" : 92237348864,
        "used" : "795.9gb",
        "used_in_bytes" : 854683164672,
        "free_percent" : 10,
        "used_percent" : 90
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 203
      },
      "open_file_descriptors" : {
        "min" : 939,
        "max" : 2546,
        "avg" : 1324
      }
    },
    "jvm" : {
      "max_uptime" : "26.9d",
      "max_uptime_in_millis" : 2325380677,
      "versions" : [
        {
          "version" : "1.8.0_292",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.292-b10",
          "vm_vendor" : "Private Build",
          "count" : 5
        },
        {
          "version" : "1.8.0_282",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.282-b08",
          "vm_vendor" : "Private Build",
          "count" : 27
        }
      ],
      "mem" : {
        "heap_used" : "184.9gb",
        "heap_used_in_bytes" : 198548860736,
        "heap_max" : "469gb",
        "heap_max_in_bytes" : 503665000448
      },
      "threads" : 2199
    },
    "fs" : {
      "total" : "24tb",
      "total_in_bytes" : 26486726037504,
      "free" : "10.6tb",
      "free_in_bytes" : 11655318769664,
      "available" : "10.5tb",
      "available_in_bytes" : 11654781898752
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 32
      },
      "http_types" : {
        "security4" : 32
      }
    }
  }
}

Thanks. What's the output from _cat/nodes/?v&h=ip,v?

Hello @warkolm, sorry for the delayed response. That query returns the list of IPs in my Elasticsearch cluster.

It looks like you have nodes on many different versions, which can cause problems as shards cannot be relocated from a node with a newer version to one with an older version. I would therefore recommend upgrading all nodes to exactly the same version.

It looks like a lot of shards do not have any replica configured, which can cause problems in case of node failures or even transient issues, e.g. a long GC. I would recommend having replica shards configured at least for all indices you are actively indexing into.
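As a sketch, replicas can be enabled on the active indices via the update settings API (the mainstor_* index pattern here is assumed from the error message; adjust to your actual naming):

PUT mainstor_*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}

Note this roughly doubles disk usage for those indices, so check free space first, and set the same value in the index template so newly created indices inherit it.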

I am not familiar with Graylog, so I cannot help there. It would however be useful to know how many indices and shards you are actively indexing into at any point. Are you using time-based indices, with or without rollover? As you are using Elasticsearch 6.8, what is discovery.zen.minimum_master_nodes set to? (It should be 4 with 7 master-eligible nodes.)

Hello @Christian_Dahlqvist, the nodes with different versions are the master-eligible nodes, not the data nodes; I will be upgrading them so they all run the same Elasticsearch version. All 25 data nodes are on the same version. discovery.zen.minimum_master_nodes is set to 1 with 7 master-eligible nodes. I reached out to the Graylog team and they do retry the messages. I'm using time-based indices with rollover; the rollover period is set to 24h. I'm indexing data into 50 indices with 1 shard per index and a retention policy of 30 days.

Having minimum_master_nodes set to 1 with 7 master-eligible nodes means that your cluster is misconfigured. This could lead to split-brain scenarios and data loss. You should fix this ASAP.
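On 6.8 this setting is dynamic, so as a sketch it can be corrected without a restart (4 being the quorum for 7 master-eligible nodes):

PUT _cluster/settings
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 4
  }
}

It should also be set in elasticsearch.yml on each node so the value survives a full cluster restart.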

If you are using daily time-based indices, a lot of indices may need to be created over a short period of time, which can lead to a lot of cluster state updates. If these are slow, e.g. due to a large cluster state and a large cluster to which the updates need to be propagated, it could cause issues around the times when the indices are created. Check whether the periods where you are having issues correspond to when new indices are created.
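One way to check, assuming the 6.x _cat API: list indices sorted by creation time and compare against the failure window:

GET _cat/indices?v&h=index,creation.date.string&s=creation.date:desc

If the creation timestamps cluster around the times you see the rejections, index creation is a likely suspect.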

Hello @Christian_Dahlqvist, OK, I changed minimum_master_nodes to 4. The first thing I did when my issue started was check whether the periods where I was having issues corresponded with new index creation, and they did not. The new indices are created between 8 and 8:30 PM and I'm getting the errors between 2 and 3 AM.