Failed to execute pipeline for a bulk request

Hello everyone. I have a cluster with 20 data nodes, each with 4 CPUs and 16 GB of RAM. I'm indexing on average 300 million documents per day (about 500 GB) into this cluster. The cluster works perfectly during most of the day, but every night between 12 and 2 AM I receive the following indexing failure and lose between 3k and 6k messages:

{"type":"es_rejected_execution_exception","reason":"rejected execution of processing of [400745086][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[mainstor_9102][15]] containing [16] requests, target allocation id: dV4A9_upRuG2ysTOUZ-l9Q, primary term: 1 on EsThreadPoolExecutor[name = D0PGnjE/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@fd3d2ed[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 396013837]]"}

Please note that between 11 PM and 4 AM I'm sending the lowest volume of documents to the cluster. Any help will be greatly appreciated.
Thank you.

Welcome to our community! :smiley:

Why are you losing them? Does your client not retry?

If you can share more about your use case and your client, we might be able to provide some advice. However, at this stage the first thing that comes to mind is to increase the size of your nodes.
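To narrow it down, it may also help to watch the write thread pool on each node around the failure window; assuming the 6.x _cat API, something like:

GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed

Nodes whose rejected count climbs during that window are the ones hitting the queue capacity of 200 shown in the error.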

Hello, thank you for replying. My current setup is the following:
-Packetbeat installed on a group of servers
-Packetbeat data is sent to a Kafka queue
-Elasticsearch running with Graylog, and Graylog is set up as the consumer for the Kafka queue
Graylog is not retrying the messages, which is why I'm losing them.

You will need to speak to the Graylog devs about that then, sorry; it's outside our stack.

Hello, thank you for the help. The issue I'm worried about is not that we aren't retrying the messages (I will check that with the Graylog team), but that we are unable to process that bulk of messages even though we are not processing many requests at the time the issue happens. I added 5 more nodes to the cluster to see if that made the errors go away, but I'm still getting the same issue. Any thoughts on why Elasticsearch could not index them?

What is the output from the _cluster/stats?pretty&human API?

Hello @warkolm this is what I get:

{
  "_nodes" : {
    "total" : 32,
    "successful" : 32,
    "failed" : 0
  },
  "cluster_name" : "graylog",
  "cluster_uuid" : "Nz1qys96TJC03GRDgC5Ekw",
  "timestamp" : 1619707741083,
  "status" : "green",
  "indices" : {
    "count" : 149,
    "shards" : {
      "total" : 1358,
      "primaries" : 1289,
      "replication" : 0.05352986811481769,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 25,
          "avg" : 9.114093959731544
        },
        "primaries" : {
          "min" : 1,
          "max" : 25,
          "avg" : 8.651006711409396
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.46308724832214765
        }
      }
    },
    "docs" : {
      "count" : 8434812284,
      "deleted" : 231075
    },
    "store" : {
      "size" : "13.3tb",
      "size_in_bytes" : 14650759159148
    },
    "fielddata" : {
      "memory_size" : "44.8mb",
      "memory_size_in_bytes" : 47063968,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "2.4gb",
      "memory_size_in_bytes" : 2596493422,
      "total_count" : 94168407,
      "hit_count" : 12818361,
      "miss_count" : 81350046,
      "cache_size" : 17495,
      "cache_count" : 81752,
      "evictions" : 64257
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 13523,
      "memory" : "17.7gb",
      "memory_in_bytes" : 19083626555,
      "terms_memory" : "13.3gb",
      "terms_memory_in_bytes" : 14307013655,
      "stored_fields_memory" : "3.6gb",
      "stored_fields_memory_in_bytes" : 3925116256,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "135.7mb",
      "norms_memory_in_bytes" : 142341504,
      "points_memory" : "512.5mb",
      "points_memory_in_bytes" : 537494720,
      "doc_values_memory" : "163.7mb",
      "doc_values_memory_in_bytes" : 171660420,
      "index_writer_memory" : "1.2gb",
      "index_writer_memory_in_bytes" : 1366313848,
      "version_map_memory" : "11.2mb",
      "version_map_memory_in_bytes" : 11763710,
      "fixed_bit_set" : "2.7mb",
      "fixed_bit_set_memory_in_bytes" : 2848648,
      "max_unsafe_auto_id_timestamp" : 1619654402073,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 32,
      "data" : 25,
      "coordinating_only" : 0,
      "master" : 7,
      "ingest" : 25
    },
    "versions" : [
      "6.8.15",
      "6.8.10",
      "6.8.14",
      "6.8.5"
    ],
    "os" : {
      "available_processors" : 114,
      "allocated_processors" : 114,
      "names" : [
        {
          "name" : "Linux",
          "count" : 32
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 16.04.5 LTS",
          "count" : 29
        },
        {
          "pretty_name" : "Ubuntu 16.04.6 LTS",
          "count" : 3
        }
      ],
      "mem" : {
        "total" : "881.8gb",
        "total_in_bytes" : 946920513536,
        "free" : "85.9gb",
        "free_in_bytes" : 92237348864,
        "used" : "795.9gb",
        "used_in_bytes" : 854683164672,
        "free_percent" : 10,
        "used_percent" : 90
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 203
      },
      "open_file_descriptors" : {
        "min" : 939,
        "max" : 2546,
        "avg" : 1324
      }
    },
    "jvm" : {
      "max_uptime" : "26.9d",
      "max_uptime_in_millis" : 2325380677,
      "versions" : [
        {
          "version" : "1.8.0_292",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.292-b10",
          "vm_vendor" : "Private Build",
          "count" : 5
        },
        {
          "version" : "1.8.0_282",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.282-b08",
          "vm_vendor" : "Private Build",
          "count" : 27
        }
      ],
      "mem" : {
        "heap_used" : "184.9gb",
        "heap_used_in_bytes" : 198548860736,
        "heap_max" : "469gb",
        "heap_max_in_bytes" : 503665000448
      },
      "threads" : 2199
    },
    "fs" : {
      "total" : "24tb",
      "total_in_bytes" : 26486726037504,
      "free" : "10.6tb",
      "free_in_bytes" : 11655318769664,
      "available" : "10.5tb",
      "available_in_bytes" : 11654781898752
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 32
      },
      "http_types" : {
        "security4" : 32
      }
    }
  }
}

Thanks. What's the output from _cat/nodes/?v&h=ip,v?

Hello @warkolm, sorry for the delayed response. That query returns the list of IPs in my Elasticsearch cluster.

It looks like you have nodes on many different versions, which can cause problems as shards cannot be relocated from a node with a newer version to one with an older version. I would therefore recommend upgrading all nodes to exactly the same version.

It looks like a lot of shards do not have any replica configured, which can cause problems in case of node failures or even transient issues, e.g. a long GC. I would recommend having replica shards configured at least for all indices you are actively indexing into.
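As a sketch, replicas can be enabled on the active indices via the update settings API (the mainstor_* index pattern here is assumed from the error message; adjust to your actual naming):

PUT mainstor_*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}

Note this roughly doubles disk usage for those indices, so check free space first, and set the same value in the index template so newly created indices inherit it.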

I am not familiar with Graylog, so I cannot help there. It would however be useful to know how many indices and shards you are actively indexing into at any point. Are you using time-based indices, with or without rollover? As you are using Elasticsearch 6.8, what is discovery.zen.minimum_master_nodes set to? (It should be 4 with 7 master-eligible nodes.)

Hello @Christian_Dahlqvist, the nodes with different versions are the master-eligible nodes, not the data nodes; I will be upgrading them so they all run the same Elasticsearch version. All 25 data nodes are on the same version. discovery.zen.minimum_master_nodes is set to 1 with 7 master-eligible nodes. I reached out to the Graylog team and they do retry the messages. I'm using time-based indices with rollover; the rollover period is set to 24h. I'm indexing data into 50 indices with 1 shard per index and a retention policy of 30 days.

Having minimum_master_nodes set to 1 with 7 master-eligible nodes means that your cluster is misconfigured. This could lead to split-brain scenarios and data loss. You should fix this ASAP.
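On 6.8 this setting is dynamic, so as a sketch it can be corrected without a restart (4 being the quorum for 7 master-eligible nodes):

PUT _cluster/settings
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 4
  }
}

It should also be set in elasticsearch.yml on each node so the value survives a full cluster restart.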

If you are using daily time-based indices, a lot of indices may need to be created over a short period of time, which can lead to a lot of cluster state updates. If these are slow, e.g. due to a large cluster state and a large cluster to which the updates need to be propagated, it could cause issues around the times when the indices are created. Check whether the periods where you are having issues correspond to when new indices are created.
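One way to check, assuming the 6.x _cat API: list indices sorted by creation time and compare against the failure window:

GET _cat/indices?v&h=index,creation.date.string&s=creation.date:desc

If the creation timestamps cluster around the times you see the rejections, index creation is a likely suspect.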

Hello @Christian_Dahlqvist, OK, I changed minimum_master_nodes to 4. The first thing I did when my issue started was check whether the periods where I was having issues corresponded with new index creation, and they did not. The new indices are created between 8 and 8:30 PM and I'm getting the errors between 2 and 3 AM.