Process Rejection Question

Hi, we're operating an ELK cluster on a single host. The current setup includes:
3 Elasticsearch node containers (1 master/no-data, 2 master/data)
2 Logstash containers (one for application logs, one for container logs)
2 Kibana containers (1 for administrative purposes only)

I noticed that we get process rejections on the write thread pool, but only on the node 3 container. If I understand correctly, this happens when the CPU can't keep up with the queue. We are using 2x Intel Xeon E5-2620 processors on this host. Each Elasticsearch node has a 16GB heap, each Logstash instance has a 4GB heap, and there is still plenty of RAM left over after the Logstash and Elasticsearch heaps.

I have more shards than I normally should, so I suspect that might be contributing to the problem on these nodes in particular. We changed our indexing scheme, but it will take about a month to settle. I've thought of a couple of solutions but am not sure how they will work, so I'd like your input on the following:

  1. If I add 2 more data node containers to this cluster and keep the total shard count for each index at 2, will it relieve some pressure as the shards spread evenly across all nodes?
  2. Is it possible to increase the thread count for the write thread pool? If yes, where? I don't think increasing or decreasing the queue size would help here, so I'd like to know if I can change the thread count directly.
  3. When the new index structure settles, as mentioned above, will it improve this situation?
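For context, this is how I'm watching the rejections; the host and port are placeholders for our actual endpoint, and as far as I can tell from the docs, `thread_pool.write.size` would be a static `elasticsearch.yml` setting rather than a dynamic cluster setting.

```
# Show write thread pool size, queue depth, and rejection counts per node.
# Host/port below are placeholders for our setup.
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,size,queue,rejected,completed'
```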

Thanks in advance.

What is the full output of the cluster stats API? Which version of Elasticsearch are you using? How many indices and shards are you actively indexing into? How did this change with your altered sharding strategy?

We had around 90 indices rolling over daily. Now half of those indices roll over monthly, more than half of the remainder roll over weekly, and the rest stay on daily rolls since they are huge indices. I also reindexed the very small indices into a single index to decrease the count, but for most of the indices I must wait for them to be deleted when their lifecycle completes.
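We drive the rollovers with Open Distro's Index State Management. A minimal sketch of the kind of policy we use for the monthly indices (the description, ages, and retention here are illustrative, not our exact values):

```json
{
  "policy": {
    "description": "Roll over monthly, delete after retention (example values)",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_index_age": "30d" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ]
  }
}
```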

Cluster stats output:

"status" : "green",
  "indices" : {
    "count" : 1862,
    "shards" : {
      "total" : 3724,
      "primaries" : 1862,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 6352361386,
      "deleted" : 566326
    },
    "store" : {
      "size_in_bytes" : 2528196733514
    },
    "fielddata" : {
      "memory_size_in_bytes" : 158088104,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 195442208,
      "total_count" : 15025463,
      "hit_count" : 1801538,
      "miss_count" : 13223925,
      "cache_size" : 16991,
      "cache_count" : 59037,
      "evictions" : 42046
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 20331,
      "memory_in_bytes" : 192590236,
      "terms_memory_in_bytes" : 138837504,
      "stored_fields_memory_in_bytes" : 23589352,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 17520448,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 12642932,
      "index_writer_memory_in_bytes" : 3371199768,
      "version_map_memory_in_bytes" : 354,
      "fixed_bit_set_memory_in_bytes" : 14168,
      "max_unsafe_auto_id_timestamp" : 1615876202047,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "boolean",
          "count" : 1877,
          "index_count" : 1167
        },
        {
          "name" : "date",
          "count" : 2226,
          "index_count" : 1859
        },
        {
          "name" : "double",
          "count" : 17,
          "index_count" : 5
        },
        {
          "name" : "float",
          "count" : 14,
          "index_count" : 14
        },
        {
          "name" : "geo_point",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "integer",
          "count" : 816,
          "index_count" : 70
        },
        {
          "name" : "ip",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 28151,
          "index_count" : 1862
        },
        {
          "name" : "long",
          "count" : 6776,
          "index_count" : 1462
        },
        {
          "name" : "nested",
          "count" : 75,
          "index_count" : 73
        },
        {
          "name" : "object",
          "count" : 4359,
          "index_count" : 1229
        },
        {
          "name" : "text",
          "count" : 28246,
          "index_count" : 1860
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 2,
      "ingest" : 3,
      "master" : 3,
      "remote_cluster_client" : 3
    },
    "versions" : [
      "7.7.0"
    ],
    "os" : {
      "available_processors" : 3,
      "allocated_processors" : 3,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 3
        }
      ],
      "mem" : {
        "total_in_bytes" : 607702573056,
        "free_in_bytes" : 368752975872,
        "used_in_bytes" : 238949597184,
        "free_percent" : 61,
        "used_percent" : 39
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 8
      },
      "open_file_descriptors" : {
        "min" : 342,
        "max" : 6652,
        "avg" : 4541
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 6610367046,
      "versions" : [
        {
          "version" : "12.0.2",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "12.0.2+10",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 17384111424,
        "heap_max_in_bytes" : 51513458688
      },
      "threads" : 404
    },
    "fs" : {
      "total_in_bytes" : 71984744497152,
      "free_in_bytes" : 53391446179840,
      "available_in_bytes" : 53391446179840
    },
    "plugins" : [
      {
        "name" : "opendistro_alerting",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Amazon OpenDistro alerting plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.alerting.AlertingPlugin",
        "extended_plugins" : [
          "lang-painless"
        ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_performance_analyzer",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Performance Analyzer Plugin",
        "classname" : "com.amazon.opendistro.elasticsearch.performanceanalyzer.PerformanceAnalyzerPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-knn",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch KNN",
        "classname" : "com.amazon.opendistroforelasticsearch.knn.plugin.KNNPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_security",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Provide access control related features for Elasticsearch 7",
        "classname" : "com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-job-scheduler",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch job schduler plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.jobscheduler.JobSchedulerPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_sql",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch SQL",
        "classname" : "com.amazon.opendistroforelasticsearch.sql.plugin.SqlPlug",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-anomaly-detection",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Amazon opendistro elasticsearch anomaly detector plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.ad.AnomalyDetectorPlugin",
        "extended_plugins" : [
          "lang-painless",
          "opendistro-job-scheduler"
        ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_index_management",
        "version" : "1.8.0.0",
        "elasticsearch_version" : "7.7.0",
        "java_version" : "1.8",
        "description" : "Open Distro Index State Management Plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.indexstatemanagement.IndexStateManagementPlugin",
        "extended_plugins" : [
          "opendistro-job-scheduler"
        ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "com.amazon.opendistroforelasticsearch.security.ssl.http.netty.OpenDistroSecuritySSLNettyTransport" : 3
      },
      "http_types" : {
        "com.amazon.opendistroforelasticsearch.security.http.OpenDistroSecurityHttpServerTransport" : 3
      }
    },
    "discovery_types" : {
      "zen" : 3
    },
    "packaging_types" : [
      {
        "flavor" : "oss",
        "type" : "tar",
        "count" : 3
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 0,
      "processor_stats" : { }
    }
  }

}

Rather than switching the small indices to monthly rollovers, it would probably be better to consolidate them into fewer indices, as that would reduce the number of shards being written to.
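If it helps, several small indices can be merged with the `_reindex` API, which accepts a list of source indices. A sketch (the index names here are made up; use whatever naming fits your data):

```
curl -s -X POST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": ["small-app-a-*", "small-app-b-*"] },
  "dest":   { "index": "apps-consolidated" }
}'
```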

How are you indexing data into Elasticsearch? Beats? Logstash?

On the first part: yes, we are also merging related indices into a single index in addition to moving them to monthly/weekly rolls. The overall process is designed to decrease the active shard count to around 1000 once it completely settles.

We are indexing data via Logstash: one instance for application logs and one for Docker container logs using the gelf input.
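The container-log pipeline looks roughly like this (simplified; the port is the GELF default, and the host and index pattern are placeholders for our real values):

```
input {
  gelf {
    port => 12201   # default GELF UDP port; Docker ships logs here via its gelf log driver
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]       # placeholder endpoint
    index => "docker-logs-%{+YYYY.MM}"       # illustrative monthly index pattern
  }
}
```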