Data nodes not ingesting new documents for over 10 min

Hi, I've got about 10 data nodes (the number fluctuates throughout the day) that are triggering my "no new documents" alert, which fires when a node doesn't ingest any documents for over 10 minutes.
When checking the logs on these nodes, they all have the same last attempted "BulkShardRequest" in common: on 9th Jan, for an index that doesn't seem to exist, ".monitoring-es-7-2024.01.09". I can, however, see the monitoring indices from 13/01/2024 until 19/01/2024.
The stack trace looks something like this

...
{"type": "server", "timestamp": "2024-01-09T09:43:31,959Z", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "elasticsearch", "node.name": "elasticsearch-data-8", "message": "unexpected error while indexing monitoring document", "cluster.uuid": "**", "node.id": "**" ,1/9/2024 9:43:31 AM "stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2024.01.09][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2024.01.09][0]] containing [index {[.monitoring-es-7-2024.01.09][_doc]
...
"Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-7-2024.01.09][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2024.01.09][0]] containing [index {[.monitoring-es-7-2024.01.09][_doc].[**],  

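To check whether every alerting node is failing on the same index and shard, the names can be pulled out of the exception text. This is only a minimal sketch: the regular expression is based solely on the message format in the excerpt above, and `unavailable_shards` is a hypothetical helper name, not part of any Elasticsearch tooling.

```python
import re

# Matches "[index-name][shard] primary shard is not active", the wording
# seen in the UnavailableShardsException excerpt (assumed to be stable).
PATTERN = re.compile(r"\[([^\[\]]+)\]\[(\d+)\] primary shard is not active")


def unavailable_shards(log_text):
    """Return the set of (index, shard) pairs reported without an active primary."""
    return {(m.group(1), int(m.group(2))) for m in PATTERN.finditer(log_text)}


sample = (
    "ExportException: UnavailableShardsException[[.monitoring-es-7-2024.01.09][0] "
    "primary shard is not active Timeout: [1m]"
)
print(unavailable_shards(sample))
```

Running this over the logs of each alerting node would show whether they all stop at the same shard of the same monitoring index.
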
This cluster has been under disk usage pressure recently and had more data nodes joining the cluster on that day (9th Jan).
I have searched online but didn't find much information on how to diagnose why documents were not ingested for a window of time. Could someone share some pointers on this?

Thanks

What is the full output of the cluster stats API?

Hi @Christian_Dahlqvist , here it is:

{
  "_nodes" : {
    "total" : 50,
    "successful" : 50,
    "failed" : 0
  },
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "**",
  "timestamp" : 1705914037391,
  "status" : "green",
  "indices" : {
    "count" : 2033,
    "shards" : {
      "total" : 17304,
      "primaries" : 5823,
      "replication" : 1.9716640906749099,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 9,
          "avg" : 8.511559272011805
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 2.864240039350713
        },
        "replication" : {
          "min" : 1.0,
          "max" : 2.0,
          "avg" : 1.927693064436793
        }
      }
    },
    "docs" : {
      "count" : 78559769016,
      "deleted" : 8368870
    },
    "store" : {
      "size" : "121.3tb",
      "size_in_bytes" : 133450966135498,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "335.3mb",
      "memory_size_in_bytes" : 351655400,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "1.9gb",
      "memory_size_in_bytes" : 2078123530,
      "total_count" : 10179686920,
      "hit_count" : 1278113058,
      "miss_count" : 8901573862,
      "cache_size" : 758575,
      "cache_count" : 5395156,
      "evictions" : 4636581
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 335828,
      "memory" : "12.1gb",
      "memory_in_bytes" : 12992350216,
      "terms_memory" : "11.7gb",
      "terms_memory_in_bytes" : 12592890848,
      "stored_fields_memory" : "222.5mb",
      "stored_fields_memory_in_bytes" : 233339840,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "1.3mb",
      "norms_memory_in_bytes" : 1381568,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "157.1mb",
      "doc_values_memory_in_bytes" : 164737960,
      "index_writer_memory" : "745.1mb",
      "index_writer_memory_in_bytes" : 781372528,
      "version_map_memory" : "363.7kb",
      "version_map_memory_in_bytes" : 372492,
      "fixed_bit_set" : "112.2mb",
      "fixed_bit_set_memory_in_bytes" : 117686024,
      "max_unsafe_auto_id_timestamp" : 1705903294120,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 484,
          "index_count" : 22
        },
        {
          "name" : "binary",
          "count" : 9,
          "index_count" : 1
        },
        {
          "name" : "boolean",
          "count" : 2622,
          "index_count" : 288
        },
        {
          "name" : "byte",
          "count" : 68,
          "index_count" : 68
        },
        {
          "name" : "constant_keyword",
          "count" : 39,
          "index_count" : 13
        },
        {
          "name" : "date",
          "count" : 6338,
          "index_count" : 2032
        },
        {
          "name" : "double",
          "count" : 323,
          "index_count" : 22
        },
        {
          "name" : "flattened",
          "count" : 363,
          "index_count" : 23
        },
        {
          "name" : "float",
          "count" : 1361,
          "index_count" : 655
        },
        {
          "name" : "geo_point",
          "count" : 2093,
          "index_count" : 1917
        },
        {
          "name" : "half_float",
          "count" : 3846,
          "index_count" : 1909
        },
        {
          "name" : "integer",
          "count" : 181,
          "index_count" : 9
        },
        {
          "name" : "ip",
          "count" : 3316,
          "index_count" : 1917
        },
        {
          "name" : "keyword",
          "count" : 258524,
          "index_count" : 2032
        },
        {
          "name" : "long",
          "count" : 19320,
          "index_count" : 1739
        },
        {
          "name" : "nested",
          "count" : 228,
          "index_count" : 35
        },
        {
          "name" : "object",
          "count" : 54611,
          "index_count" : 2033
        },
        {
          "name" : "scaled_float",
          "count" : 13,
          "index_count" : 13
        },
        {
          "name" : "short",
          "count" : 1045,
          "index_count" : 77
        },
        {
          "name" : "text",
          "count" : 205381,
          "index_count" : 1737
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 50,
      "coordinating_only" : 3,
      "data" : 44,
      "data_cold" : 44,
      "data_content" : 44,
      "data_hot" : 44,
      "data_warm" : 44,
      "ingest" : 44,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 44,
      "voting_only" : 0
    },
    "versions" : [
      "7.10.2"
    ],
    "os" : {
      "available_processors" : 728,
      "allocated_processors" : 728,
      "names" : [
        {
          "name" : "Linux",
          "count" : 50
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 8",
          "count" : 50
        }
      ],
      "mem" : {
        "total" : "952gb",
        "total_in_bytes" : 1022202216448,
        "free" : "15gb",
        "free_in_bytes" : 16172204032,
        "used" : "936.9gb",
        "used_in_bytes" : 1006030012416,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 92
      },
      "open_file_descriptors" : {
        "min" : 1446,
        "max" : 5539,
        "avg" : 4903
      }
    },
    "jvm" : {
      "max_uptime" : "179.2d",
      "max_uptime_in_millis" : 15483945118,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 50
        }
      ],
      "mem" : {
        "heap_used" : "344.1gb",
        "heap_used_in_bytes" : 369518162648,
        "heap_max" : "761gb",
        "heap_max_in_bytes" : 817117528064
      },
      "threads" : 6653
    },
    "fs" : {
      "total" : "172.8tb",
      "total_in_bytes" : 190091910139904,
      "free" : "50.3tb",
      "free_in_bytes" : 55367672315904,
      "available" : "50.3tb",
      "available_in_bytes" : 55366883786752
    },
    "plugins" : [
      {
        "name" : "discovery-ec2",
        "version" : "7.10.2",
        "elasticsearch_version" : "7.10.2",
        "java_version" : "1.8",
        "description" : "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
        "classname" : "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "repository-s3",
        "version" : "7.10.2",
        "elasticsearch_version" : "7.10.2",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 50
      },
      "http_types" : {
        "security4" : 50
      }
    },
    "discovery_types" : {
      "zen" : 50
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 50
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}


I have a few observations about your cluster:

This is a very old version that has been EOL for a long time. I recommend you upgrade to at least version 7.17. There are a lot of fixes and performance improvements that you are missing.

When running Elasticsearch in production, the best practice is to assign no more than 50% of the RAM available to Elasticsearch on the host to the heap. It looks like your ratio is around 80% rather than 50%, which can cause a lot of problems.

It looks like each data node holds almost 2.8TB of data with only around 20GB of RAM. That is roughly 140GB of data on disk for each GB of RAM. This is a very high ratio and may be causing problems, especially on such an old version.
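
For reference, these ratios can be derived from the cluster stats output posted above. The following is a rough sketch rather than an official calculation: `sizing_summary` is a hypothetical helper, it uses binary units, and it assumes RAM is spread evenly across all 50 nodes and data evenly across the 44 data nodes, which the stats alone cannot confirm.

```python
def sizing_summary(stats):
    """Derive rough per-node sizing ratios from a cluster stats response."""
    nodes = stats["nodes"]
    total_nodes = nodes["count"]["total"]
    data_nodes = nodes["count"]["data"]
    heap_max = nodes["jvm"]["mem"]["heap_max_in_bytes"]
    ram_total = nodes["os"]["mem"]["total_in_bytes"]
    store = stats["indices"]["store"]["size_in_bytes"]
    shards = stats["indices"]["shards"]["total"]

    # Assumption: RAM is split evenly over all nodes, data over data nodes only.
    ram_per_node = ram_total / total_nodes
    store_per_data_node = store / data_nodes
    return {
        "heap_to_ram": heap_max / ram_total,
        "store_tb_per_data_node": store_per_data_node / 1024**4,
        "ram_gb_per_node": ram_per_node / 1024**3,
        "disk_to_ram": store_per_data_node / ram_per_node,
        "shards_per_data_node": shards / data_nodes,
    }


# Values copied from the cluster stats output earlier in this thread.
stats = {
    "indices": {
        "shards": {"total": 17304},
        "store": {"size_in_bytes": 133450966135498},
    },
    "nodes": {
        "count": {"total": 50, "data": 44},
        "os": {"mem": {"total_in_bytes": 1022202216448}},
        "jvm": {"mem": {"heap_max_in_bytes": 817117528064}},
    },
}

for name, value in sizing_summary(stats).items():
    print(f"{name}: {value:.2f}")
```

With the posted figures this gives a heap-to-RAM ratio of about 0.80, about 2.76 TB of data per data node, about 19 GB of RAM per node, and roughly 393 shards per data node. The disk-to-RAM ratio lands in the same range as the rounded estimate above.
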

I do have a few additional questions:

  • How evenly is data spread out across the data nodes?
  • Are there nodes that have a larger portion of high-traffic indices than others? If so, does this correlate to the nodes having issues?
  • What type of storage are you using?

Hi @Christian_Dahlqvist , thanks for your reply.

  • Upgrade to 7.17: yes, we've got a short-term plan for this cluster, but we believe a rolling upgrade should be worth the work, so it's in the pipeline.

  • Heap max: I don't think this was defined for this cluster. Checking the docs, it looks like heap size settings are automatically sized by ES? I'd appreciate it if you could point me to any docs I might be missing.

Now answering your questions:

  • How evenly is data spread out across the data nodes?
    It appears to be well spread: we've got about 400 shards per data node, and each node uses 2.5-3 TB of disk. The nodes that are alerting are not using more disk space than the others.

  • Are there nodes that have a larger portion of high-traffic indices than others? If so, does this correlate to the nodes having issues?
    I've been through the different stats we have for the nodes, and in all of them usage is well spread between nodes. Unfortunately, there's no obvious correlation between the current stats and the alerting nodes.

  • What type of storage are you using?
    We're using EBS gp2 volumes.
