Elastic stuck and flooded with "monitoring execution failed"

Hello,
ES 5.6.10 running on a Kubernetes v1.9 cluster
xpack.security disabled
16 data nodes
There is a large number of unassigned shards, and the count does not go down at all.

[2020-09-08T11:15:40,655][WARN ][o.e.x.m.MonitoringService] [es-data05-0] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBu

Please advise on what can be checked.
Thanks in advance


5.X is EOL, you should really upgrade as a matter of urgency.

We are working with Graylog 2.4.6-1, which requires ES 5.
Is it possible to migrate data from ES 5 to ES 6?
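
In principle yes. Snapshots taken on 5.x can be restored into a 6.x cluster, and if a separate 6.x cluster is available the data can also be copied over with reindex from remote. A rough sketch, with host names and index name as placeholders (the source host also has to be whitelisted via reindex.remote.whitelist on the 6.x side):

curl -XPOST 'http://new-es6-cluster:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": { "host": "http://old-es5-cluster:9200" },
    "index": "graylog_0"
  },
  "dest": { "index": "graylog_0" }
}'

Whether Graylog 2.4 itself can talk to a 6.x cluster is a separate question for the Graylog side.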

What is the full output of the cluster stats API?
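
For reference, something like this should return it (host and port are placeholders):

curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'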

{
  "_nodes" : {
    "total" : 19,
    "successful" : 19,
    "failed" : 0
  },
  "cluster_name" : "graylog-es",
  "timestamp" : 1599627091279,
  "status" : "red",
  "indices" : {
    "count" : 121,
    "shards" : {
      "total" : 1716,
      "primaries" : 1090,
      "replication" : 0.5743119266055046,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 28,
          "avg" : 14.181818181818182
        },
        "primaries" : {
          "min" : 1,
          "max" : 16,
          "avg" : 9.008264462809917
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.6666666666666667,
          "avg" : 0.6750140126173181
        }
      }
    },
    "docs" : {
      "count" : 10043652598,
      "deleted" : 1897511
    },
    "store" : {
      "size_in_bytes" : 18682197963379,
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 5817379,
      "total_count" : 40108822,
      "hit_count" : 38750480,
      "miss_count" : 1358342,
      "cache_size" : 84688,
      "cache_count" : 163098,
      "evictions" : 78410
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4012,
      "memory_in_bytes" : 18409026900,
      "terms_memory_in_bytes" : 12330196178,
      "stored_fields_memory_in_bytes" : 5292088712,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1138176,
      "points_memory_in_bytes" : 430881562,
      "doc_values_memory_in_bytes" : 354722272,
      "index_writer_memory_in_bytes" : 9037368,
      "version_map_memory_in_bytes" : 492,
      "fixed_bit_set_memory_in_bytes" : 1552,
      "max_unsafe_auto_id_timestamp" : 1599626782852,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 19,
      "data" : 16,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 16
    },
    "versions" : [
      "5.6.10"
    ],
    "os" : {
      "available_processors" : 1104,
      "allocated_processors" : 608,
      "names" : [
        {
          "name" : "Linux",
          "count" : 19
        }
      ],
      "mem" : {
        "total_in_bytes" : 10273935179776,
        "free_in_bytes" : 1585044611072,
        "used_in_bytes" : 8688890568704,
        "free_percent" : 15,
        "used_percent" : 85
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 6
      },
      "open_file_descriptors" : {
        "min" : 929,
        "max" : 3900,
        "avg" : 2034
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 131278055,
      "versions" : [
        {
          "version" : "1.8.0_171",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.171-b10",
          "vm_vendor" : "Oracle Corporation",
          "count" : 19
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 81816654152,
        "heap_max_in_bytes" : 380166144000
      },
      "threads" : 5435
    },
    "fs" : {
      "total_in_bytes" : 39013566181376,
      "free_in_bytes" : 12496509353984,
      "available_in_bytes" : 12496509353984,
      "spins" : "true"
    },
    "plugins" : [
      {
        "name" : "ingest-user-agent",
        "version" : "5.6.10",
        "description" : "Ingest processor that extracts information from a user agent",
        "classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "ingest-geoip",
        "version" : "5.6.10",
        "description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
        "classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "x-pack",
        "version" : "5.6.10",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller" : true
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 19
      },
      "http_types" : {
        "netty4" : 19
      }
    }
  }
}

As far as I can see, that does not look too bad. Have you looked at the logs to try and identify what happened that led to the shards being unassigned? I have no experience with Graylog, but would recommend you try to upgrade.
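
To see why a specific shard is unassigned, the allocation explain API (available in 5.x) is usually the quickest route; with no request body it reports on the first unassigned shard it finds, and a specific shard can be targeted like this (host, index name and shard number below are placeholders):

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": "graylog_123",
  "shard": 0,
  "primary": true
}'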

It's running on vanilla Kubernetes. Kubernetes reports NodeHasDiskPressure on one of the physical servers and evicts the Elasticsearch node, which leads to the unassigned shards.
Regarding the disk pressure: /var/lib/docker (a separately mounted disk) shows as full in df -k, but according to du -sk it is hardly used at all.

Events:
  Type     Reason                 Age               From                   Message
  ----     ------                 ----              ----                   -------
  Warning  FreeDiskSpaceFailed    14m               kubelet, sec-logger06  failed to garbage collect required amount of images. Wanted to free 24355060940 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed    9m                kubelet, sec-logger06  failed to garbage collect required amount of images. Wanted to free 45784526028 bytes, but freed 0 bytes
  Warning  ImageGCFailed          9m                kubelet, sec-logger06  failed to garbage collect required amount of images. Wanted to free 45784526028 bytes, but freed 0 bytes
  Normal   NodeHasDiskPressure    7m (x13 over 1d)  kubelet, sec-logger06  Node sec-logger06 status is now: NodeHasDiskPressure
  Warning  EvictionThresholdMet   7m (x15 over 1d)  kubelet, sec-logger06  Attempting to reclaim imagefs
  Normal   NodeHasNoDiskPressure  2m (x15 over 1d)  kubelet, sec-logger06  Node sec-logger06 status is now: NodeHasNoDiskPressure
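
A common reason for df reporting a full filesystem while du sees little data is files that have been deleted on disk but are still held open by a running process (container log files under /var/lib/docker are a frequent culprit); the space is only released when the process closes them. A quick check, assuming lsof is available on the host:

sudo lsof -nP +L1 | grep /var/lib/docker    # open files with link count 0, i.e. deleted but still held open

Restarting the process that holds them (often dockerd or a container) frees the space.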

It sounds like the storage you are using may not be performant enough. What type of disk/storage are you using?

Hitachi SAN

curl -XGET 10.99.40.241:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
This shows a lot of primary shards UNASSIGNED with reasons NODE_LEFT and DANGLING_INDEX_IMPORTED.
Is there a way out of this?
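
A couple of things that are often tried for this state, sketched with placeholder index and node names (the second one can lose data, so it is a last resort):

# Retry allocations that failed too many times (e.g. after NODE_LEFT):
curl -XPOST '10.99.40.241:9200/_cluster/reroute?retry_failed=true&pretty'

# Last resort for a primary whose only remaining copy is stale; accepts data loss on that shard:
curl -XPOST '10.99.40.241:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "graylog_123",
        "shard": 0,
        "node": "es-data05-0",
        "accept_data_loss": true
      }
    }
  ]
}'

Running allocation explain on one of the affected shards first should make it clear which case applies.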
