Elastic stuck and flooded with "monitoring execution failed"

Hello,
ES 5.6.10 running on a Kubernetes v1.9 cluster
xpack.security disabled
16 data nodes
There is a large number of unassigned shards, and the count does not go down at all.

[2020-09-08T11:15:40,655][WARN ][o.e.x.m.MonitoringService] [es-data05-0] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBu

Please advise on what can be checked.
Thanks in advance


5.X is EOL, you should really upgrade as a matter of urgency.

We are working with Graylog 2.4.6-1, which requires ES 5.
Is it possible to migrate data from ES 5 to ES 6?
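
In principle yes. Snapshots taken on 5.x can be restored into a 6.x cluster, and if a separate 6.x cluster is available the data can also be copied over with reindex from remote. A rough sketch, with host names and index name as placeholders (the source host also has to be whitelisted via reindex.remote.whitelist on the 6.x side):

curl -XPOST 'http://new-es6-cluster:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": { "host": "http://old-es5-cluster:9200" },
    "index": "graylog_0"
  },
  "dest": { "index": "graylog_0" }
}'

Whether Graylog 2.4 itself can talk to a 6.x cluster is a separate question for the Graylog side.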

What is the full output of the cluster stats API?
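
For reference, something like this should return it (host and port are placeholders):

curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'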

{
  "_nodes" : {
    "total" : 19,
    "successful" : 19,
    "failed" : 0
  },
  "cluster_name" : "graylog-es",
  "timestamp" : 1599627091279,
  "status" : "red",
  "indices" : {
    "count" : 121,
    "shards" : {
      "total" : 1716,
      "primaries" : 1090,
      "replication" : 0.5743119266055046,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 28,
          "avg" : 14.181818181818182
        },
        "primaries" : {
          "min" : 1,
          "max" : 16,
          "avg" : 9.008264462809917
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.6666666666666667,
          "avg" : 0.6750140126173181
        }
      }
    },
    "docs" : {
      "count" : 10043652598,
      "deleted" : 1897511
    },
    "store" : {
      "size_in_bytes" : 18682197963379,
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 5817379,
      "total_count" : 40108822,
      "hit_count" : 38750480,
      "miss_count" : 1358342,
      "cache_size" : 84688,
      "cache_count" : 163098,
      "evictions" : 78410
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4012,
      "memory_in_bytes" : 18409026900,
      "terms_memory_in_bytes" : 12330196178,
      "stored_fields_memory_in_bytes" : 5292088712,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1138176,
      "points_memory_in_bytes" : 430881562,
      "doc_values_memory_in_bytes" : 354722272,
      "index_writer_memory_in_bytes" : 9037368,
      "version_map_memory_in_bytes" : 492,
      "fixed_bit_set_memory_in_bytes" : 1552,
      "max_unsafe_auto_id_timestamp" : 1599626782852,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 19,
      "data" : 16,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 16
    },
    "versions" : [
      "5.6.10"
    ],
    "os" : {
      "available_processors" : 1104,
      "allocated_processors" : 608,
      "names" : [
        {
          "name" : "Linux",
          "count" : 19
        }
      ],
      "mem" : {
        "total_in_bytes" : 10273935179776,
        "free_in_bytes" : 1585044611072,
        "used_in_bytes" : 8688890568704,
        "free_percent" : 15,
        "used_percent" : 85
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 6
      },
      "open_file_descriptors" : {
        "min" : 929,
        "max" : 3900,
        "avg" : 2034
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 131278055,
      "versions" : [
        {
          "version" : "1.8.0_171",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.171-b10",
          "vm_vendor" : "Oracle Corporation",
          "count" : 19
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 81816654152,
        "heap_max_in_bytes" : 380166144000
      },
      "threads" : 5435
    },
    "fs" : {
      "total_in_bytes" : 39013566181376,
      "free_in_bytes" : 12496509353984,
      "available_in_bytes" : 12496509353984,
      "spins" : "true"
    },
    "plugins" : [
      {
        "name" : "ingest-user-agent",
        "version" : "5.6.10",
        "description" : "Ingest processor that extracts information from a user agent",
        "classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "ingest-geoip",
        "version" : "5.6.10",
        "description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
        "classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "x-pack",
        "version" : "5.6.10",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller" : true
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 19
      },
      "http_types" : {
        "netty4" : 19
      }
    }
  }
}

As far as I can see, that does not look too bad. Have you looked at the logs to try and identify what happened that led to the shards being unassigned? I have no experience with Graylog, but would recommend you try to upgrade.
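
To see why a specific shard is unassigned, the allocation explain API (available in 5.x) is usually the quickest route; with no request body it reports on the first unassigned shard it finds, and a specific shard can be targeted like this (host, index name and shard number below are placeholders):

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": "graylog_123",
  "shard": 0,
  "primary": true
}'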

It's running on vanilla Kubernetes. Kubernetes reports NodeHasDiskPressure on one of the physical servers and evicts the Elasticsearch node, which leads to the unassigned shards.
Regarding the disk pressure: /var/lib/docker (a separately mounted disk) shows as full in df -k, but according to du -sk it is hardly used at all.

Events:
  Type     Reason                 Age               From                   Message
  ----     ------                 ----              ----                   -------
  Warning  FreeDiskSpaceFailed    14m               kubelet, sec-logger06  failed to garbage collect required amount of images. Wanted to free 24355060940 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed    9m                kubelet, sec-logger06  failed to garbage collect required amount of images. Wanted to free 45784526028 bytes, but freed 0 bytes
  Warning  ImageGCFailed          9m                kubelet, sec-logger06  failed to garbage collect required amount of images. Wanted to free 45784526028 bytes, but freed 0 bytes
  Normal   NodeHasDiskPressure    7m (x13 over 1d)  kubelet, sec-logger06  Node sec-logger06 status is now: NodeHasDiskPressure
  Warning  EvictionThresholdMet   7m (x15 over 1d)  kubelet, sec-logger06  Attempting to reclaim imagefs
  Normal   NodeHasNoDiskPressure  2m (x15 over 1d)  kubelet, sec-logger06  Node sec-logger06 status is now: NodeHasNoDiskPressure
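
A common reason for df reporting a full filesystem while du sees little data is files that have been deleted on disk but are still held open by a running process (container log files under /var/lib/docker are a frequent culprit); the space is only released when the process closes them. A quick check, assuming lsof is available on the host:

sudo lsof -nP +L1 | grep /var/lib/docker    # open files with link count 0, i.e. deleted but still held open

Restarting the process that holds them (often dockerd or a container) frees the space.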

It sounds like the storage you are using may not be performant enough. What type of disk/storage are you using?

Hitachi SAN

curl -XGET 10.99.40.241:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
This shows a lot of primary shards UNASSIGNED with reasons NODE_LEFT and DANGLING_INDEX_IMPORTED.
Is there a way out of this?
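
A couple of things that are often tried for this state, sketched with placeholder index and node names (the second one can lose data, so it is a last resort):

# Retry allocations that failed too many times (e.g. after NODE_LEFT):
curl -XPOST '10.99.40.241:9200/_cluster/reroute?retry_failed=true&pretty'

# Last resort for a primary whose only remaining copy is stale; accepts data loss on that shard:
curl -XPOST '10.99.40.241:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "graylog_123",
        "shard": 0,
        "node": "es-data05-0",
        "accept_data_loss": true
      }
    }
  ]
}'

Running allocation explain on one of the affected shards first should make it clear which case applies.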
