"failed shard on node... ...Data too large, data for [<transport_request>] would be" only for 3 most recent .monitoring-es indices

Looking for some direction with this unique issue in our production cluster. Hoping that @HenningAndersen @dadoonet @Christian_Dahlqvist and/or @Armin_Braun might be able to poke in here and offer feedback.
Details/behavior:

  • This only happens to the 3 most recent .monitoring-es indices.

  • This only happens on the production cluster.

  • We do not experience the same behavior in the test cluster. The test cluster is completely green with the exact same configuration, java version, etc (to the best of our knowledge). The only obvious difference is that the test cluster shards are much smaller (400mb vs 4gb in production).

  • If we try POST /_cluster/reroute?retry_failed=true it results in the same 3 indices remaining yellow.

Java

# java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

GET _cluster/stats?human&pretty

{
  "_nodes" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "cluster_name" : "ELK",
  "cluster_uuid" : "xxx",
  "timestamp" : 1581437726376,
  "status" : "yellow",
  "indices" : {
    "count" : 1993,
    "shards" : {
      "total" : 6675,
      "primaries" : 5745,
      "replication" : 0.1618798955613577,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 3.3492222779729053
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 2.882589061716006
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.10135474159558455
        }
      }
    },
    "docs" : {
      "count" : 1872979590,
      "deleted" : 526087
    },
    "store" : {
      "size" : "900.5gb",
      "size_in_bytes" : 966911625537
    },
    "fielddata" : {
      "memory_size" : "6.7gb",
      "memory_size_in_bytes" : 7240630464,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "72.6mb",
      "memory_size_in_bytes" : 76186271,
      "total_count" : 2581869,
      "hit_count" : 336991,
      "miss_count" : 2244878,
      "cache_size" : 5123,
      "cache_count" : 5476,
      "evictions" : 353
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 9796,
      "memory" : "1.6gb",
      "memory_in_bytes" : 1752365455,
      "terms_memory" : "869.3mb",
      "terms_memory_in_bytes" : 911551073,
      "stored_fields_memory" : "720.3mb",
      "stored_fields_memory_in_bytes" : 755351840,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "11mb",
      "norms_memory_in_bytes" : 11637888,
      "points_memory" : "59.9mb",
      "points_memory_in_bytes" : 62903582,
      "doc_values_memory" : "10.4mb",
      "doc_values_memory_in_bytes" : 10921072,
      "index_writer_memory" : "19.8mb",
      "index_writer_memory_in_bytes" : 20774632,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "11.7mb",
      "fixed_bit_set_memory_in_bytes" : 12320792,
      "max_unsafe_auto_id_timestamp" : 1581429474359,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 2,
      "data" : 2,
      "coordinating_only" : 0,
      "master" : 1,
      "ingest" : 1
    },
    "versions" : [
      "6.8.1"
    ],
    "os" : {
      "available_processors" : 12,
      "allocated_processors" : 12,
      "names" : [
        {
          "name" : "Linux",
          "count" : 2
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 2
        }
      ],
      "mem" : {
        "total" : "44.7gb",
        "total_in_bytes" : 48084279296,
        "free" : "558.7mb",
        "free_in_bytes" : 585912320,
        "used" : "44.2gb",
        "used_in_bytes" : 47498366976,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 50
      },
      "open_file_descriptors" : {
        "min" : 9714,
        "max" : 40306,
        "avg" : 25010
      }
    },
    "jvm" : {
      "max_uptime" : "53.5d",
      "max_uptime_in_millis" : 4625220711,
      "versions" : [
        {
          "version" : "1.8.0_191",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.191-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 1
        },
        {
          "version" : "1.8.0_171",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.171-b10",
          "vm_vendor" : "Oracle Corporation",
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used" : "14.2gb",
        "heap_used_in_bytes" : 15307236744,
        "heap_max" : "22.9gb",
        "heap_max_in_bytes" : 24591466496
      },
      "threads" : 444
    },
    "fs" : {
      "total" : "7.3tb",
      "total_in_bytes" : 8122104676352,
      "free" : "5.8tb",
      "free_in_bytes" : 6417517572096,
      "available" : "5.4tb",
      "available_in_bytes" : 6032500862976
    },
 ...
}

GET _cluster/allocation/explain?pretty

{
  "index" : ".monitoring-es-6-2020.02.09",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-02-10T20:53:42.002Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [cXzKDgu0TPqjEF-w9qJURw]: failed recovery, failure RecoveryFailedException[[.monitoring-es-6-2020.02.09][0]: Recovery failed from {ELK-01.hostname.com}{WNppBaICQTKHpO2DAedbpQ}{lKY-IuxnT3auVM50oh_IeA}{ELK-01.hostname.com}{192.168.98.102:9300}{ml.machine_memory=25103704064, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, ml.enabled=true} into {ELK-02.hostname.com}{cXzKDgu0TPqjEF-w9qJURw}{nwT4vhVMSNKC_xthmClT_A}{ELK-02.hostname.com}{192.168.98.103:9300}{ml.machine_memory=22980575232, xpack.installed=true, box_type=warm, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[ELK-01.hostname.com][192.168.98.102:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [144] files with total size of [4.2gb]]; nested: RemoteTransportException[[ELK-02.hostname.com][192.168.98.103:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [8231262357/7.6gb], which is larger than the limit of [8231203635/7.6gb], usages [request=0/0b, fielddata=6623107425/6.1gb, in_flight_requests=1022428/998.4kb, accounting=1607132504/1.4gb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "WNppBaICQTKHpO2DAedbpQ",
      "node_name" : "ELK-01.hostname.com",
      "transport_address" : "192.168.98.102:9300",
      "node_attributes" : {
        "ml.machine_memory" : "25103704064",
        "xpack.installed" : "true",
        "box_type" : "hot",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-02-10T20:53:42.002Z], failed_attempts[5], delayed=false, details[failed shard on node [cXzKDgu0TPqjEF-w9qJURw]: failed recovery, failure RecoveryFailedException[[.monitoring-es-6-2020.02.09][0]: Recovery failed from {ELK-01.hostname.com}{WNppBaICQTKHpO2DAedbpQ}{lKY-IuxnT3auVM50oh_IeA}{ELK-01.hostname.com}{192.168.98.102:9300}{ml.machine_memory=25103704064, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, ml.enabled=true} into {ELK-02.hostname.com}{cXzKDgu0TPqjEF-w9qJURw}{nwT4vhVMSNKC_xthmClT_A}{ELK-02.hostname.com}{192.168.98.103:9300}{ml.machine_memory=22980575232, xpack.installed=true, box_type=warm, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[ELK-01.hostname.com][192.168.98.102:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [144] files with total size of [4.2gb]]; nested: RemoteTransportException[[ELK-02.hostname.com][192.168.98.103:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [8231262357/7.6gb], which is larger than the limit of [8231203635/7.6gb], usages [request=0/0b, fielddata=6623107425/6.1gb, in_flight_requests=1022428/998.4kb, accounting=1607132504/1.4gb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[.monitoring-es-6-2020.02.09][0], node[WNppBaICQTKHpO2DAedbpQ], [P], s[STARTED], a[id=DpdjTOlISEqRX0DUk7ECnw]]"
        }
      ]
    },
    {
      "node_id" : "cXzKDgu0TPqjEF-w9qJURw",
      "node_name" : "ELK-02.hostname.com",
      "transport_address" : "192.168.98.103:9300",
      "node_attributes" : {
        "ml.machine_memory" : "22980575232",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "box_type" : "warm",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-02-10T20:53:42.002Z], failed_attempts[5], delayed=false, details[failed shard on node [cXzKDgu0TPqjEF-w9qJURw]: failed recovery, failure RecoveryFailedException[[.monitoring-es-6-2020.02.09][0]: Recovery failed from {ELK-01.hostname.com}{WNppBaICQTKHpO2DAedbpQ}{lKY-IuxnT3auVM50oh_IeA}{ELK-01.hostname.com}{192.168.98.102:9300}{ml.machine_memory=25103704064, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, ml.enabled=true} into {ELK-02.hostname.com}{cXzKDgu0TPqjEF-w9qJURw}{nwT4vhVMSNKC_xthmClT_A}{ELK-02.hostname.com}{192.168.98.103:9300}{ml.machine_memory=22980575232, xpack.installed=true, box_type=warm, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[ELK-01.hostname.com][192.168.98.102:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [144] files with total size of [4.2gb]]; nested: RemoteTransportException[[ELK-02.hostname.com][192.168.98.103:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [8231262357/7.6gb], which is larger than the limit of [8231203635/7.6gb], usages [request=0/0b, fielddata=6623107425/6.1gb, in_flight_requests=1022428/998.4kb, accounting=1607132504/1.4gb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

I've seen a couple of different recommendations for other situations where this error arises, such as changing the bulk request size and editing the jvm.options file to update G1 GC settings. Since we are running Java 8, I don't believe the G1 GC settings apply.

Hi @mattsdevop

What you're running into is a bit of a known issue with the real memory circuit breaker that you can find here.

In detail, what you are experiencing is this:

  1. A shard replica is allocated.
  2. Data is recovered or replicated into the replica.
  3. The real memory circuit breaker kicks in because you are probably using almost all of your heap during a load spike.
  4. That fails the replica.
  5. Have that happen a few times in a row and the retries are exhausted -> the cluster stays yellow and allocation of the replica isn't retried.

There are a few things you can do here to fix or work around this:

  1. Obviously, increase available heap on the affected nodes if possible :slight_smile: I do realise this is not necessarily an option, but just saying :slight_smile:
  2. Take a look at the recovery settings here. If you have modified those manually to get quicker recoveries, tone them down. If you haven't, you can try setting lower values than the defaults for indices.recovery.max_bytes_per_sec and/or (I'd experiment here) indices.recovery.max_concurrent_file_chunks (you could just try the latter at 1 and the bytes per sec at 20MB/s, for example); see the sketch after this list.
  3. You could look into your mappings and see if field data should really take up 7GB of heap. Seems there may be some mapping/term-value explosion here? Lowering the memory usage here seems like a useful thing to do in general if this is not by design somehow.
  4. This will most likely work, but I'd rather you leave it as your last option since it moves you significantly away from standard ES settings: you could turn off the real memory circuit breaker, as documented here, by setting indices.breaker.total.use_real_memory to false.
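
For option 2, here is a minimal sketch of what those look like as transient cluster settings (the values are only starting points to experiment with, not recommendations):

PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "20mb",
    "indices.recovery.max_concurrent_file_chunks": 1
  }
}

Option 4 is different: as far as I know, indices.breaker.total.use_real_memory is a static setting, so it goes into elasticsearch.yml and needs a node restart rather than a settings API call.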
2 Likes

@Armin_Braun thank you for the very detailed reply. You gave me some direction on this. You are right, number 1 isn't an option :slight_smile: (and isn't an option that I want to resort to simply because it exists).

I started looking more at 2, 3, and 4 from your reply. This cluster is on 6.8 and it looks like things have changed in 7.x. I checked breaker stats and discovered that my secondary node was actually the offender. The parent breaker had tripped something like 11k times (I have multiple curator jobs that run nightly for cleanup, which likely ran this up). The CPU usage for elasticsearch was also jumping between 300-400% when checking with top. The stats looked something like this before I started:

"parent" : {
          "limit_size_in_bytes" : 8231203635,
          "limit_size" : "7.6gb",
          "estimated_size_in_bytes" : 8211203635,
          "estimated_size" : "7.2gb",
          "overhead" : 1.0,
          "tripped" : 112874
        }

Unfortunately, I did not record the other sections from the breaker stats.
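
For reference, the breaker numbers I'm quoting come from the node stats API; they can be pulled for all nodes, or scoped to a single node by name:

GET _nodes/stats/breaker?human

GET _nodes/ELK-02.hostname.com/stats/breaker?human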
For reference, curator logs were dumping:

NO MATCH: Value for key "0", health check data: 2
KWARGS= "{'relocating_shards': 0}"

and research on those messages led me nowhere.

I also discovered from watching the curator logs that some of my indices were not being allocated to the "warm" node, yet they remained green in the cluster. On the surface it appeared there was no issue with any indices other than the .monitoring-es indices, but this was not the case. I used GET index_name_here/_search_shards to check the allocation of the indices that were showing up in the curator logs under "Waiting for shards to complete relocation for indices:".
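
For reference, that check is just the search shards API against the index in question (the index name here is only an example):

GET env-stg-2020.02.20/_search_shards

The nodes and shards sections of the response show which node each shard copy is actually sitting on.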

At this point, it was obvious something was wrong on the secondary node. My suspicion is that elasticsearch had not been able to GC successfully and the CPU was running away trying to catch up. I restarted elasticsearch on the secondary node. The first start was not successful; after waiting about 5 minutes, I chose to stop it completely, wait a few more minutes, and start it again. This time it came up. It seemed to fight with electing a master node longer than usual, logging "no master eligible nodes" for about 5 minutes, but it eventually joined. Once the cluster turned yellow, I saw that the parent breaker's estimated heap size was back down. After giving it time to come up completely and the cluster turned green, I now see:

      "attributes" : {
        "ml.machine_memory" : "22980575232",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "box_type" : "warm",
        "ml.enabled" : "true"
      },
      "breakers" : {
        "request" : {
          "limit_size_in_bytes" : 7055317401,
          "limit_size" : "6.5gb",
          "estimated_size_in_bytes" : 0,
          "estimated_size" : "0b",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "fielddata" : {
          "limit_size_in_bytes" : 7055317401,
          "limit_size" : "6.5gb",
          "estimated_size_in_bytes" : 129584,
          "estimated_size" : "126.5kb",
          "overhead" : 1.03,
          "tripped" : 0
        },
        "in_flight_requests" : {
          "limit_size_in_bytes" : 11758862336,
          "limit_size" : "10.9gb",
          "estimated_size_in_bytes" : 9934584,
          "estimated_size" : "9.4mb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "accounting" : {
          "limit_size_in_bytes" : 11758862336,
          "limit_size" : "10.9gb",
          "estimated_size_in_bytes" : 1704292039,
          "estimated_size" : "1.5gb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "parent" : {
          "limit_size_in_bytes" : 8231203635,
          "limit_size" : "7.6gb",
          "estimated_size_in_bytes" : 1714356207,
          "estimated_size" : "1.5gb",
          "overhead" : 1.0,
          "tripped" : 0
        }

All of my indices are green and correctly allocated at this point, so everything seems to be happy for now. From here, is my direction still exploring indices.recovery.max_bytes_per_sec, or do you think I'd benefit more from circuit breaker settings? Or do you think this is a one-off and I should leave everything as-is and monitor? It certainly seems like odd behavior.

Documenting additional info from short-term testing. I'm trying to force the secondary node (which holds indices older than 8 days) into using large amounts of heap by searching 60 days back to introduce load. I haven't been able to trip it up yet by searching (via the Discover tab in Kibana) across multiple index patterns (multiple GB of documents) for the same logmessage. The load on the server increases with the search and comes back down afterwards, as expected.
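
For anyone curious, the Discover searches boil down to roughly this kind of query (the index patterns and search term here are just examples):

GET env-stg-*,env-prd-*/_search
{
  "size": 10,
  "sort": [ { "@timestamp": "desc" } ],
  "query": {
    "bool": {
      "must": [
        { "match": { "logmessage": "example-term" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-60d", "lte": "now" } } }
      ]
    }
  }
}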

After searching multiple times over different index patterns for 60 days, the breaker stats look like:

      "attributes" : {
        "ml.machine_memory" : "22980575232",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "box_type" : "warm",
        "ml.enabled" : "true"
      },
      "breakers" : {
        "request" : {
          "limit_size_in_bytes" : 7055317401,
          "limit_size" : "6.5gb",
          "estimated_size_in_bytes" : 0,
          "estimated_size" : "0b",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "fielddata" : {
          "limit_size_in_bytes" : 7055317401,
          "limit_size" : "6.5gb",
          "estimated_size_in_bytes" : 129584,
          "estimated_size" : "126.5kb",
          "overhead" : 1.03,
          "tripped" : 0
        },
        "in_flight_requests" : {
          "limit_size_in_bytes" : 11758862336,
          "limit_size" : "10.9gb",
          "estimated_size_in_bytes" : 3965,
          "estimated_size" : "3.8kb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "accounting" : {
          "limit_size_in_bytes" : 11758862336,
          "limit_size" : "10.9gb",
          "estimated_size_in_bytes" : 1705548753,
          "estimated_size" : "1.5gb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "parent" : {
          "limit_size_in_bytes" : 8231203635,
          "limit_size" : "7.6gb",
          "estimated_size_in_bytes" : 1705682302,
          "estimated_size" : "1.5gb",
          "overhead" : 1.0,
          "tripped" : 0
        }

The second node is back at it again...

      "attributes" : {
        "ml.machine_memory" : "22980575232",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "box_type" : "warm",
        "ml.enabled" : "true"
      },
      "breakers" : {
        "request" : {
          "limit_size_in_bytes" : 7055317401,
          "limit_size" : "6.5gb",
          "estimated_size_in_bytes" : 33096,
          "estimated_size" : "32.3kb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "fielddata" : {
          "limit_size_in_bytes" : 7055317401,
          "limit_size" : "6.5gb",
          "estimated_size_in_bytes" : 6351625912,
          "estimated_size" : "5.9gb",
          "overhead" : 1.03,
          "tripped" : 0
        },
        "in_flight_requests" : {
          "limit_size_in_bytes" : 11758862336,
          "limit_size" : "10.9gb",
          "estimated_size_in_bytes" : 3965,
          "estimated_size" : "3.8kb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "accounting" : {
          "limit_size_in_bytes" : 11758862336,
          "limit_size" : "10.9gb",
          "estimated_size_in_bytes" : 1687508343,
          "estimated_size" : "1.5gb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "parent" : {
          "limit_size_in_bytes" : 8231203635,
          "limit_size" : "7.6gb",
          "estimated_size_in_bytes" : 8039171316,
          "estimated_size" : "7.4gb",
          "overhead" : 1.0,
          "tripped" : 344038
        }

Digging more, it appears fielddata is filling up. When I check GET */_mapping, I can't find any text fields that explicitly have fielddata=true. But it looks like I have fielddata in the .monitoring-es indices; it appears this is the default, based on the template file on GitHub. This is what mine returns:

          "index_stats" : {
            "properties" : {
              "index" : {
                "type" : "keyword"
              },
              "primaries" : {
                "properties" : {
                  "docs" : {
                    "properties" : {
                      "count" : {
                        "type" : "long"
                      }
                    }
                  },
                  "fielddata" : {
                    "properties" : {
                      "evictions" : {
                        "type" : "long"
                      },
                      "memory_size_in_bytes" : {
                        "type" : "long"
                      }
                    }
                  },

When I check all of my templates, the only one that has the term fielddata in it is the .monitoring-es template. I'm listing it here:

  ".monitoring-es" : {
    "order" : 0,
    "version" : 6070299,
    "index_patterns" : [
      ".monitoring-es-6-*"
    ],
    "settings" : {
      "index" : {
        "format" : "6",
        "codec" : "best_compression",
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "number_of_replicas" : "0"
      }
    },
    "mappings" : {
        ...
          "index_stats" : {
            "properties" : {
              "index" : {
                "type" : "keyword"
              },
              "primaries" : {
                "properties" : {
                  "docs" : {
                    "properties" : {
                      "count" : {
                        "type" : "long"
                      }
                    }
                  },
                  "fielddata" : {
                    "properties" : {
                      "memory_size_in_bytes" : {
                        "type" : "long"
                      },
                      "evictions" : {
                        "type" : "long"
                      }
                    }
                  },
                ...
              "total" : {
                "properties" : {
                  "docs" : {
                    "properties" : {
                      "count" : {
                        "type" : "long"
                      }
                    }
                  },
                  "fielddata" : {
                    "properties" : {
                      "memory_size_in_bytes" : {
                        "type" : "long"
                      },
                      "evictions" : {
                        "type" : "long"
                      }
                    }
                  },
                  "store" : {
                    "properties" : {
                      "size_in_bytes" : {
                        "type" : "long"
                      }
                    }
                  },
                ...
              "indices" : {
                "properties" : {
                  "docs" : {
                    "properties" : {
                      "count" : {
                        "type" : "long"
                      }
                    }
                  },
                  "fielddata" : {
                    "properties" : {
                      "memory_size_in_bytes" : {
                        "type" : "long"
                      },
                      "evictions" : {
                        "type" : "long"
                      }
                    }
                  },
    ...
  },

I'm reading that fielddata is disabled by default. I don't have it enabled anywhere in my configuration files that I can find. Is there any reason for me not to change this? How should I modify this so as not to break it?

I think that GET _stats/fielddata?fielddata_fields=* will show you a field-by-field breakdown of the memory used by fielddata, which should hopefully point you in the right direction.
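
If the full output is too large, you can scope it to an index pattern, or look at it per node instead; for example:

GET .monitoring-es-6-*/_stats/fielddata?fielddata_fields=*

GET _nodes/stats/indices/fielddata?fielddata_fields=*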

1 Like

Thank you @DavidTurner for the quick reply. I just discovered that command as you were responding and came back to post my findings.

It turns out that many of my shards are culprits, looking something like this:

    "env-stg-2020.02.20" : {
      "uuid" : "TgkrNGtKTwGNDeSshJ2YLQ",
      "primaries" : {
        "fielddata" : {
          "memory_size_in_bytes" : 64716808,
          "evictions" : 0,
          "fields" : {
            "logmessage.keyword" : {
              "memory_size_in_bytes" : 225728
            },
            "fields.service.keyword" : {
              "memory_size_in_bytes" : 2096
            },
            "fields.env.keyword" : {
              "memory_size_in_bytes" : 1088
            },
            "_id" : {
              "memory_size_in_bytes" : 64487896
            }
          }
        }
      },

My template for this index looks like this:

  "default" : {
    "order" : -1,
    "index_patterns" : [
      "env-stg*",
      "env-prd*",
      "env2-stg*",
      "env2-prd*"
    ],
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "number_of_replicas" : "0"
      }
    },
    "mappings" : {
      "doc" : {
        "properties" : {
          "logTimestamphttp" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "auth" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "ident" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "source" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "lc_identifier" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "logTimestamp" : {
            "type" : "date"
          },
          "clientip" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "@version" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "beat" : {
            "properties" : {
              "hostname" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              },
              "name" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              },
              "version" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              }
            }
          },
          "host" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "logTimestampString" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "class" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "offset" : {
            "type" : "long"
          },
          "logmessage" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "verb" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "prospector" : {
            "properties" : {
              "type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              }
            }
          },
          "thread" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "tags" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "input" : {
            "properties" : {
              "type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              }
            }
          },
          "prelogmessage" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "@timestamp" : {
            "type" : "date"
          },
          "bytes" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "response" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "loglevel" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "httpversion" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "ignore_above" : 256,
                "type" : "keyword"
              }
            }
          },
          "fields" : {
            "properties" : {
              "product" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              },
              "service" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              },
              "env" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              },
              "customer" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              }
            }
          }
        }
      }
    },
    "aliases" : { }
  },

It looks like I overlooked this template when comparing the production and testing environments for deltas. However, everything appears to be working as I'd expect in testing. I believe the template currently in use in production was cherry-picked and has persisted since we were on version 5. I don't believe we were using it for anything specific other than to originally set the number of shards/replicas.

As a reference, the testing environment looks like this:

{
  "_shards" : {
    "total" : 239,
    "successful" : 239,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "fielddata" : {
        "memory_size_in_bytes" : 0,
        "evictions" : 0,
        "fields" : {
          "type" : {
            "memory_size_in_bytes" : 0
          },
          "_id" : {
            "memory_size_in_bytes" : 0
          }
        }
      }
    },

I imagine removing this template altogether is an option, or replacing it with the template that currently exists in our testing environment. Is there a better option?

This would have been my first guess:

We load fielddata on the _id field if you sort or aggregate on it. It looks like something's doing that, and it's probably a good idea to find out what it is and stop it. In 7.6 you can prevent this with a cluster setting, but note that this will break whatever is causing it.

You can clear the fielddata cache, which should free up a bunch of memory.
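
For completeness, clearing just the fielddata cache looks like this (cluster-wide, or scoped to an index pattern), followed by the reroute retry mentioned earlier in the thread:

POST _cache/clear?fielddata=true

POST env-stg-*/_cache/clear?fielddata=true

POST _cluster/reroute?retry_failed=true

(If I remember correctly, the 7.6 cluster setting I mentioned is indices.id_field_data.enabled.)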

1 Like

That was definitely the bulk of the fielddata usage. Some Visualize graphs were built with Unique Count on _id. I've changed those to Count for now; they seem to give roughly the same result, but, as you've pointed out, Count avoids the massive heap usage.
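
For anyone who finds this later: as I understand it, Kibana's Unique Count is a cardinality aggregation under the hood, which is what was loading _id fielddata. Roughly, the first request below versus a plain count:

GET env-stg-*/_search
{
  "size": 0,
  "aggs": {
    "unique_docs": { "cardinality": { "field": "_id" } }
  }
}

GET env-stg-*/_count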

Following that, I cleared the cache using the article you provided, then ran POST /_cluster/reroute?retry_failed=true. After about 10 minutes, everything is healthy :green_circle:.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.