CPU usage spike correlates with an increase in fielddata memory

Hi team,

We ran into a very strange situation in one of our production clusters.
CPU utilization on all nodes suddenly hit 100% after staying consistently under 30% for a long time. Looking at the Kibana metrics, the only variable that increased together with the CPU load was fielddata memory usage. Our assumption is that this may be the root cause, but we cannot understand exactly why.

PS1: The index that caused the “mini-spike” in fielddata memory receives many search requests with aggregations, but the number of requests at the time of the issue was no higher than normal.
PS2: The CPU clearly peaked while fielddata memory was increasing. Once fielddata memory dropped back to its previous level, CPU usage decreased as well.
PS3: The index does not have fielddata enabled for any text field. The field that uses the most memory is the internal _id field.

Am I right that an increase in fielddata memory can overload the cluster? If so, how can we control it, given that we have already disabled fielddata in the index mappings?
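For anyone wanting to confirm which fields hold fielddata, here is a minimal Python sketch. It assumes the `_cat/fielddata` endpoint is reachable and returns rows with `field` and `size` keys when called with `format=json&bytes=b`; the helper names and the sample rows are hypothetical and only illustrate the aggregation.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_cat_fielddata(base_url):
    """Fetch per-node, per-field fielddata sizes (bytes) from _cat/fielddata.

    base_url is an assumption, e.g. "http://localhost:9200".
    Not called below, so the sketch runs without a cluster.
    """
    url = f"{base_url}/_cat/fielddata?format=json&bytes=b"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def fielddata_by_field(rows):
    """Sum fielddata bytes per field across nodes, largest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["field"]] += int(row["size"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical rows shaped like the _cat/fielddata JSON output.
sample_rows = [
    {"node": "data-a", "field": "_id", "size": "900"},
    {"node": "data-b", "field": "_id", "size": "1100"},
    {"node": "data-a", "field": "some_keyword", "size": "300"},
]

ranking = fielddata_by_field(sample_rows)
for field, size in ranking:
    print(f"{field}: {size} bytes")
```

Against a real cluster, `fielddata_by_field(fetch_cat_fielddata(base_url))` would give the same ranking for live data.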


Thank you!

Which version of Elasticsearch are you using?

What is the full output of the cluster stats API?

What type of hardware is the cluster deployed on? What is the specification of the cluster?

Is there anything in the Elasticsearch logs around the time that could help?

Which version of Elasticsearch are you using?

7.10.2

What is the full output of the cluster stats API?

{
    "_nodes": {
        "total": 39,
        "successful": 39,
        "failed": 0
    },
    "cluster_name": "es5",
    "cluster_uuid": "_OudYRJ-RICDlvsvJLGjPQ",
    "timestamp": 1686151611099,
    "status": "green",
    "indices": {
        "count": 68,
        "shards": {
            "total": 1694,
            "primaries": 847,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 128,
                    "avg": 24.91176470588235
                },
                "primaries": {
                    "min": 1,
                    "max": 64,
                    "avg": 12.455882352941176
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 7829563501,
            "deleted": 1030706222
        },
        "store": {
            "size_in_bytes": 9622182356304,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 39584717832,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 36300,
            "memory_in_bytes": 5905353648,
            "terms_memory_in_bytes": 4245119520,
            "stored_fields_memory_in_bytes": 19364880,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 276464768,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 1364404480,
            "index_writer_memory_in_bytes": 9218763928,
            "version_map_memory_in_bytes": 336156,
            "fixed_bit_set_memory_in_bytes": 5345240464,
            "max_unsafe_auto_id_timestamp": 1683591946009,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "boolean",
                    "count": 177,
                    "index_count": 19
                },
                {
                    "name": "date",
                    "count": 132,
                    "index_count": 67
                },
                {
                    "name": "float",
                    "count": 42,
                    "index_count": 7
                },
                {
                    "name": "geo_point",
                    "count": 14,
                    "index_count": 14
                },
                {
                    "name": "geo_shape",
                    "count": 14,
                    "index_count": 14
                },
                {
                    "name": "join",
                    "count": 58,
                    "index_count": 58
                },
                {
                    "name": "keyword",
                    "count": 15967,
                    "index_count": 68
                },
                {
                    "name": "long",
                    "count": 11947,
                    "index_count": 68
                },
                {
                    "name": "nested",
                    "count": 324,
                    "index_count": 61
                },
                {
                    "name": "object",
                    "count": 4543,
                    "index_count": 68
                },
                {
                    "name": "text",
                    "count": 4618,
                    "index_count": 68
                }
            ]
        },
        "analysis": {
            "char_filter_types": [
                {
                    "name": "mapping",
                    "count": 63,
                    "index_count": 63
                }
            ],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [
                {
                    "name": "custom",
                    "count": 63,
                    "index_count": 63
                }
            ],
            "built_in_char_filters": [
                {
                    "name": "icu_normalizer",
                    "count": 63,
                    "index_count": 63
                }
            ],
            "built_in_tokenizers": [
                {
                    "name": "icu_tokenizer",
                    "count": 63,
                    "index_count": 63
                }
            ],
            "built_in_filters": [
                {
                    "name": "icu_folding",
                    "count": 63,
                    "index_count": 63
                }
            ],
            "built_in_analyzers": []
        }
    },
    "nodes": {
        "count": {
            "total": 39,
            "coordinating_only": 0,
            "data": 36,
            "data_cold": 0,
            "data_content": 0,
            "data_hot": 0,
            "data_warm": 0,
            "ingest": 39,
            "master": 3,
            "ml": 0,
            "remote_cluster_client": 0,
            "transform": 0,
            "voting_only": 0
        },
        "versions": [
            "7.10.2"
        ],
        "os": {
            "available_processors": 480,
            "allocated_processors": 480,
            "names": [
                {
                    "name": "Linux",
                    "count": 39
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "CentOS Linux 8",
                    "count": 39
                }
            ],
            "mem": {
                "total_in_bytes": 2254857830400,
                "free_in_bytes": 27275284480,
                "used_in_bytes": 2227582545920,
                "free_percent": 1,
                "used_percent": 99
            }
        },
        "process": {
            "cpu": {
                "percent": 308
            },
            "open_file_descriptors": {
                "min": 1205,
                "max": 2105,
                "avg": 1983
            }
        },
        "jvm": {
            "max_uptime_in_millis": 1535613104,
            "versions": [
                {
                    "version": "15.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "15.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 39
                }
            ],
            "mem": {
                "heap_used_in_bytes": 497588438616,
                "heap_max_in_bytes": 1127428915200
            },
            "threads": 3507
        },
        "fs": {
            "total_in_bytes": 65114254344192,
            "free_in_bytes": 55355361214464,
            "available_in_bytes": 55355361214464
        },
        "plugins": [
            {
                "name": "repository-azure",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The Azure Repository plugin adds support for Azure storage repositories.",
                "classname": "org.elasticsearch.repositories.azure.AzureRepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            },
            {
                "name": "analysis-icu",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
                "classname": "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            },
            {
                "name": "repository-s3",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The S3 repository plugin adds S3 repositories",
                "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            },
            {
                "name": "repository-gcs",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The GCS repository plugin adds Google Cloud Storage support for repositories.",
                "classname": "org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin",
                "extended_plugins": [],
                "has_native_controller": false
            }
        ],
        "network_types": {
            "transport_types": {
                "netty4": 39
            },
            "http_types": {
                "netty4": 39
            }
        },
        "discovery_types": {
            "zen": 39
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 39
            }
        ],
        "ingest": {
            "number_of_pipelines": 0,
            "processor_stats": {}
        }
    }
}
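To put the numbers in that output in proportion, here is a small Python sketch that computes a few ratios. The values are copied from the cluster stats above; which ratios to look at is my own choice, and the snapshot reflects the cluster at the time the stats were taken, not necessarily during the spike.

```python
# Values copied from the cluster stats output above.
fielddata_bytes = 39_584_717_832
heap_used_bytes = 497_588_438_616
heap_max_bytes = 1_127_428_915_200
docs_count = 7_829_563_501
docs_deleted = 1_030_706_222


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Fielddata as a share of the total heap across all 39 nodes.
fielddata_pct_of_heap = pct(fielddata_bytes, heap_max_bytes)
# Overall heap utilization at snapshot time.
heap_used_pct = pct(heap_used_bytes, heap_max_bytes)
# Deleted documents as a share of live plus deleted documents.
deleted_pct = pct(docs_deleted, docs_count + docs_deleted)

print(f"fielddata / heap max: {fielddata_pct_of_heap:.2f}%")
print(f"heap used / heap max: {heap_used_pct:.2f}%")
print(f"deleted docs share:   {deleted_pct:.2f}%")
```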

What type of hardware is the cluster deployed on? What is the specification of the cluster?

  • The cluster is deployed on AWS EC2 instances
  • 36 data nodes spread across 3 zones
  • Each data node:
    • 29184 MB heap
    • 64 GB total machine memory
    • 13 vCPUs
    • 1.5 TB of storage
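As a rough per-node view of these figures, the sketch below compares the heap size with the fielddata circuit breaker limit. It assumes the breaker is left at what I understand to be the 7.x default (`indices.breaker.fielddata.limit` at 40% of heap) and that fielddata is spread evenly over the 36 data nodes, which may not hold in practice.

```python
# Per-node heap from the spec above (treated as MiB).
heap_mib = 29184
data_nodes = 36

# Assumption: fielddata breaker left at the 7.x default of 40% of heap.
fielddata_breaker_fraction = 0.40

# Cluster-wide fielddata size from the cluster stats output.
fielddata_bytes_total = 39_584_717_832

heap_gib = heap_mib / 1024
breaker_limit_mib = heap_mib * fielddata_breaker_fraction

# Assumption: fielddata is evenly distributed across data nodes.
avg_fielddata_mib = fielddata_bytes_total / data_nodes / (1024 * 1024)
breaker_usage_pct = 100.0 * avg_fielddata_mib / breaker_limit_mib

print(f"heap per data node:        {heap_gib:.1f} GiB")
print(f"fielddata breaker limit:   {breaker_limit_mib:.1f} MiB")
print(f"avg fielddata per node:    {avg_fielddata_mib:.1f} MiB")
print(f"share of breaker limit:    {breaker_usage_pct:.1f}%")
```

Under those assumptions the snapshot sits well below the breaker limit, so the breaker alone would not have stopped fielddata from growing during the spike.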

Is there anything in the Elasticsearch logs around the time that could help?

Only timeout WARN logs, probably due to the high CPU load.

May 30 23:03:09 es5-es-data-2-7 elasticsearch WARN Received response for a request that has timed out, sent [1200ms] ago, timed out [200ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-1-8}{Sy8njqbpQ_OUq7C4i1y2QA}{aOdsKc0qTtqfbfgtfp1-Qw}{10.17.53.126}{10.17.53.126:9300}{di}{k8s_node_name=ip-10-17-2-147.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}], id [36759818]
May 30 23:10:28 es5-es-master-1-0 elasticsearch WARN Received response for a request that has timed out, sent [2401ms] ago, timed out [1400ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-3-8}{3UsRs9bSR1ibctFI9WxanA}{IWtkX_JBReeHz02bnlMSPw}{10.17.182.26}{10.17.182.26:9300}{di}{k8s_node_name=ip-10-17-165-105.ec2.internal, xpack.installed=true, zone=us-east-1c, transform.node=false}], id [81419022]
May 30 23:10:28 es5-es-master-1-0 elasticsearch WARN Received response for a request that has timed out, sent [2401ms] ago, timed out [1400ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-3-8}{3UsRs9bSR1ibctFI9WxanA}{IWtkX_JBReeHz02bnlMSPw}{10.17.182.26}{10.17.182.26:9300}{di}{k8s_node_name=ip-10-17-165-105.ec2.internal, xpack.installed=true, zone=us-east-1c, transform.node=false}], id [81419022]
May 30 23:10:28 es5-es-data-2-6 elasticsearch WARN Received response for a request that has timed out, sent [2401ms] ago, timed out [1400ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-3-8}{3UsRs9bSR1ibctFI9WxanA}{IWtkX_JBReeHz02bnlMSPw}{10.17.182.26}{10.17.182.26:9300}{di}{k8s_node_name=ip-10-17-165-105.ec2.internal, xpack.installed=true, zone=us-east-1c, transform.node=false}], id [36216489]
May 30 23:10:28 es5-es-data-2-1 elasticsearch WARN Received response for a request that has timed out, sent [1600ms] ago, timed out [600ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-3-8}{3UsRs9bSR1ibctFI9WxanA}{IWtkX_JBReeHz02bnlMSPw}{10.17.182.26}{10.17.182.26:9300}{di}{k8s_node_name=ip-10-17-165-105.ec2.internal, xpack.installed=true, zone=us-east-1c, transform.node=false}], id [34979619]
May 30 23:10:40 es5-es-data-2-7 elasticsearch WARN Received response for a request that has timed out, sent [1371ms] ago, timed out [370ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-2-8}{6rHioiCFR62zmOWwpiWuUQ}{qv9ZzKvPRc2CqmbGdweo2A}{10.17.71.216}{10.17.71.216:9300}{di}{k8s_node_name=ip-10-17-90-69.ec2.internal, xpack.installed=true, zone=us-east-1b, transform.node=false}], id [36845982]
May 30 23:11:34 es5-es-data-3-9 elasticsearch WARN Received response for a request that has timed out, sent [1201ms] ago, timed out [200ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-2-2}{23IrghTPScmF-7nCEJR3_g}{CfyjH1wXRT-QxQ7SnlRtMQ}{10.17.113.138}{10.17.113.138:9300}{di}{k8s_node_name=ip-10-17-125-35.ec2.internal, xpack.installed=true, zone=us-east-1b, transform.node=false}], id [29630389]
May 30 23:11:34 es5-es-data-3-9 elasticsearch WARN Received response for a request that has timed out, sent [1201ms] ago, timed out [200ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-1-9}{p6ck8X6CT6aJdCONjLpxOg}{6vxrnlEfTpSD09yeuq4Q_g}{10.17.36.66}{10.17.36.66:9300}{di}{k8s_node_name=ip-10-17-35-216.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}], id [29630367]
May 30 23:11:34 es5-es-data-3-9 elasticsearch WARN Received response for a request that has timed out, sent [1201ms] ago, timed out [200ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-2-6}{53Iw6u3tRQalKQR9cxa-EA}{IHa3h8ItTCmbUj_EFK5Iug}{10.17.100.17}{10.17.100.17:9300}{di}{k8s_node_name=ip-10-17-92-68.ec2.internal, xpack.installed=true, zone=us-east-1b, transform.node=false}], id [29630364]
May 30 23:13:01 es5-es-data-2-6 elasticsearch WARN Received response for a request that has timed out, sent [2201ms] ago, timed out [1201ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-1-0}{LkyqUapYT_eP5sNQlaqA6A}{R5S4ikSpRdmu_tMJ-HV6LA}{10.17.12.212}{10.17.12.212:9300}{di}{k8s_node_name=ip-10-17-34-230.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}], id [36241282]
May 30 23:13:01 es5-es-data-2-6 elasticsearch WARN Received response for a request that has timed out, sent [2201ms] ago, timed out [1201ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-master-1-0}{MYskVyDzR66DjKEWQl8g1g}{WmNTZLcxQYCtbihDrP-fDw}{10.17.44.72}{10.17.44.72:9300}{im}{k8s_node_name=ip-10-17-23-210.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}], id [36241299]
May 30 23:13:02 es5-es-data-2-6 elasticsearch WARN Received response for a request that has timed out, sent [2601ms] ago, timed out [1601ms] ago, action [cluster:monitor/nodes/info[n]], node [{es5-es-data-2-4}{Yu17lugqSYCkBuShTyZdUQ}{2aqexsvMT2GlvDRPaQlLag}{10.17.105.202}{10.17.105.202:9300}{di}{k8s_node_name=ip-10-17-87-116.ec2.internal, xpack.installed=true, zone=us-east-1b, transform.node=false}], id [36241283]
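To see which nodes were slow to respond, the WARN lines above can be parsed with a short script. This is a sketch: the regular expression is my own and only fits this message format, and the sample lines are copied from the logs above but trimmed after the target node name.

```python
import re
from collections import Counter

# Matches the "timed out" WARN lines shown above (this format only).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<reporter>\S+) elasticsearch WARN "
    r"Received response for a request that has timed out, "
    r"sent \[(?P<sent_ms>\d+)ms\] ago, timed out \[(?P<late_ms>\d+)ms\] ago, "
    r"action \[(?P<action>.+?)\], node \[\{(?P<target>[^}]+)\}"
)


def parse_line(line):
    """Return a dict of fields for a matching line, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["sent_ms"] = int(rec["sent_ms"])
    rec["late_ms"] = int(rec["late_ms"])
    return rec


# Lines from the logs above, trimmed after the target node name.
PREFIX = "Received response for a request that has timed out, "
sample = [
    "May 30 23:03:09 es5-es-data-2-7 elasticsearch WARN " + PREFIX
    + "sent [1200ms] ago, timed out [200ms] ago, "
    + "action [cluster:monitor/nodes/info[n]], node [{es5-es-data-1-8}",
    "May 30 23:10:28 es5-es-master-1-0 elasticsearch WARN " + PREFIX
    + "sent [2401ms] ago, timed out [1400ms] ago, "
    + "action [cluster:monitor/nodes/info[n]], node [{es5-es-data-3-8}",
    "May 30 23:10:28 es5-es-data-2-6 elasticsearch WARN " + PREFIX
    + "sent [2401ms] ago, timed out [1400ms] ago, "
    + "action [cluster:monitor/nodes/info[n]], node [{es5-es-data-3-8}",
]

parsed = [rec for rec in map(parse_line, sample) if rec is not None]
slow_targets = Counter(rec["target"] for rec in parsed)

for rec in parsed:
    # sent - late approximates the request timeout that was exceeded.
    implied_timeout = rec["sent_ms"] - rec["late_ms"]
    print(rec["ts"], rec["reporter"], "->", rec["target"],
          f"late by {rec['late_ms']}ms (timeout ~{implied_timeout}ms)")
```

In every line above, the difference between the two durations is about 1000 ms, so these are node info requests that missed a roughly one-second timeout by between 0.2 and 1.6 seconds.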

Please note that this version is EOL and no longer supported; you should upgrade as a matter of urgency.

Your JVM is also very old, and you should upgrade that too.

We are going to upgrade soon to v7.17 and then to 8.8.
Do you think the upgrade is enough to address this fielddata memory issue?

I'm not sure whether it is enough, but in my experience it will help a lot.

When I upgraded from 7.9 to 7.12, I noticed a great improvement in performance and memory usage, and another improvement when going to 7.17 and then 8.

There are a couple of memory improvements between 7.10 and 7.17 that may help in your case.
