Elasticsearch intermittently slow to process requests

Hi, I have a Node.js consumer using RabbitMQ that does one of these two things:

  1. Simple indexing into Elasticsearch with a predefined _id (using the Index API)
  2. Searching for a specific _id and, if the document exists, partially updating it (using the Get API and Index API)
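As an aside, the get-then-index round trip in step 2 can usually be collapsed into a single call with the Update API's doc_as_upsert option. A minimal sketch, assuming the official @elastic/elasticsearch client (the index name, id, field, and helper name here are illustrative, not from the actual consumer code):

```javascript
// Build the request body for a partial update with upsert semantics:
// if the document exists its fields are merged, otherwise the partial
// doc is indexed as a new document -- one round trip instead of two.
function buildUpsertRequest(indexName, id, partialDoc) {
  return {
    index: indexName,
    id,
    body: {
      doc: partialDoc,      // fields to merge into an existing document
      doc_as_upsert: true,  // index partialDoc as-is when the id is absent
    },
  };
}

// e.g. await client.update(buildUpsertRequest('myindex-2020.11.17', 'abc', { status: 'paid' }))
```

Halving the round trips also halves the exposure to the intermittent latency spikes described below.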

Our Elasticsearch setup is:

  1. We have 3 data nodes, each with 24 GB RAM, a 6-core CPU, and a 15 GB heap
  2. Each index uses 1 primary shard and 1 replica
  3. Indices are split by day with the format {my index name}-{YYYY.mm.dd}
  4. The index actively used by the Node.js app currently holds only about 5 million documents (~500 MB)
  5. Across the whole cluster there are about 600 million documents (~500 GB)
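The daily naming scheme in point 3 can be sketched like this (the function name is illustrative, not from the actual consumer code):

```javascript
// Produce an index name of the form {baseName}-{YYYY.MM.dd},
// e.g. "myindex-2020.11.17", using UTC date components.
function dailyIndexName(baseName, date) {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(date.getUTCDate()).padStart(2, '0');
  return `${baseName}-${yyyy}.${mm}.${dd}`;
}
```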

Most operations complete in under 100 ms, but intermittently Elasticsearch takes ~12 seconds to respond to all incoming requests, which drops our RabbitMQ consume rate to 0 for several seconds.

Is there any internal Elasticsearch process that could cause requests to stall for a couple of seconds?

Thanks

It could be GC. What do your logs show at that time?

Hi, thanks for replying and giving me a clue.
This kind of message frequently appears on the master node:

#filename: {clustername}.log
[2020-11-17T21:20:20,191][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data
[2020-11-17T21:20:24,218][WARN ][o.e.t.TransportService   ] [node-3] Received response for a request that has timed out, sent [12819ms] ago, timed out [2802ms] ago, action [internal:coordination/fault_detection/follower_check], node [{client-node}{BafoEQhsTie3aN-oeOURgA}{DLseWT1xTLOxUOmhlx78vg}{172.19.31.171}{172.19.31.171:9300}{l}{ml.machine_memory=25239240704, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}], id [86288930]
[2020-11-17T21:20:40,148][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data
[2020-11-17T21:21:00,162][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data
[2020-11-17T21:21:20,157][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data
[2020-11-17T21:21:40,152][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data
[2020-11-17T21:22:00,179][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data
.........
[2020-11-17T21:25:44,347][WARN ][o.e.t.TransportService   ] [node-3] Received response for a request that has timed out, sent [13412ms] ago, timed out [3402ms] ago, action [internal:coordination/fault_detection/follower_check], node [{client-node}{BafoEQhsTie3aN-oeOURgA}{DLseWT1xTLOxUOmhlx78vg}{172.19.31.171}{172.19.31.171:9300}{l}{ml.machine_memory=25239240704, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}], id [86356982]
[2020-11-17T21:26:00,178][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-3] collector [cluster_stats] timed out when collecting data

Do you think this is what caused the issue? Or is there another log I need to check?

Thanks!

What's the output from _cluster/stats?

Here is the output of _cluster/stats:

{
    "_nodes": {
        "total": 4,
        "successful": 4,
        "failed": 0
    },
    "cluster_name": "clustername",
    "cluster_uuid": "jCFIdOZSSvmKtFr2ZwmmyQ",
    "timestamp": 1605653605021,
    "status": "green",
    "indices": {
        "count": 1473,
        "shards": {
            "total": 2946,
            "primaries": 1473,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 2,
                    "avg": 2.0
                },
                "primaries": {
                    "min": 1,
                    "max": 1,
                    "avg": 1.0
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 633989445,
            "deleted": 9170523
        },
        "store": {
            "size_in_bytes": 520881729633
        },
        "fielddata": {
            "memory_size_in_bytes": 16192,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 147224284,
            "total_count": 98145757,
            "hit_count": 307627,
            "miss_count": 97838130,
            "cache_size": 45050,
            "cache_count": 45687,
            "evictions": 637
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 26998,
            "memory_in_bytes": 372963504,
            "terms_memory_in_bytes": 248738872,
            "stored_fields_memory_in_bytes": 20735776,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 3508416,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 99980440,
            "index_writer_memory_in_bytes": 550131526,
            "version_map_memory_in_bytes": 15655958,
            "fixed_bit_set_memory_in_bytes": 122081136,
            "max_unsafe_auto_id_timestamp": 1605653369296,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "alias",
                    "count": 26569,
                    "index_count": 1375
                },
                {
                    "name": "binary",
                    "count": 7,
                    "index_count": 2
                },
                {
                    "name": "boolean",
                    "count": 110080,
                    "index_count": 1456
                },
                {
                    "name": "byte",
                    "count": 651,
                    "index_count": 651
                },
                {
                    "name": "date",
                    "count": 97590,
                    "index_count": 1468
                },
                {
                    "name": "date_range",
                    "count": 11,
                    "index_count": 11
                },
                {
                    "name": "double",
                    "count": 83844,
                    "index_count": 1375
                },
                {
                    "name": "flattened",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "float",
                    "count": 86367,
                    "index_count": 1429
                },
                {
                    "name": "geo_point",
                    "count": 10440,
                    "index_count": 1390
                },
                {
                    "name": "geo_shape",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "half_float",
                    "count": 62,
                    "index_count": 17
                },
                {
                    "name": "integer",
                    "count": 183,
                    "index_count": 11
                },
                {
                    "name": "ip",
                    "count": 82261,
                    "index_count": 1397
                },
                {
                    "name": "keyword",
                    "count": 2014345,
                    "index_count": 1469
                },
                {
                    "name": "long",
                    "count": 2061331,
                    "index_count": 1463
                },
                {
                    "name": "nested",
                    "count": 390,
                    "index_count": 370
                },
                {
                    "name": "object",
                    "count": 1962345,
                    "index_count": 1463
                },
                {
                    "name": "scaled_float",
                    "count": 72912,
                    "index_count": 651
                },
                {
                    "name": "short",
                    "count": 73124,
                    "index_count": 724
                },
                {
                    "name": "text",
                    "count": 103404,
                    "index_count": 1461
                }
            ]
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [
                {
                    "name": "pattern_capture",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "analyzer_types": [
                {
                    "name": "custom",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "built_in_char_filters": [],
            "built_in_tokenizers": [
                {
                    "name": "uax_url_email",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "built_in_filters": [
                {
                    "name": "lowercase",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "unique",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "built_in_analyzers": [
                {
                    "name": "simple",
                    "count": 28,
                    "index_count": 7
                }
            ]
        }
    },
    "nodes": {
        "count": {
            "total": 4,
            "coordinating_only": 0,
            "data": 3,
            "ingest": 3,
            "master": 3,
            "ml": 4,
            "remote_cluster_client": 0,
            "transform": 3,
            "voting_only": 0
        },
        "versions": [
            "7.7.0",
            "7.7.1"
        ],
        "os": {
            "available_processors": 24,
            "allocated_processors": 24,
            "names": [
                {
                    "name": "Linux",
                    "count": 4
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 18.04.3 LTS",
                    "count": 4
                }
            ],
            "mem": {
                "total_in_bytes": 100956962816,
                "free_in_bytes": 1481793536,
                "used_in_bytes": 99475169280,
                "free_percent": 1,
                "used_percent": 99
            }
        },
        "process": {
            "cpu": {
                "percent": 113
            },
            "open_file_descriptors": {
                "min": 398,
                "max": 10811,
                "avg": 8118
            }
        },
        "jvm": {
            "max_uptime_in_millis": 1200108636,
            "versions": [
                {
                    "version": "14",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "14+36",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 3
                },
                {
                    "version": "14.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "14.0.1+7",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 1
                }
            ],
            "mem": {
                "heap_used_in_bytes": 25539251816,
                "heap_max_in_bytes": 41875931136
            },
            "threads": 663
        },
        "fs": {
            "total_in_bytes": 4201923051520,
            "free_in_bytes": 3433782341632,
            "available_in_bytes": 3262779764736
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 4
            },
            "http_types": {
                "security4": 4
            }
        },
        "discovery_types": {
            "zen": 4
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "deb",
                "count": 4
            }
        ],
        "ingest": {
            "number_of_pipelines": 26,
            "processor_stats": {
                "append": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "conditional": {
                    "count": 363381,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 3945
                },
                "convert": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "date": {
                    "count": 2438,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 12
                },
                "dot_expander": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "geoip": {
                    "count": 219876,
                    "failed": 107418,
                    "current": 0,
                    "time_in_millis": 21777
                },
                "grok": {
                    "count": 345930,
                    "failed": 68,
                    "current": 0,
                    "time_in_millis": 2110
                },
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "json": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "remove": {
                    "count": 686689,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 55
                },
                "rename": {
                    "count": 563194,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 57
                },
                "script": {
                    "count": 107447,
                    "failed": 107418,
                    "current": 0,
                    "time_in_millis": 1578
                },
                "set": {
                    "count": 217472,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 14
                },
                "split": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "user_agent": {
                    "count": 2438,
                    "failed": 3,
                    "current": 0,
                    "time_in_millis": 784
                }
            }
        }
    }
}

The cluster status stays green during the slowdown.
Additional information: besides the 3 data nodes, we also have 1 coordinating node, but when I check the node counts it says 0 coordinating_only while that node is counted as an ml node.
What does this mean?

Your shard count seems pretty high for the amount of data you have. Can you look at reducing that by using _shrink or _reindex?

Does this make sense? We currently have 1,480 indices in total, and each contains 2 shards.
I checked further: of those 1,480 indices, 90 percent are Metricbeat and Filebeat indices.

Do you think something is wrong with my index configuration?

A high number of shards wastes resources.
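The _cluster/stats output above bears this out: dividing the total store size by the shard count gives a very small average shard, far below the commonly recommended shard size of tens of GB.

```javascript
// Back-of-envelope arithmetic from the _cluster/stats output above.
const totalBytes = 520881729633; // indices.store.size_in_bytes
const totalShards = 2946;        // indices.shards.total
const avgShardMB = totalBytes / totalShards / (1024 * 1024);
// avgShardMB is roughly 170 MB per shard -- and each of those ~2,946
// shards also carries fixed heap and file-handle overhead.
```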

Yeah.

May I suggest you look at the following resources about sizing:

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

Hi,

Thanks for referencing those videos; I will check them out.
Meanwhile, to solve my issue above, I finally used Logstash to manage my bulk indexing into Elasticsearch.

My consumer just writes the data to be indexed/updated into a file, then Filebeat pushes it to Logstash before it is stored in Elasticsearch.

Anyway, this video (https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing) gives clear guidance and a simulation of how to measure cluster sizing.

Thanks!

What are you using Logstash for in that case? Can't you just use Filebeat to push to Elasticsearch?

My reasons for using Logstash are:

  1. I need to index the document into an index named <index>-<yyyy.mm.dd>, where yyyy.mm.dd refers to the created_at field, not the time when Filebeat pushes to Elasticsearch
  2. I need to keep the index-or-update logic for when the _id already exists. Is Filebeat able to accommodate this?
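For reference, both requirements can in principle be expressed directly in the Logstash pipeline. This is only a sketch: the host, the _id field name, and the index prefix are assumptions, not from the original setup.

```
filter {
  # Parse created_at (assumed ISO8601) into @timestamp, so the index
  # name below is derived from it rather than from ingest time.
  date {
    match  => ["created_at", "ISO8601"]
    target => "@timestamp"
  }
}
output {
  elasticsearch {
    hosts         => ["localhost:9200"]  # assumption
    index         => "myindex-%{+YYYY.MM.dd}"
    document_id   => "%{my_id_field}"    # assumed field carrying the _id
    action        => "update"
    doc_as_upsert => true                # update if present, index otherwise
  }
}
```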

Please give me your advice.

Thanks

I think you can use the following feature to extract the right date from your fields:

I'm not sure, but maybe you can extract the _id from your fields by setting the @metadata._id field using:

But it would be better to ask this question in #elastic-stack:beats with the #filebeat tag.
