Refresh API taking too long

Hi folks,

I am having some 409 - versioning conflict problems when using deleteByQuery so I decided to run a POST /_refresh when I receive a 409 and just before trying again.

The rate of operations recovery using this is very good, and the cluster seems not impacted too much coz I don't have that many errors.

However, there is very common for me to see that request taking up to TEN SECONDS to finish
and thats very bad.

I have ONE index with two primaries and one replica, 24m docs, taking 17.4 GB
My index rate is 25.43 /s at peak time
My search rate is 112.35 /s at peak time

Heap and caches are stable
Memory pressure is under 60% at peak time
CPU usage is under 40% on peak time

Is it "normal" to see a refresh operation taking that long?

May you please help me to understand this?

Thanks
Rafael

Welcome to our community! :smiley:

What sort of storage are you using?
What is the output from the _cluster/stats?pretty&human API?

Thanks warkolm

This is on Elastic Cloud

... but still I can have this output from the stats API ...

{
    "_nodes": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "cluster_name": "XXXXXX",
    "cluster_uuid": "XXXXX",
    "timestamp": 1684320383008,
    "status": "green",
    "indices": {
        "count": 29,
        "shards": {
            "total": 60,
            "primaries": 30,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 4,
                    "avg": 2.0689655172413794
                },
                "primaries": {
                    "min": 1,
                    "max": 2,
                    "avg": 1.0344827586206897
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 25179424,
            "deleted": 2416881
        },
        "store": {
            "size": "18.2gb",
            "size_in_bytes": 19626975383,
            "total_data_set_size": "18.2gb",
            "total_data_set_size_in_bytes": 19626975383,
            "reserved": "0b",
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size": "317.6mb",
            "memory_size_in_bytes": 333087624,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "11.3mb",
            "memory_size_in_bytes": 11877519,
            "total_count": 105338067,
            "hit_count": 2874090,
            "miss_count": 102463977,
            "cache_size": 46804,
            "cache_count": 46805,
            "evictions": 1
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 336,
            "memory": "0b",
            "memory_in_bytes": 0,
            "terms_memory": "0b",
            "terms_memory_in_bytes": 0,
            "stored_fields_memory": "0b",
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory": "0b",
            "term_vectors_memory_in_bytes": 0,
            "norms_memory": "0b",
            "norms_memory_in_bytes": 0,
            "points_memory": "0b",
            "points_memory_in_bytes": 0,
            "doc_values_memory": "0b",
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory": "1.1mb",
            "index_writer_memory_in_bytes": 1230240,
            "version_map_memory": "47.4kb",
            "version_map_memory_in_bytes": 48549,
            "fixed_bit_set": "5.8kb",
            "fixed_bit_set_memory_in_bytes": 5976,
            "max_unsafe_auto_id_timestamp": 1684265972184,
            "file_sizes": {}
        },
        "mappings": {
            "total_field_count": 727,
            "total_deduplicated_field_count": 206,
            "total_deduplicated_mapping_size": "22.7kb",
            "total_deduplicated_mapping_size_in_bytes": 23313,
            "field_types": [
                {
                    "name": "boolean",
                    "count": 21,
                    "index_count": 13,
                    "script_count": 0
                },
                {
                    "name": "constant_keyword",
                    "count": 6,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "date",
                    "count": 58,
                    "index_count": 16,
                    "script_count": 0
                },
                {
                    "name": "float",
                    "count": 10,
                    "index_count": 5,
                    "script_count": 0
                },
                {
                    "name": "integer",
                    "count": 1,
                    "index_count": 1,
                    "script_count": 0
                },
                {
                    "name": "ip",
                    "count": 2,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "keyword",
                    "count": 348,
                    "index_count": 16,
                    "script_count": 0
                },
                {
                    "name": "long",
                    "count": 119,
                    "index_count": 9,
                    "script_count": 0
                },
                {
                    "name": "nested",
                    "count": 5,
                    "index_count": 5,
                    "script_count": 0
                },
                {
                    "name": "object",
                    "count": 108,
                    "index_count": 15,
                    "script_count": 0
                },
                {
                    "name": "text",
                    "count": 44,
                    "index_count": 16,
                    "script_count": 0
                },
                {
                    "name": "version",
                    "count": 5,
                    "index_count": 5,
                    "script_count": 0
                }
            ],
            "runtime_field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": [
                {
                    "name": "whitespace",
                    "count": 1,
                    "index_count": 1
                }
            ]
        },
        "versions": [
            {
                "version": "8.6.1",
                "index_count": 12,
                "primary_shard_count": 12,
                "total_primary_size": "47.7mb",
                "total_primary_bytes": 50042260
            },
            {
                "version": "8.6.2",
                "index_count": 17,
                "primary_shard_count": 18,
                "total_primary_size": "8.9gb",
                "total_primary_bytes": 9599977028
            }
        ],
        "search": {
            "total": 2165936,
            "queries": {
                "match_phrase": 20,
                "bool": 2165680,
                "terms": 3608,
                "prefix": 1,
                "match": 2143451,
                "match_phrase_prefix": 1,
                "exists": 6400,
                "range": 2112471,
                "term": 18868,
                "nested": 1,
                "simple_query_string": 1815,
                "wildcard": 1
            },
            "sections": {
                "runtime_mappings": 10899,
                "query": 2165857,
                "terminate_after": 1,
                "pit": 260,
                "_source": 100,
                "aggs": 82819
            }
        }
    },
    "nodes": {
        "count": {
            "total": 3,
            "coordinating_only": 0,
            "data": 0,
            "data_cold": 0,
            "data_content": 2,
            "data_frozen": 0,
            "data_hot": 2,
            "data_warm": 0,
            "index": 0,
            "ingest": 2,
            "master": 3,
            "ml": 0,
            "remote_cluster_client": 2,
            "search": 0,
            "transform": 2,
            "voting_only": 1
        },
        "versions": [
            "8.6.2"
        ],
        "os": {
            "available_processors": 6,
            "allocated_processors": 6,
            "names": [
                {
                    "name": "Linux",
                    "count": 3
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 20.04.5 LTS",
                    "count": 3
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 3
                }
            ],
            "mem": {
                "total": "9gb",
                "total_in_bytes": 9663676416,
                "adjusted_total": "7.9gb",
                "adjusted_total_in_bytes": 8531214336,
                "free": "108.6mb",
                "free_in_bytes": 113897472,
                "used": "8.8gb",
                "used_in_bytes": 9549778944,
                "free_percent": 1,
                "used_percent": 99
            }
        },
        "process": {
            "cpu": {
                "percent": 26
            },
            "open_file_descriptors": {
                "min": 500,
                "max": 752,
                "avg": 663
            }
        },
        "jvm": {
            "max_uptime": "12.5d",
            "max_uptime_in_millis": 1085303680,
            "versions": [
                {
                    "version": "19.0.2",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "19.0.2+7-44",
                    "vm_vendor": "Oracle Corporation",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 3
                }
            ],
            "mem": {
                "heap_used": "2gb",
                "heap_used_in_bytes": 2199025080,
                "heap_max": "3.9gb",
                "heap_max_in_bytes": 4198498304
            },
            "threads": 291
        },
        "fs": {
            "total": "405gb",
            "total_in_bytes": 434865438720,
            "free": "385.6gb",
            "free_in_bytes": 414124347392,
            "available": "385.6gb",
            "available_in_bytes": 414124347392
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 3
            },
            "http_types": {
                "security4": 3
            }
        },
        "discovery_types": {
            "multi-node": 3
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 3
            }
        ],
        "ingest": {
            "number_of_pipelines": 7,
            "processor_stats": {
                "date": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "pipeline": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "remove": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "set_security_user": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                }
            }
        },
        "indexing_pressure": {
            "memory": {
                "current": {
                    "combined_coordinating_and_primary": "0b",
                    "combined_coordinating_and_primary_in_bytes": 0,
                    "coordinating": "0b",
                    "coordinating_in_bytes": 0,
                    "primary": "0b",
                    "primary_in_bytes": 0,
                    "replica": "0b",
                    "replica_in_bytes": 0,
                    "all": "0b",
                    "all_in_bytes": 0
                },
                "total": {
                    "combined_coordinating_and_primary": "0b",
                    "combined_coordinating_and_primary_in_bytes": 0,
                    "coordinating": "0b",
                    "coordinating_in_bytes": 0,
                    "primary": "0b",
                    "primary_in_bytes": 0,
                    "replica": "0b",
                    "replica_in_bytes": 0,
                    "all": "0b",
                    "all_in_bytes": 0,
                    "coordinating_rejections": 0,
                    "primary_rejections": 0,
                    "replica_rejections": 0
                },
                "limit": "0b",
                "limit_in_bytes": 0
            }
        }
    }
}

Can you provide details on the cluster config for that?

Doing my best attempt at guessing what piece of information you are missing ...

This is a Elastic Cloud instance with data instances in two AZs and a tiebreaker instance in a third AZ. Only HOT tier is in use atm. Data instances are GCP.ES.DATAHOT.N2.68X32X45 meaning 8 GB RAM, 360 GB disk, and "up to 8vCPU" each. Both Memory pressure and CPU usage are under 50% at peak.

I have one index configured with two primaries and replica set to one. So both data nodes are doing ingestion and have a complete set of the data (primary A + replica B) to respond to queries standalone.

As said:
index rate is 25.43 /s at peak time
search rate is 112.35 /s at peak time

Both index and search response times are good (way below 100ms)

All looks nice to my eyes ...

With that index and query rate, we have a certain amount of updateByQuery and (much less) deleteByQuery operations. updateByQuery seems to perform nicely. deleteByQuery including a particular document always comes after an updateByQuery (in our document lifecycle documents are updated ONCE then may be deleted after that update. Where "after" could be: never, at some point, or close in time after few seconds.)

deleteByQuery responds with 409 frequently so I implemented a call to refresh before attempting the same delete request again which seems to succeed 99% of the time. However I observed some of the operations involving refresh spiking in latency up to 10 seconds.