Refresh API taking too long

Hi folks,

I am having some 409 - versioning conflict problems when using deleteByQuery so I decided to run a POST /_refresh when I receive a 409 and just before trying again.

The rate of operations recovery using this is very good, and the cluster seems not impacted too much coz I don't have that many errors.

However, there is very common for me to see that request taking up to TEN SECONDS to finish
and thats very bad.

I have ONE index with two primaries and one replica, 24m docs, taking 17.4 GB
My index rate is 25.43 /s at peak time
My search rate is 112.35 /s at peak time

Heap and caches are stable
Memory pressure is under 60% at peak time
CPU usage is under 40% on peak time

Is it "normal" to see a refresh operation taking that long?

May you please help me to understand this?

Thanks
Rafael

Welcome to our community! :smiley:

What sort of storage are you using?
What is the output from the _cluster/stats?pretty&human API?

Thanks warkolm

This is on Elastic Cloud

... but still I can have this output from the stats API ...

{
    "_nodes": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "cluster_name": "XXXXXX",
    "cluster_uuid": "XXXXX",
    "timestamp": 1684320383008,
    "status": "green",
    "indices": {
        "count": 29,
        "shards": {
            "total": 60,
            "primaries": 30,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 4,
                    "avg": 2.0689655172413794
                },
                "primaries": {
                    "min": 1,
                    "max": 2,
                    "avg": 1.0344827586206897
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 25179424,
            "deleted": 2416881
        },
        "store": {
            "size": "18.2gb",
            "size_in_bytes": 19626975383,
            "total_data_set_size": "18.2gb",
            "total_data_set_size_in_bytes": 19626975383,
            "reserved": "0b",
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size": "317.6mb",
            "memory_size_in_bytes": 333087624,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "11.3mb",
            "memory_size_in_bytes": 11877519,
            "total_count": 105338067,
            "hit_count": 2874090,
            "miss_count": 102463977,
            "cache_size": 46804,
            "cache_count": 46805,
            "evictions": 1
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 336,
            "memory": "0b",
            "memory_in_bytes": 0,
            "terms_memory": "0b",
            "terms_memory_in_bytes": 0,
            "stored_fields_memory": "0b",
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory": "0b",
            "term_vectors_memory_in_bytes": 0,
            "norms_memory": "0b",
            "norms_memory_in_bytes": 0,
            "points_memory": "0b",
            "points_memory_in_bytes": 0,
            "doc_values_memory": "0b",
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory": "1.1mb",
            "index_writer_memory_in_bytes": 1230240,
            "version_map_memory": "47.4kb",
            "version_map_memory_in_bytes": 48549,
            "fixed_bit_set": "5.8kb",
            "fixed_bit_set_memory_in_bytes": 5976,
            "max_unsafe_auto_id_timestamp": 1684265972184,
            "file_sizes": {}
        },
        "mappings": {
            "total_field_count": 727,
            "total_deduplicated_field_count": 206,
            "total_deduplicated_mapping_size": "22.7kb",
            "total_deduplicated_mapping_size_in_bytes": 23313,
            "field_types": [
                {
                    "name": "boolean",
                    "count": 21,
                    "index_count": 13,
                    "script_count": 0
                },
                {
                    "name": "constant_keyword",
                    "count": 6,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "date",
                    "count": 58,
                    "index_count": 16,
                    "script_count": 0
                },
                {
                    "name": "float",
                    "count": 10,
                    "index_count": 5,
                    "script_count": 0
                },
                {
                    "name": "integer",
                    "count": 1,
                    "index_count": 1,
                    "script_count": 0
                },
                {
                    "name": "ip",
                    "count": 2,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "keyword",
                    "count": 348,
                    "index_count": 16,
                    "script_count": 0
                },
                {
                    "name": "long",
                    "count": 119,
                    "index_count": 9,
                    "script_count": 0
                },
                {
                    "name": "nested",
                    "count": 5,
                    "index_count": 5,
                    "script_count": 0
                },
                {
                    "name": "object",
                    "count": 108,
                    "index_count": 15,
                    "script_count": 0
                },
                {
                    "name": "text",
                    "count": 44,
                    "index_count": 16,
                    "script_count": 0
                },
                {
                    "name": "version",
                    "count": 5,
                    "index_count": 5,
                    "script_count": 0
                }
            ],
            "runtime_field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": [
                {
                    "name": "whitespace",
                    "count": 1,
                    "index_count": 1
                }
            ]
        },
        "versions": [
            {
                "version": "8.6.1",
                "index_count": 12,
                "primary_shard_count": 12,
                "total_primary_size": "47.7mb",
                "total_primary_bytes": 50042260
            },
            {
                "version": "8.6.2",
                "index_count": 17,
                "primary_shard_count": 18,
                "total_primary_size": "8.9gb",
                "total_primary_bytes": 9599977028
            }
        ],
        "search": {
            "total": 2165936,
            "queries": {
                "match_phrase": 20,
                "bool": 2165680,
                "terms": 3608,
                "prefix": 1,
                "match": 2143451,
                "match_phrase_prefix": 1,
                "exists": 6400,
                "range": 2112471,
                "term": 18868,
                "nested": 1,
                "simple_query_string": 1815,
                "wildcard": 1
            },
            "sections": {
                "runtime_mappings": 10899,
                "query": 2165857,
                "terminate_after": 1,
                "pit": 260,
                "_source": 100,
                "aggs": 82819
            }
        }
    },
    "nodes": {
        "count": {
            "total": 3,
            "coordinating_only": 0,
            "data": 0,
            "data_cold": 0,
            "data_content": 2,
            "data_frozen": 0,
            "data_hot": 2,
            "data_warm": 0,
            "index": 0,
            "ingest": 2,
            "master": 3,
            "ml": 0,
            "remote_cluster_client": 2,
            "search": 0,
            "transform": 2,
            "voting_only": 1
        },
        "versions": [
            "8.6.2"
        ],
        "os": {
            "available_processors": 6,
            "allocated_processors": 6,
            "names": [
                {
                    "name": "Linux",
                    "count": 3
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 20.04.5 LTS",
                    "count": 3
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 3
                }
            ],
            "mem": {
                "total": "9gb",
                "total_in_bytes": 9663676416,
                "adjusted_total": "7.9gb",
                "adjusted_total_in_bytes": 8531214336,
                "free": "108.6mb",
                "free_in_bytes": 113897472,
                "used": "8.8gb",
                "used_in_bytes": 9549778944,
                "free_percent": 1,
                "used_percent": 99
            }
        },
        "process": {
            "cpu": {
                "percent": 26
            },
            "open_file_descriptors": {
                "min": 500,
                "max": 752,
                "avg": 663
            }
        },
        "jvm": {
            "max_uptime": "12.5d",
            "max_uptime_in_millis": 1085303680,
            "versions": [
                {
                    "version": "19.0.2",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "19.0.2+7-44",
                    "vm_vendor": "Oracle Corporation",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 3
                }
            ],
            "mem": {
                "heap_used": "2gb",
                "heap_used_in_bytes": 2199025080,
                "heap_max": "3.9gb",
                "heap_max_in_bytes": 4198498304
            },
            "threads": 291
        },
        "fs": {
            "total": "405gb",
            "total_in_bytes": 434865438720,
            "free": "385.6gb",
            "free_in_bytes": 414124347392,
            "available": "385.6gb",
            "available_in_bytes": 414124347392
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 3
            },
            "http_types": {
                "security4": 3
            }
        },
        "discovery_types": {
            "multi-node": 3
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 3
            }
        ],
        "ingest": {
            "number_of_pipelines": 7,
            "processor_stats": {
                "date": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "pipeline": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "remove": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "set_security_user": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                }
            }
        },
        "indexing_pressure": {
            "memory": {
                "current": {
                    "combined_coordinating_and_primary": "0b",
                    "combined_coordinating_and_primary_in_bytes": 0,
                    "coordinating": "0b",
                    "coordinating_in_bytes": 0,
                    "primary": "0b",
                    "primary_in_bytes": 0,
                    "replica": "0b",
                    "replica_in_bytes": 0,
                    "all": "0b",
                    "all_in_bytes": 0
                },
                "total": {
                    "combined_coordinating_and_primary": "0b",
                    "combined_coordinating_and_primary_in_bytes": 0,
                    "coordinating": "0b",
                    "coordinating_in_bytes": 0,
                    "primary": "0b",
                    "primary_in_bytes": 0,
                    "replica": "0b",
                    "replica_in_bytes": 0,
                    "all": "0b",
                    "all_in_bytes": 0,
                    "coordinating_rejections": 0,
                    "primary_rejections": 0,
                    "replica_rejections": 0
                },
                "limit": "0b",
                "limit_in_bytes": 0
            }
        }
    }
}

Can you provide details on the cluster config for that?

Doing my best attempt at guessing what piece of information you are missing ...

This is a Elastic Cloud instance with data instances in two AZs and a tiebreaker instance in a third AZ. Only HOT tier is in use atm. Data instances are GCP.ES.DATAHOT.N2.68X32X45 meaning 8 GB RAM, 360 GB disk, and "up to 8vCPU" each. Both Memory pressure and CPU usage are under 50% at peak.

I have one index configured with two primaries and replica set to one. So both data nodes are doing ingestion and have a complete set of the data (primary A + replica B) to respond to queries standalone.

As said:
index rate is 25.43 /s at peak time
search rate is 112.35 /s at peak time

Both index and search response times are good (way below 100ms)

All looks nice to my eyes ...

With that index and query rate, we have a certain amount of updateByQuery and (much less) deleteByQuery operations. updateByQuery seems to perform nicely. deleteByQuery including a particular document always comes after an updateByQuery (in our document lifecycle documents are updated ONCE then may be deleted after that update. Where "after" could be: never, at some point, or close in time after few seconds.)

deleteByQuery responds with 409 frequently so I implemented a call to refresh before attempting the same delete request again which seems to succeed 99% of the time. However I observed some of the operations involving refresh spiking in latency up to 10 seconds.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.