Limiting queries from the Elasticsearch service side

Hello,

I would like to know if there is a way to limit incoming queries from the Elasticsearch service side, for example, restricting total hits, size, or the number of searched shards to a specific value?

In our environment, different teams depend on our Elastic Stack, and we cannot control what types of queries they generate, so it would be best to apply these limits from the service side.

The main origin of this question is that some queries take a long time to execute (over 10s) and cause high IO usage on our data nodes.

A bit of info about the cluster:
15 data nodes
4712 active shards with replicas included
All servers have 8 cores, 64 GB RAM, and SSD storage

Slowlog:

{
  "type": "index_search_slowlog",
  "timestamp": "2022-02-07T14:58:20,251Z",
  "level": "WARN",
  "component": "i.s.s.query",
  "cluster.name": "elasticcluster",
  "node.name": "elastic1",
  "message": "[o-123123][9]",
  "took": "13.1s",
  "took_millis": "13122",
  "total_hits": "0 hits",
  "types": "[]",
  "stats": "[]",
  "search_type": "QUERY_THEN_FETCH",
  "total_shards": "350",
  "source": "{\"size\":0,\"query\":{\"bool\":{\"must\":[{\"exists\":{\"field\":\"request\",\"boost\":1.0}},{\"range\":{\"@timestamp\":{\"from\":null,\"to\":null,\"include_lower\":true,\"include_upper\":true,\"boost\":1.0}}},{\"match_phrase\":{\"some_string\":{\"query\":\"some_string\",\"slop\":0,\"zero_terms_query\":\"NONE\",\"boost\":1.0}}},{\"match_phrase\":{\"some_other_string\":{\"query\":\"some_other_string\",\"slop\":0,\"zero_terms_query\":\"NONE\",\"boost\":1.0}}}],\"should\":[{\"match_phrase\":{\"host\":{\"query\":\"some_third_string\",\"slop\":0,\"zero_terms_query\":\"NONE\",\"boost\":1.0}}},{\"match_phrase\":{\"host\":{\"query\":\"some_third_string\",\"slop\":0,\"zero_terms_query\":\"NONE\",\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"minimum_should_match\":\"1\",\"boost\":1.0}},\"aggregations\":{\"some_string\":{\"terms\":{\"field\":\"some_other_string.keyword\",\"size\":10,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]}}}}",
  "cluster.uuid": "cluster",
  "node.id": "node"
}

Help is highly appreciated!

No, there's currently nothing native that can do that in Elasticsearch.

Thank you! Do you maybe have any tips regarding the high IO usage?

What is the output from the _cluster/stats?pretty&human API?

curl "http://localhost:9200/_cluster/stats?pretty&human"
{
    "_nodes": {
        "total": 20,
        "successful": 20,
        "failed": 0
    },
    "cluster_name": "cluster",
    "cluster_uuid": "25LDj96TSs2GjkmLlOn-OA",
    "timestamp": 1644311537011,
    "status": "green",
    "indices": {
        "count": 235,
        "shards": {
            "total": 4712,
            "primaries": 2356,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 100,
                    "avg": 20.051063829787235
                },
                "primaries": {
                    "min": 1,
                    "max": 50,
                    "avg": 10.025531914893618
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 18887509358,
            "deleted": 218355
        },
        "store": {
            "size_in_bytes": 16873709361168,
            "total_data_set_size_in_bytes": 16873709361168,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 97503838520,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 9225152079,
            "total_count": 583053762,
            "hit_count": 38065809,
            "miss_count": 544987953,
            "cache_size": 3403423,
            "cache_count": 8806295,
            "evictions": 5402872
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 62109,
            "memory_in_bytes": 649240756,
            "terms_memory_in_bytes": 474761584,
            "stored_fields_memory_in_bytes": 53829864,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 65843840,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 54805468,
            "index_writer_memory_in_bytes": 323520852,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 59416,
            "max_unsafe_auto_id_timestamp": 1644279514936,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "boolean",
                    "count": 6,
                    "index_count": 5,
                    "script_count": 0
                },
                {
                    "name": "date",
                    "count": 420,
                    "index_count": 210,
                    "script_count": 0
                },
                {
                    "name": "float",
                    "count": 83,
                    "index_count": 45,
                    "script_count": 0
                },
                {
                    "name": "keyword",
                    "count": 4438,
                    "index_count": 212,
                    "script_count": 0
                },
                {
                    "name": "long",
                    "count": 1155,
                    "index_count": 201,
                    "script_count": 0
                },
                {
                    "name": "nested",
                    "count": 7,
                    "index_count": 7,
                    "script_count": 0
                },
                {
                    "name": "object",
                    "count": 1285,
                    "index_count": 75,
                    "script_count": 0
                },
                {
                    "name": "text",
                    "count": 4126,
                    "index_count": 212,
                    "script_count": 0
                },
                {
                    "name": "version",
                    "count": 5,
                    "index_count": 5,
                    "script_count": 0
                }
            ],
            "runtime_field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        },
        "versions": [
            {
                "version": "6.2.3",
                "index_count": 3,
                "primary_shard_count": 21,
                "total_primary_bytes": 1719407
            },
            {
                "version": "6.8.9",
                "index_count": 1,
                "primary_shard_count": 1,
                "total_primary_bytes": 353613
            },
            {
                "version": "6.8.12",
                "index_count": 3,
                "primary_shard_count": 3,
                "total_primary_bytes": 166382
            },
            {
                "version": "6.8.18",
                "index_count": 1,
                "primary_shard_count": 15,
                "total_primary_bytes": 10137
            },
            {
                "version": "7.14.0",
                "index_count": 6,
                "primary_shard_count": 33,
                "total_primary_bytes": 41386595
            },
            {
                "version": "7.15.1",
                "index_count": 19,
                "primary_shard_count": 154,
                "total_primary_bytes": 20272880
            },
            {
                "version": "7.15.2",
                "index_count": 27,
                "primary_shard_count": 243,
                "total_primary_bytes": 25559782
            },
            {
                "version": "7.16.1",
                "index_count": 70,
                "primary_shard_count": 592,
                "total_primary_bytes": 337686372021
            },
            {
                "version": "7.16.3",
                "index_count": 105,
                "primary_shard_count": 1294,
                "total_primary_bytes": 8098992303526
            }
        ]
    },
    "nodes": {
        "count": {
            "total": 20,
            "coordinating_only": 2,
            "data": 15,
            "data_cold": 0,
            "data_content": 0,
            "data_frozen": 0,
            "data_hot": 0,
            "data_warm": 0,
            "ingest": 0,
            "master": 3,
            "ml": 0,
            "remote_cluster_client": 0,
            "transform": 0,
            "voting_only": 0
        },
        "versions": [
            "7.16.3",
            "7.16.1"
        ],
        "os": {
            "available_processors": 148,
            "allocated_processors": 148,
            "names": [
                {
                    "name": "Linux",
                    "count": 20
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "CentOS Linux 7 (Core)",
                    "count": 20
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 20
                }
            ],
            "mem": {
                "total_in_bytes": 1094094352384,
                "free_in_bytes": 14756794368,
                "used_in_bytes": 1079337558016,
                "free_percent": 1,
                "used_percent": 99
            }
        },
        "process": {
            "cpu": {
                "percent": 326
            },
            "open_file_descriptors": {
                "min": 728,
                "max": 4422,
                "avg": 3399
            }
        },
        "jvm": {
            "max_uptime_in_millis": 4567175654,
            "versions": [
                {
                    "version": "17.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "17.0.1+12",
                    "vm_vendor": "Eclipse Adoptium",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 20
                }
            ],
            "mem": {
                "heap_used_in_bytes": 302090181016,
                "heap_max_in_bytes": 540939386880
            },
            "threads": 2252
        },
        "fs": {
            "total_in_bytes": 31030456164352,
            "free_in_bytes": 14022601719808,
            "available_in_bytes": 14022601744384
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "netty4": 20
            },
            "http_types": {
                "netty4": 20
            }
        },
        "discovery_types": {
            "zen": 20
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "rpm",
                "count": 20
            }
        ],
        "ingest": {
            "number_of_pipelines": 3,
            "processor_stats": {
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "rename": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "set": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                }
            }
        }
    }
}

If you're trying to track down the cause of high CPU/IO, one option would be the slow log. This would allow you to see most of the intensive queries being run against your cluster/indices.
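
For reference, slow logging is configured per index through dynamic index settings, so you can lower the thresholds to catch more of the offending queries. A minimal sketch, where the index name my-index and the threshold values are placeholders, not recommendations:

curl -X PUT "http://localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
'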

Another option is more on the developer side of things: make sure the developers understand the types of queries they're implementing by first having them run through something like the search profiler. This would help ensure that the developers write efficient queries, or at least understand where their bottlenecks are.
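
If the developers prefer the API over the Kibana Search Profiler UI, the same timing breakdown is available by adding "profile": true to a search request. A rough sketch, with a placeholder index and field name rather than your real ones:

curl -X GET "http://localhost:9200/my-index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": {
    "match_phrase": { "some_string": "some value" }
  }
}
'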

Big thanks for your time, @BenB196 and @warkolm!

As you see in the original post, I already posted a slowlog with one of the problematic queries, and the main question was how to limit such queries.

Fantastic information on the profiler, I did not know such a thing existed; I will run this by our devs.

The cluster is performing great on the other indexes, but this one is more than 2 TB in size, and it seems to be the only one causing high IO. Would you have any tips on how to deal with such issues? What solutions work best for you when dealing with big data sets? Maybe more or fewer shards would help? Scaling? Data streams?

Appreciate the help, already helped a lot!

You could try to set search.allow_expensive_queries to false to disable any expensive queries.

But this could break some queries, and if you are using the Kibana Alerts interface you can't use this setting, as it will break the alerts.
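
If you do decide to try it, it is a dynamic cluster setting, so something along these lines should apply it without a restart (sketch only; setting it back to true, or to null, reverts it):

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.allow_expensive_queries": false
  }
}
'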

Ah, if it is only one index that is having issues, then the additional slow log recommendation doesn't really apply as you have slow logs already turned on for that index.

Another thing that you can look at is optimizing for caching. Elasticsearch caching deep dive: Boosting query speed one cache at a time | Elastic Blog provides a good overview of caching, and there is a good deal more that can be investigated around it. If the queries are fairly similar, then writing them so they can be cached may help with the performance issues.
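
As a concrete example, the shard request cache is the one most relevant to size:0 aggregation queries like the one in your slow log. It is enabled by default, but you can verify or force it per index and per request; a sketch with a placeholder index, aggregation, and field name:

# Make sure the request cache is enabled for the index (it is by default)
curl -X PUT "http://localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.requests.cache.enable": true
}
'

# Explicitly ask for the request cache on an individual size:0 search
curl -X GET "http://localhost:9200/my-index/_search?request_cache=true&size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "hosts": { "terms": { "field": "host.keyword" } }
  }
}
'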

Thank you, I will try out all of your suggestions and see what produces the best results :bowing_man:

After further investigation, we found out that these IO spikes bottlenecking our search were caused by overutilized VMware storage. Apparently, the physical server could not handle such a high read load from several virtual machines (summing up to 1.5 GB/s of reads). We will rebalance these nodes and get back to you with the results.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.