### Elasticsearch Version
8.10.4 and 8.13.0-SNAPSHOT
### Installed Plugins
_No response_
### Java Version
_bundled_
### OS Version
ESS Cloud and macOS
### Problem Description
A customer has reported that highlighting (added by Kibana) is causing their queries in Kibana to run extremely slowly. Profiling the queries showed that 99% of the search time is spent in the highlighting code paths. In their case, the ConstantScoreQuery took less than 1 second, but the highlighting ran for ~850s (this is reproducible).
I was able to reproduce this to some degree locally (details below), where the highlighting takes ~50% of the total search time.
The root cause is not known. Key points:
- the query being run (by Kibana) is a match_phrase against a `match_only_text` field. Since `match_only_text` does not index positions, phrase matching and highlighting have to re-verify matches against `_source`, so some extra cost is expected, but presumably not this much.
### Steps to Reproduce
Here is how I (partially) reproduced the issue locally.
Use the following mapping to create an index where `message` is mapped as `match_only_text` and a single shard is created (I didn't test with more than 1 shard).
<details>
<summary><b>Mapping</b> (toggle to view)</summary>
<pre>
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "match_only_text"
      },
      "kubernetes": {
        "properties": {
          "container": {
            "properties": {
              "name": {
                "type": "keyword",
                "ignore_above": 1024
              }
            }
          },
          "daemonset": {
            "properties": {
              "name": {
                "type": "keyword",
                "ignore_above": 1024
              }
            }
          },
          "deployment": {
            "properties": {
              "name": {
                "type": "keyword",
                "ignore_above": 1024
              }
            }
          }
        }
      },
      "bytes_sent": {
        "type": "long"
      },
      "content_type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "geoip_location_lat": {
        "type": "float"
      },
      "geoip_location_lon": {
        "type": "float"
      },
      "is_https": {
        "type": "boolean"
      },
      "request": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "response": {
        "type": "long"
      },
      "runtime_ms": {
        "type": "long"
      },
      "url": {
        "type": "keyword"
      },
      "user_Agent": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "verb": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
</pre>
</details>
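For reference, the index can then be created with a plain PUT (a sketch; the mapping file name `sdh7687-mappings.json` is my own, and the index name `sdh7687` matches the loading script below):

```
# Create the 1-shard index from the mapping above
# (assumes the mapping is saved locally as sdh7687-mappings.json).
curl -s -XPUT "http://localhost:9200/sdh7687" \
  -H 'Content-Type: application/json' \
  --data-binary @sdh7687-mappings.json
```

Note that the loading script below also creates the index, so this step is only needed if you index the data some other way.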
Populate this index with at least 2.5 million records in scope for the search (see my table in comments below).
<details>
<summary><b>Data generation script</b> (toggle to view)</summary>
```
import argparse
import ndjson
from faker import Faker
from datetime import datetime, timedelta
import random


def generate_fake_data(num_lines):
    fake = Faker()
    data = []
    timestamp = datetime.strptime("2021-04-06T14:00:00", "%Y-%m-%dT%H:%M:%S")
    for _ in range(num_lines):
        timestamp_str = timestamp.strftime("%Y-%m-%dT%H:%M:%S.000Z")
        timestamp += timedelta(seconds=5)
        # Generate a message with at least 85 words
        message_word_count = random.randint(85, 225)
        message = fake.paragraph(nb_sentences=1, ext_word_list=None)
        while len(message.split()) < message_word_count:
            message += " " + fake.paragraph(nb_sentences=1, ext_word_list=None)
        ri = random.randint(1, 4)
        if ri == 2:
            message += ". request errored 123 forty-five 12:45 foobar " + fake.word() + " " + fake.word()
        elif ri == 3:
            message = "12345 California request completed - " + message
        deployment_name = "service-integrations"
        if ri > 2:
            deployment_name = fake.word()
        json_data = {
            "@timestamp": timestamp_str,
            "user_Agent": fake.user_agent(),
            "url": fake.uri_path(),
            "content_type": fake.mime_type(),
            "is_https": fake.boolean(),
            "response": fake.random_int(min=100, max=599),
            "verb": fake.http_method(),
            "geoip_location_lon": float(fake.longitude()),
            "geoip_location_lat": float(fake.latitude()),
            "bytes_sent": fake.random_int(min=1000, max=50000),
            "runtime_ms": fake.random_int(min=100, max=1000),
            "message": message,
            "kubernetes": {
                "container": {"name": fake.word()},
                "daemonset": {"name": fake.word()},
                "deployment": {"name": deployment_name}
            }
        }
        data.append(json_data)
    return data


def main():
    parser = argparse.ArgumentParser(description="Generate ndjson with Faker")
    parser.add_argument("num_lines", type=int, help="Number of lines to produce")
    args = parser.parse_args()
    fake_data = generate_fake_data(args.num_lines + 1)
    output_file_name = f"sdh7687-data-{args.num_lines}.json"
    with open(output_file_name, "w") as f:
        ndjson.dump(fake_data, f, ensure_ascii=False)


if __name__ == "__main__":
    main()
```
</details>
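To produce a data file, run the script with the desired line count (a sketch; `generate_data.py` is my name for the script above, and the `faker` and `ndjson` packages must be installed):

```
pip install faker ndjson
python generate_data.py 200000    # writes sdh7687-data-200000.json
```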
<details>
<summary><b>Loading script</b> (toggle to view)</summary>
```
#!/bin/bash

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <port> <input_file>"
    exit 1
fi

port="$1"
input="$2"
output="bulk.log"
counter=0
max_rows=10000
create='{"create": {}}'
bulk_data=$'\n'
default_port=9200

echo "Using port: $port"
echo "Creating sdh7687 index with mappings to http://localhost:$port ..."
/usr/bin/curl -s -XPUT "http://localhost:$port/sdh7687" -H 'Content-Type: application/json' --insecure --data-binary @sdh7687-mappings-not-match-only-text.json
echo ""

echo "Reading sdh7687 events from $input..."
while read -r log_event
do
    let "counter=counter+1"
    bulk_data+="$create"$'\n'"$log_event"$'\n'
    if [ $counter -eq $max_rows ]
    then
        echo "Indexing $counter documents..."
        bulk_data+=$'\n'
        echo "$bulk_data" | tee temp.json > /dev/null
        /usr/bin/curl -XPOST "http://localhost:$port/sdh7687/_bulk" -H 'Content-Type: application/x-ndjson' -# --progress-bar --insecure --data-binary @temp.json >> "$output"
        rm -f temp.json
        counter=0
        bulk_data=$'\n'
    fi
done < "$input"

if [ $counter -lt $max_rows ] && [ $counter -gt 0 ]
then
    echo "Indexing $counter documents..."
    bulk_data+=$'\n'
    echo "$bulk_data" | tee temp.json > /dev/null
    /usr/bin/curl -XPOST "http://localhost:$port/sdh7687/_bulk" -H 'Content-Type: application/x-ndjson' -# --progress-bar --insecure --data-binary @temp.json >> "$output"
    rm -f temp.json
fi
```
</details>
Note: I ran the two scripts above dozens of times, creating data files of 100,000 to 200,000 entries each, so that millions of documents were indexed while most of them fell within a small-ish `@timestamp` range.
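A minimal sketch of that loop (the script names `generate_data.py` and `load.sh` are my own):

```
# Generate and index ~5M documents in 200k-document batches; each run
# restarts the generator's timestamps at the same base time, so most
# documents land inside a narrow @timestamp window.
for i in $(seq 1 25); do
    python generate_data.py 200000
    ./load.sh 9200 sdh7687-data-200000.json
done
```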
Then run the following query (it has both profiling and highlighting turned on).
<details>
<summary><b>ES Query</b> (toggle to view)</summary>
<pre>
POST sdh7687/_async_search
{
  "profile": true,
  "track_total_hits": false,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "fields": [
    {
      "field": "*",
      "include_unmapped": "true"
    },
    {
      "field": "@timestamp",
      "format": "strict_date_optional_time"
    }
  ],
  "size": 10,
  "version": true,
  "script_fields": {},
  "stored_fields": [
    "*"
  ],
  "runtime_mappings": {},
  "_source": false,
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2021-01-02T06:40:00.000Z",
              "lte": "2021-04-10T06:49:07.221Z"
            }
          }
        },
        {
          "match_phrase": {
            "kubernetes.deployment.name": "service-integrations"
          }
        }
      ],
      "should": [],
      "must_not": [
        {
          "match_phrase": {
            "message": "request errored"
          }
        },
        {
          "match_phrase": {
            "message": "request completed"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}
</pre>
</details>
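To isolate the highlighting cost, one option (my suggestion, not part of the customer case) is to run the same body with and without the `highlight` section and compare the reported `took` values. This sketch assumes the JSON body above is saved as `query.json` (without the `POST` line) and that `jq` is available:

```
# With highlighting.
curl -s -XPOST "http://localhost:9200/sdh7687/_async_search?wait_for_completion_timeout=10m" \
  -H 'Content-Type: application/json' --data-binary @query.json | jq '.response.took'

# Without highlighting.
jq 'del(.highlight)' query.json > query-no-highlight.json
curl -s -XPOST "http://localhost:9200/sdh7687/_async_search?wait_for_completion_timeout=10m" \
  -H 'Content-Type: application/json' --data-binary @query-no-highlight.json | jq '.response.took'
```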
### Logs (if relevant)
_No response_