### Elasticsearch Version
8.10.4 and 8.13.0-SNAPSHOT
### Installed Plugins
_No response_
### Java Version
_bundled_
### OS Version
ESS Cloud and macOS
### Problem Description
A customer has reported that highlighting (added by Kibana) is causing their queries in Kibana to run extremely slowly. Profiling the queries showed that 99% of the search time is spent in the highlighting code paths. In their case, the ConstantScoreQuery took less than 1 second, but the highlighting ran for ~850s (this is reproducible).
I was able to reproduce this to some degree locally (details below), where the highlighting takes ~50% of the total search time.
The root cause is not known. Key points:
- the query being run (by Kibana) is a match_phrase against a `match_only_text` field. Since `match_only_text` does not index positions, phrase matching and highlighting have to re-verify matches against `_source`, so some extra cost is expected, but presumably not this much.
### Steps to Reproduce
Here is how I (partially) reproduced the issue locally.
Use the following mapping to create an index where `message` is mapped as `match_only_text` and a single shard is created (I didn't test with more than 1 shard).
<details>
<summary><b>Mapping</b> (toggle to view)</summary>
<pre>
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "match_only_text"
      },
      "kubernetes": {
        "properties": {
          "container": {
            "properties": {
              "name": {
                "type": "keyword",
                "ignore_above": 1024
              }
            }
          },
          "daemonset": {
            "properties": {
              "name": {
                "type": "keyword",
                "ignore_above": 1024
              }
            }
          },
          "deployment": {
            "properties": {
              "name": {
                "type": "keyword",
                "ignore_above": 1024
              }
            }
          }
        }
      },
      "bytes_sent": {
        "type": "long"
      },
      "content_type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "geoip_location_lat": {
        "type": "float"
      },
      "geoip_location_lon": {
        "type": "float"
      },
      "is_https": {
        "type": "boolean"
      },
      "request": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "response": {
        "type": "long"
      },
      "runtime_ms": {
        "type": "long"
      },
      "url": {
        "type": "keyword"
      },
      "user_Agent": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "verb": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
</pre>
</details>
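For reference, the index can then be created with a plain PUT (a sketch; the mapping file name `sdh7687-mappings.json` is my own, and the index name `sdh7687` matches the loading script below):

```
# Create the 1-shard index from the mapping above
# (assumes the mapping is saved locally as sdh7687-mappings.json).
curl -s -XPUT "http://localhost:9200/sdh7687" \
  -H 'Content-Type: application/json' \
  --data-binary @sdh7687-mappings.json
```

Note that the loading script below also creates the index, so this step is only needed if you index the data some other way.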
Populate this index with at least 2.5 million records in scope for the search (see my table in comments below).
<details>
<summary><b>Data generation script</b> (toggle to view)</summary>
```
import argparse
import ndjson
from faker import Faker
from datetime import datetime, timedelta
import random


def generate_fake_data(num_lines):
    fake = Faker()
    data = []
    timestamp = datetime.strptime("2021-04-06T14:00:00", "%Y-%m-%dT%H:%M:%S")
    for _ in range(num_lines):
        timestamp_str = timestamp.strftime("%Y-%m-%dT%H:%M:%S.000Z")
        timestamp += timedelta(seconds=5)
        # Generate a message with at least 85 words
        message_word_count = random.randint(85, 225)
        message = fake.paragraph(nb_sentences=1, ext_word_list=None)
        while len(message.split()) < message_word_count:
            message += " " + fake.paragraph(nb_sentences=1, ext_word_list=None)
        ri = random.randint(1, 4)
        if ri == 2:
            message += ". request errored 123 forty-five 12:45 foobar " + fake.word() + " " + fake.word()
        elif ri == 3:
            message = "12345 California request completed - " + message
        deployment_name = "service-integrations"
        if ri > 2:
            deployment_name = fake.word()
        json_data = {
            "@timestamp": timestamp_str,
            "user_Agent": fake.user_agent(),
            "url": fake.uri_path(),
            "content_type": fake.mime_type(),
            "is_https": fake.boolean(),
            "response": fake.random_int(min=100, max=599),
            "verb": fake.http_method(),
            "geoip_location_lon": float(fake.longitude()),
            "geoip_location_lat": float(fake.latitude()),
            "bytes_sent": fake.random_int(min=1000, max=50000),
            "runtime_ms": fake.random_int(min=100, max=1000),
            "message": message,
            "kubernetes": {
                "container": {"name": fake.word()},
                "daemonset": {"name": fake.word()},
                "deployment": {"name": deployment_name}
            }
        }
        data.append(json_data)
    return data


def main():
    parser = argparse.ArgumentParser(description="Generate ndjson with Faker")
    parser.add_argument("num_lines", type=int, help="Number of lines to produce")
    args = parser.parse_args()
    fake_data = generate_fake_data(args.num_lines + 1)
    output_file_name = f"sdh7687-data-{args.num_lines}.json"
    with open(output_file_name, "w") as f:
        ndjson.dump(fake_data, f, ensure_ascii=False)


if __name__ == "__main__":
    main()
```
</details>
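To produce a data file, run the script with the desired line count (a sketch; `generate_data.py` is my name for the script above, and the `faker` and `ndjson` packages must be installed):

```
pip install faker ndjson
python generate_data.py 200000    # writes sdh7687-data-200000.json
```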
<details>
<summary><b>Loading script</b> (toggle to view)</summary>
```
#!/bin/bash

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <port> <input_file>"
    exit 1
fi

port="$1"
input="$2"
output="bulk.log"
counter=0
max_rows=10000
create='{"create": {}}'
bulk_data=$'\n'
default_port=9200

echo "Using port: $port"
echo "Creating sdh7687 index with mappings to http://localhost:$port ..."
/usr/bin/curl -s -XPUT "http://localhost:$port/sdh7687" -H 'Content-Type: application/json' --insecure --data-binary @sdh7687-mappings-not-match-only-text.json
echo ""

echo "Reading sdh7687 events from $input..."
while read -r log_event
do
    let "counter=counter+1"
    bulk_data+="$create"$'\n'"$log_event"$'\n'
    if [ $counter -eq $max_rows ]
    then
        echo "Indexing $counter documents..."
        bulk_data+=$'\n'
        echo "$bulk_data" | tee temp.json > /dev/null
        /usr/bin/curl -XPOST "http://localhost:$port/sdh7687/_bulk" -H 'Content-Type: application/x-ndjson' -# --progress-bar --insecure --data-binary @temp.json >> "$output"
        rm -f temp.json
        counter=0
        bulk_data=$'\n'
    fi
done < "$input"

if [ $counter -lt $max_rows ] && [ $counter -gt 0 ]
then
    echo "Indexing $counter documents..."
    bulk_data+=$'\n'
    echo "$bulk_data" | tee temp.json > /dev/null
    /usr/bin/curl -XPOST "http://localhost:$port/sdh7687/_bulk" -H 'Content-Type: application/x-ndjson' -# --progress-bar --insecure --data-binary @temp.json >> "$output"
    rm -f temp.json
fi
```
</details>
Note: I ran the two scripts above dozens of times, creating data files of 100,000 to 200,000 entries each, so that millions of documents were indexed while most of them fell within a small-ish `@timestamp` range.
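A minimal sketch of that loop (the script names `generate_data.py` and `load.sh` are my own):

```
# Generate and index ~5M documents in 200k-document batches; each run
# restarts the generator's timestamps at the same base time, so most
# documents land inside a narrow @timestamp window.
for i in $(seq 1 25); do
    python generate_data.py 200000
    ./load.sh 9200 sdh7687-data-200000.json
done
```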
Then run the following query (it has both profiling and highlighting turned on).
<details>
<summary><b>ES Query</b> (toggle to view)</summary>
<pre>
POST sdh7687/_async_search
{
  "profile": true,
  "track_total_hits": false,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "fields": [
    {
      "field": "*",
      "include_unmapped": "true"
    },
    {
      "field": "@timestamp",
      "format": "strict_date_optional_time"
    }
  ],
  "size": 10,
  "version": true,
  "script_fields": {},
  "stored_fields": [
    "*"
  ],
  "runtime_mappings": {},
  "_source": false,
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2021-01-02T06:40:00.000Z",
              "lte": "2021-04-10T06:49:07.221Z"
            }
          }
        },
        {
          "match_phrase": {
            "kubernetes.deployment.name": "service-integrations"
          }
        }
      ],
      "should": [],
      "must_not": [
        {
          "match_phrase": {
            "message": "request errored"
          }
        },
        {
          "match_phrase": {
            "message": "request completed"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}
</pre>
</details>
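To isolate the highlighting cost, one option (my suggestion, not part of the customer case) is to run the same body with and without the `highlight` section and compare the reported `took` values. This sketch assumes the JSON body above is saved as `query.json` (without the `POST` line) and that `jq` is available:

```
# With highlighting.
curl -s -XPOST "http://localhost:9200/sdh7687/_async_search?wait_for_completion_timeout=10m" \
  -H 'Content-Type: application/json' --data-binary @query.json | jq '.response.took'

# Without highlighting.
jq 'del(.highlight)' query.json > query-no-highlight.json
curl -s -XPOST "http://localhost:9200/sdh7687/_async_search?wait_for_completion_timeout=10m" \
  -H 'Content-Type: application/json' --data-binary @query-no-highlight.json | jq '.response.took'
```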
### Logs (if relevant)
_No response_