When dealing with large text fields like abstract (which may contain 500+ terms), using a bool.should query with multiple match or term clauses can significantly impact performance due to the sheer number of terms Elasticsearch attempts to match.
In contrast, the more_like_this (MLT) query internally limits this via max_query_terms, which selects only the top-N most significant terms (default: 25, configurable). This makes MLT queries faster.
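For reference, this is roughly what the MLT side looks like when the request body is built in Python. It is only a sketch: the helper name build_mlt_query is made up for illustration, and min_term_freq is set to 1 as an assumption (so that terms appearing once in the input text are still eligible), not something taken from our setup.

```python
def build_mlt_query(like_text, fields, max_query_terms=25):
    """Build a more_like_this request body (hypothetical helper).

    max_query_terms caps how many terms Elasticsearch selects from
    like_text; 25 is the documented default.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": like_text,
                "max_query_terms": max_query_terms,
                # Assumption: allow terms that occur only once in the input.
                "min_term_freq": 1,
            }
        }
    }


body = build_mlt_query("full abstract text goes here", ["abstract"])
```

The resulting dict can be sent as the JSON body of a _search request; however long the abstract is, the generated query is bounded by max_query_terms.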
Problem:
- We need to avoid performance bottlenecks caused by unbounded or very large token sets (e.g., matching 500+ terms from a full abstract).
- Unfortunately, bool.should doesn’t have a native equivalent of max_query_terms like MLT does.
Are there any recommended steps or parameters that can mimic max_query_terms for use in a bool.should query?
Our sample query:
POST /index1/_search?typed_keys=true
{
  "_source": {
    "includes": ["title", "doi", "publicationYear"]
  },
  "query": {
    "bool": {
      "minimum_should_match": "1",
      "should": [
        {
          "match": {
            "normalizedTitle": {
              "boost": 1.5,
              "analyzer": "stop",
              "auto_generate_synonyms_phrase_query": true,
              "fuzzy_transpositions": true,
              "max_expansions": 50,
              "minimum_should_match": "10%",
              "prefix_length": 0,
              "query": "${title}"
            }
          }
        },
        {
          "match": {
            "abstract": {
              "boost": 1.5,
              "analyzer": "stop",
              "minimum_should_match": "20%",
              "query": "${abstract}"
            }
          }
        }
      ]
    }
  },
  "rescore": [
    {
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "multiply",
            "functions": [
              {
                "script_score": {
                  "script": {
                    "params": { "queryVector": "${paperVector}" },
                    "lang": "painless",
                    "source": "doc['@vector'].size() == 0 ? 1 : (cosineSimilarity(params.queryVector, '@vector') + 1.0)"
                  }
                }
              }
            ],
            "max_boost": 3,
            "query": { "match_all": {} },
            "score_mode": "multiply"
          }
        },
        "query_weight": 1,
        "rescore_query_weight": 1,
        "score_mode": "multiply"
      },
      "window_size": 200
    }
  ],
  "size": 98
}
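To make the question more concrete, here is a rough sketch of the kind of client-side approximation that could stand in for max_query_terms: trim the abstract down to its N most frequent non-stop-word terms before it goes into the match clause. Everything here is an assumption for illustration: the helper names, the tiny stop list (the real stop analyzer uses a longer English list), and above all the ranking by raw frequency in the query text, which is much cruder than the tf-idf selection MLT performs against index statistics.

```python
import re
from collections import Counter

# Illustrative stop list only; not the list used by the "stop" analyzer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def top_terms(text, max_query_terms=25):
    """Return up to max_query_terms distinct terms, most frequent first."""
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if t not in STOP_WORDS
    ]
    counts = Counter(tokens)
    return [term for term, _ in counts.most_common(max_query_terms)]


def build_capped_match(field, text, max_query_terms=25, minimum_should_match="20%"):
    """Build a match clause whose query text holds at most max_query_terms terms."""
    terms = top_terms(text, max_query_terms)
    return {
        "match": {
            field: {
                "query": " ".join(terms),
                "analyzer": "stop",
                "minimum_should_match": minimum_should_match,
            }
        }
    }


clause = build_capped_match("abstract", "the cat and the cat sat on the mat cat", 2)
```

The returned clause would replace the abstract entry in the should array above. Whether frequency within a single abstract is a good enough proxy for term significance is exactly the part we are unsure about.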