Very Large Terms Query

johari · January 22, 2016, 12:46am

I'm pretty new, but it seems one of the most obvious applications of Elasticsearch is filtering a vast number of large documents with a terms filter containing the few words you're interested in. However, I have the inverse use case and I'm wondering whether ES is as well suited to handle it. In my scenario, I have hundreds of millions of small documents and I need a query that returns only documents containing at least one of 1000 words. For example, a typical indexed document might look like:
{
  company_id: 435,
  name: "Company X",
  …(other fields)…,
  visibility_terms: "word_1, word_2, word_3, word_4"
}

and an example query may look like:
{
  size: 30,
  from: 0,
  query: {
    filtered: {
      query: null,
      filter: {
        and: [
          { term: { company_id: 435 } },
          {
            terms: {
              visibility_terms: [
                ..(up to 1000 unique words)..
              ]
            }
          }
        ]
      }
    }
  },
  sort: [{
    name: {
      order: "asc"
    }
  }]
}

Something about having such a large 'or statement' feels wrong and destined to cause performance problems. Has anyone issued large terms queries like this who can shed some light on whether this is a bad idea or not?

jprante · January 22, 2016, 11:17am

1000 unique terms are not a problem, especially not in a filter.

There is a default Lucene limit of 1024 clauses in a boolean query. If you hit it, you could also submit a series of queries and join the result hits.
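As a rough illustration of submitting a series of queries and joining the hits, here is a minimal Python sketch. The names `chunk_terms`, `search_in_batches`, and `run_query` are hypothetical and not part of any Elasticsearch client; `run_query` stands in for whatever function sends one batch of terms to the cluster and returns a list of hit dicts with an `_id` key. The batch size of 1024 is an assumption based on the default limit mentioned above.

```python
from typing import Any, Callable, Dict, List

# Assumed default limit; adjust to whatever your cluster is configured with.
MAX_TERMS_PER_QUERY = 1024

Hit = Dict[str, Any]


def chunk_terms(terms: List[str], size: int = MAX_TERMS_PER_QUERY) -> List[List[str]]:
    """Split terms into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [terms[i:i + size] for i in range(0, len(terms), size)]


def search_in_batches(
    terms: List[str],
    run_query: Callable[[List[str]], List[Hit]],
    size: int = MAX_TERMS_PER_QUERY,
) -> List[Hit]:
    """Run one query per batch and merge the hits, keeping the first hit per '_id'."""
    merged: Dict[str, Hit] = {}
    for batch in chunk_terms(terms, size):
        for hit in run_query(batch):
            # A document can match terms in several batches; keep it only once.
            merged.setdefault(hit["_id"], hit)
    return list(merged.values())
```

Note that the merged list is in first-seen order, so any sorting or paging across batches would have to be redone on the client side.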

filtered:{query:null, filter:...} works? Interesting.

johari · January 22, 2016, 8:35pm

Thanks for the reply! Yeah, I ran this query on a fully populated backup instance yesterday and it worked and was really performant. It's good to get some further verification that this query pattern is legit before I fully invest in this strategy. It's crazy that it can perform a 1000-term OR statement so fast! ES is truly amazing.

jprante · January 22, 2016, 11:36pm

You can achieve high performance with a larger number of multi-term boolean filters/queries when certain conditions are met:

no scoring - skipping scoring and performing a constant score query with multiple filter terms saves expensive term weighting computations
no sorting - delivering results in the order in which they are found in the index is faster than reordering documents
ORing terms is faster than ANDing all terms, since not all given terms must be visited before results can be delivered
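To make the "no scoring, no sorting" conditions concrete, here is a small Python sketch that builds a request body wrapping the two filters from the original question in a `constant_score` query and leaving out the `sort` clause. The helper name `build_constant_score_query` is hypothetical, the field names are taken from the example document earlier in the thread, and the exact query DSL accepted may vary by Elasticsearch version, so treat this as an assumption to verify against your cluster rather than a definitive form.

```python
import json
from typing import Any, Dict, List


def build_constant_score_query(company_id: int, terms: List[str], size: int = 30) -> Dict[str, Any]:
    """Build a request body that filters without scoring and without sorting."""
    return {
        "size": size,
        "from": 0,
        "query": {
            "constant_score": {
                # Everything inside "filter" only includes or excludes documents;
                # no per-term relevance weights are computed.
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"company_id": company_id}},
                            # A terms clause matches documents containing ANY of the terms (OR).
                            {"terms": {"visibility_terms": terms}},
                        ]
                    }
                }
            }
        },
        # No "sort" key on purpose: hits are not reordered by a field value.
    }


if __name__ == "__main__":
    body = build_constant_score_query(435, ["word_1", "word_2"])
    print(json.dumps(body, indent=2))
```

If sorting is required, a `sort` clause can be added to the returned dict, at the cost described in the list above.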

johari · January 25, 2016, 9:03pm

Thanks for the extra tips! Knowing this will definitely have an effect on the way I implement the feature. Unfortunately, some of our use cases require sorting. In that case I assume ORing vs ANDing won't help since all of the terms must be visited in order to sort them. However, when I ran the test query it included sorting and was still pretty fast so hopefully this won't be an issue.
