Very Large Terms Query

I'm pretty new to Elasticsearch, but it seems one of its most obvious applications is filtering a vast number of large documents with a terms filter containing the few words you're interested in. However, I have the inverse use case, and I'm wondering whether ES is equally well suited to it. In my scenario, I have hundreds of millions of small documents, and I need a query that returns only documents containing at least one of 1000 words. For example, a typical indexed document might look like:
{
  company_id: 435,
  name: "Company X",
  …(other fields)…,
  visibility_terms: "word_1, word_2, word_3, word_4"
}

and an example query may look like:
{
  size:30,
  from:0,
  query: {
    filtered:{
      query:null,
      filter:{
        and:[{
          term:{
            company_id:435
          },
          terms:{
            visibility_terms:[
              ..(up to 1000 unique words)..
            ]
          }
        }]
      }
    }
  },
  sort: [{
    name:{
      order:"asc"
    }
  }]
}

Something about having such a large 'or statement' feels wrong and destined to cause performance problems. Has anyone issued large terms queries like this and can shed some light on whether it is a bad idea?

1000 unique terms are not a problem, especially not in a filter.

Lucene limits a boolean query to 1024 clauses by default. If you ever need more terms than that, you could also submit a series of queries and merge the resulting hits.
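
For what it's worth, the "series of queries" idea could be sketched roughly like this in Python. This is only an illustration under assumptions: `chunk_terms` and `search_in_batches` are hypothetical helper names, and `run_query` stands in for whatever client call you use to execute one terms query and return its hits as dicts with an `_id` key. Whether the 1024 limit actually applies to your query depends on how the terms are executed, so treat the batch size as a tunable.

```python
from typing import Callable, Dict, Iterable, List

# Lucene's default boolean clause limit; assumed here as the batch size.
MAX_TERMS_PER_QUERY = 1024


def chunk_terms(terms: Iterable[str], size: int = MAX_TERMS_PER_QUERY) -> List[List[str]]:
    """Deduplicate terms (keeping first-seen order) and split them into batches."""
    unique = list(dict.fromkeys(terms))
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def search_in_batches(
    terms: Iterable[str],
    run_query: Callable[[List[str]], List[Dict]],
    size: int = MAX_TERMS_PER_QUERY,
) -> List[Dict]:
    """Run one query per batch and merge the hits, dropping duplicate documents.

    run_query is a hypothetical callback: it receives one batch of terms and
    returns the matching hits, each a dict containing at least "_id".
    """
    merged: Dict[str, Dict] = {}
    for batch in chunk_terms(terms, size):
        for hit in run_query(batch):
            # A document can match terms from several batches; keep it once.
            merged.setdefault(hit["_id"], hit)
    return list(merged.values())
```

Note that merging on the client side loses the server-side sort order and paging, so the merged hits would have to be sorted and sliced again in your application.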

filtered:{query:null, filter:...} works? Interesting.

Thanks for the reply! Yeah, I ran this query on a fully populated backup instance yesterday; it worked and performed really well. It's good to get some further confirmation that this query pattern is legit before I fully invest in this strategy. It's crazy that it can evaluate a 1000-term OR statement so fast! ES is truly amazing :)

You can achieve high performance with a large number of multi-term boolean filters/queries when certain conditions are met:

  • no scoring - skipping scoring and running a constant score query with multiple filter terms saves expensive term weighting computations.
  • no sorting - delivering results in the index order in which they are found is faster than reordering documents.
  • ORing terms is faster than ANDing all terms, since not all given terms must be visited before results can be delivered.
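
As a rough illustration of the first two points, here is a minimal Python sketch that builds a request body with scoring disabled and results returned in index order. It assumes a more recent Elasticsearch version than the `filtered`/`and` syntax in the question, one where `constant_score` wraps a `bool` query with a `filter` clause and where sorting on `_doc` means index order. `build_visibility_query` is a hypothetical helper name; the field names are taken from the example document above.

```python
import json
from typing import Dict, Iterable


def build_visibility_query(company_id: int, terms: Iterable[str],
                           size: int = 30, offset: int = 0) -> Dict:
    """Build a search body that filters without scoring or re-sorting."""
    return {
        "size": size,
        "from": offset,
        "query": {
            # constant_score skips term weighting: every match gets the same score.
            "constant_score": {
                "filter": {
                    "bool": {
                        # Both clauses must match; the terms clause itself is an
                        # OR over the supplied words.
                        "filter": [
                            {"term": {"company_id": company_id}},
                            {"terms": {"visibility_terms": list(terms)}},
                        ]
                    }
                }
            }
        },
        # Assumption: "_doc" requests index order, which avoids reordering hits.
        "sort": ["_doc"],
    }


if __name__ == "__main__":
    body = build_visibility_query(435, ["word_1", "word_2", "word_3"])
    print(json.dumps(body, indent=2))
```

If you do need the results ordered by `name`, replacing the `sort` entry with the one from the original query keeps the no-scoring benefit while giving up the index-order one.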

Thanks for the extra tips! Knowing this will definitely affect the way I implement the feature. Unfortunately, some of our use cases require sorting. In that case I assume ORing vs ANDing won't help, since all of the terms must be visited before the results can be sorted. However, the test query I ran included sorting and was still pretty fast, so hopefully this won't be an issue.