Very Large Terms Query


#1

I'm pretty new, but it seems one of the most obvious applications of Elasticsearch is filtering a vast number of large documents with a terms filter containing the few words you're interested in. However, I have the inverse use case, and I'm wondering whether ES is as well suited to it. In my scenario, I have hundreds of millions of small documents, and I need a query that returns only documents containing at least one of 1000 words. For example, a typical indexed document might look like:
{
  company_id: 435,
  name: "Company X",
  …(other fields)…,
  visibility_terms: "word_1, word_2, word_3, word_4"
}

and an example query may look like:
{
  size: 30,
  from: 0,
  query: {
    filtered: {
      query: null,
      filter: {
        and: [
          {
            term: {
              company_id: 435
            }
          },
          {
            terms: {
              visibility_terms: [
                ..(up to 1000 unique words)..
              ]
            }
          }
        ]
      }
    }
  },
  sort: [{
    name: {
      order: "asc"
    }
  }]
}

Something about having such a large 'or statement' feels wrong and destined to cause performance problems. Has anyone who has issued large terms queries like this been able to tell whether it is a bad idea or not?


(Jörg Prante) #2

1000 unique terms are not a problem, especially not in a filter.

Lucene has a default limit of 1024 clauses in a boolean query. But you could also submit a series of queries and join the result hits.
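
A minimal sketch of the "series of queries" idea, in Python: split the term list into batches that stay within the limit, build one request body per batch, and merge the hits by document ID. The helper names (`chunk_terms`, `build_body`, `merge_hits`) and the batch size are illustrative assumptions, and the step that actually sends each body to Elasticsearch is left out. The request shape is modeled on the query in post #1, so it may need adapting to other Elasticsearch versions.

```python
from typing import Dict, Iterable, Iterator, List

MAX_TERMS = 1024  # assumed batch size, matching the limit mentioned above


def chunk_terms(terms: List[str], size: int = MAX_TERMS) -> Iterator[List[str]]:
    """Yield consecutive slices of `terms`, each at most `size` long."""
    for start in range(0, len(terms), size):
        yield terms[start:start + size]


def build_body(company_id: int, terms: List[str]) -> Dict:
    """Build one request body for a single batch of terms."""
    return {
        "query": {
            "filtered": {
                "filter": {
                    "and": [
                        {"term": {"company_id": company_id}},
                        {"terms": {"visibility_terms": terms}},
                    ]
                }
            }
        }
    }


def merge_hits(batches: Iterable[List[Dict]]) -> List[Dict]:
    """Join hits from several responses, keeping the first copy of each _id."""
    seen = set()
    merged = []
    for hits in batches:
        for hit in hits:
            if hit["_id"] not in seen:
                seen.add(hit["_id"])
                merged.append(hit)
    return merged


# Hypothetical input: 2500 placeholder words
words = [f"word_{i}" for i in range(2500)]
bodies = [build_body(435, batch) for batch in chunk_terms(words)]
print(len(bodies))  # 3
```

The de-duplication matters because a document can match terms in more than one batch. Any sorting or paging (`from`/`size`) would have to be redone on the client after merging.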

filtered:{query:null, filter:...} works? Interesting.


#3

Thanks for the reply! Yeah, I ran this query on a fully populated backup instance yesterday, and it worked and was really performant. It's good to get some further verification that this query pattern is legit before I fully invest in this strategy. It's crazy that it can evaluate a 1000-term 'or statement' so fast! ES is truly amazing :)


(Jörg Prante) #4

You can achieve high performance with a large number of multi-term boolean filters/queries when certain conditions are met:

  • no scoring - skipping scoring and using a constant score query with multiple filter terms saves expensive term-weighting computations.
  • no sorting - delivering results in the order in which they are found in the index is faster than reordering documents.
  • ORing terms is faster than ANDing all terms, since not all given terms must be visited before results can be delivered.
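
As a rough illustration of the first two points, a request body could wrap the filters in a constant score query and leave out the sort clause. This is a sketch in Python that only builds the body as a dictionary; the field names come from post #1, the function name `constant_score_body` is made up here, and the exact filter syntax depends on the Elasticsearch version.

```python
import json
from typing import Dict, List


def constant_score_body(company_id: int, terms: List[str], size: int = 30) -> Dict:
    """Build a request body that filters without scoring or explicit sorting."""
    return {
        "size": size,
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"company_id": company_id}},
                            # terms matches documents containing ANY listed word (OR)
                            {"terms": {"visibility_terms": terms}},
                        ]
                    }
                }
            }
        },
        # no "sort" key, so no explicit reordering is requested
    }


body = constant_score_body(435, ["word_1", "word_2", "word_3"])
print(json.dumps(body, indent=2))
```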

#5

Thanks for the extra tips! Knowing this will definitely affect the way I implement the feature. Unfortunately, some of our use cases require sorting. In that case, I assume ORing vs. ANDing won't help, since all of the terms must be visited before the results can be sorted. However, the test query I ran included sorting and was still pretty fast, so hopefully this won't be an issue.

