I'm pretty new to Elasticsearch, but it seems one of its most obvious applications is filtering a vast number of large documents with a terms filter containing the few words you're interested in. However, I have the inverse use case and I'm wondering if ES is as well suited to handle it. In my scenario, I have hundreds of millions of small documents, and I need a query that only returns documents containing at least one of up to 1000 words. For example, a typical indexed document might look like:
{
  "company_id": 435,
  "name": "Company X",
  ...(other fields)...,
  "visibility_terms": ["word_1", "word_2", "word_3", "word_4"]
}
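For context, the mapping I have in mind looks roughly like this (just a sketch; the index and type names are made up, and making both fields not_analyzed is my own choice so the terms filter matches whole words and the sort uses the full company name):

PUT /companies
{
  "mappings": {
    "company": {
      "properties": {
        "company_id": { "type": "integer" },
        "name": { "type": "string", "index": "not_analyzed" },
        "visibility_terms": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}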
and an example query may look like:
{
size:30,
from:0,
query: {
filtered:{
query:null,
filter:{
and:[{
term:{
company_id:435
},
terms:{
visibility_terms:[
..(up to 1000 unique words)..
]
}
}]
}
}
},
sort: [{
name:{
order:"asc"
}
}]
}
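I've also seen the same filter written with a bool filter instead of and, which supposedly lets ES combine cached bitsets more efficiently (a sketch of what I mean, same fields as above):

{
  "size": 30,
  "from": 0,
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "company_id": 435 } },
            { "terms": { "visibility_terms": [ ..(up to 1000 unique words).. ] } }
          ]
        }
      }
    }
  },
  "sort": [
    { "name": { "order": "asc" } }
  ]
}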
Something about having such a large 'or statement' feels wrong and destined to cause performance problems. Has anyone had experience issuing large terms queries like this that can shed some light on whether this is a bad idea or not?
Thanks for the reply! Yeah, I ran this query on a fully populated backup instance yesterday and it worked and was really performant. It's good to get some further verification that this query pattern is legit before I fully invest in this strategy. It's kind of crazy that it can evaluate a 1000-term OR so fast! ES is truly amazing.
Thanks for the extra tips! Knowing this will definitely affect the way I implement the feature. Unfortunately, some of our use cases require sorting. In that case I assume ORing vs. ANDing won't help, since every matching document has to be visited in order to sort the results. However, when I ran the test query it included sorting and was still pretty fast, so hopefully this won't be an issue.
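One thing I'm planning to try on the sorting side (just a sketch, not something I've verified on our data yet): enabling doc_values on the sort field in the mapping, so sorting reads column data from disk instead of loading fielddata onto the heap:

{
  "properties": {
    "name": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}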