I want to query my documents and find those whose array field is a subset of my array: [1, 2, 3, 5, 10]. In this case only document 1 should be matched.
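For illustration, documents along these lines (the field name values and the exact contents are made up, just to show the shape of the data):

```json
{ "id": 1, "values": [2, 3, 10] }
{ "id": 2, "values": [3, 4, 20] }
```

Document 1's array is fully contained in [1, 2, 3, 5, 10]; document 2's is not.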
I've seen some ideas/hacks posted in old threads from 2013, but I hope Elasticsearch has evolved since then and that I can solve this in a proper way. The arrays given can have 30-50 elements, so calculating all permutations won't work.
Thanks again for reading - I hope you have some input.
This is a thing that has come up a few times over the years. If you still have the references to those old threads I think it is worth opening an enhancement request. Maybe we just need some docs or a blog post about the options. Or maybe we need a new query.
I'd go with a script to reject the documents that have an entry not in the set. I expect you'll get a performance boost doing something sneaky like this:
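Totally untested and from memory, but roughly along these lines, assuming your numbers live in a field called values and a 2.x-style bool filter with an inline Groovy script (older versions want a filtered query with a script filter instead). The allowed values are passed as the keys of a params map with null values, so the check is a constant-time containsKey rather than a scan over a list:

```json
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "groovy",
            "inline": "for (v in doc['values'].values) { if (!allowed.containsKey(((long) v).toString())) { return false } }; return true",
            "params": {
              "allowed": { "1": null, "2": null, "3": null, "5": null, "10": null }
            }
          }
        }
      }
    }
  }
}
```

The cast to long plus toString is there so the doc values (which can come back as doubles) line up with the string keys of the params map.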
I honestly don't think I'm going to have time anytime in the near future to put together an example. But I can warn you that double might be difficult because of precision. Can you make it work on a subset of the data with strings and then see about other types?
Remember you are scripting in Groovy - a horrible, horrible, HORRIBLE language. Elasticsearch stores all (?) numbers as doubles, so if you write a script checking for integers in a double array, Groovy doesn't throw an exception - it doesn't care and will simply find nothing. Make sure you are searching for doubles in a double array.
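A plain Groovy sketch of the silent mismatch (nothing Elasticsearch-specific, just to show why nothing is found):

```groovy
def values = [1.0d, 2.0d, 3.0d]   // roughly what a double-mapped field's values look like in a script
assert !values.contains(1)        // Integer 1 never equals Double 1.0 - no error, just no match
assert values.contains(1.0d)      // searching with a double works
assert values.collect { it as long }.contains(1L)  // or normalize both sides to the same type
```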
In Elasticsearch, it is not very transparent what happens to numeric data in scripts. Internally, Groovy uses BigInteger or BigDecimal for very powerful (exact) numeric operations when dynamic typing is used for numbers (as in the script here). This does not match the Lucene data types (yet), since there is only double, float, and int. Elasticsearch has built-in auto-detection of field types; that is why you see double in Elasticsearch if you don't care about numeric types.
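A quick way to see what Groovy's dynamic typing does with plain number literals (standalone Groovy, not Elasticsearch):

```groovy
assert (1.5).getClass() == BigDecimal                   // undecorated decimal literals are BigDecimal, not double
assert (12345678901234567890).getClass() == BigInteger  // integer literals promote past Long to BigInteger
assert (1.5d).getClass() == Double                      // only an explicit suffix gives you a plain double
```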
You can avoid the situation by always using strict typing for your ES fields (declare them in the mapping beforehand) and, of course, by using integer/double types in the Groovy script. Or use the Lucene Expressions language, which has no dynamic typing.
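For example, a mapping along these lines (index, type, and field names are just placeholders) pins the field to a concrete numeric type instead of whatever auto-detection picks:

```json
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "values": { "type": "integer" }
      }
    }
  }
}
```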
Also, look out for that contains doing O(n) operations. I think that JSON comes through as a list by default and you'd have to use Groovy to turn it into a set. That is why I did that funky thing with the null parameter.
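If you do pass the allowed values as a plain JSON array instead, the conversion inside the script would look something like this (allowed_values is a hypothetical param name):

```groovy
// List#contains is O(n); normalize to Long and build a Set once, before the loop
def allowed = allowed_values.collect { it as long } as Set
for (v in doc['values'].values) {
  // normalize the doc value too, so Integer/Long/Double differences don't silently drop matches
  if (!allowed.contains(v as long)) {
    return false
  }
}
return true
```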
You likely could get better performance writing a Lucene query in a plugin, but if this gets the job done I'd be happy with it.