Subsets

Hi everyone,

Thanks for reading.

Here's my problem. Each of my documents has an array of numbers, like this:

Document 1: [1, 2, 3]
Document 2: [1, 4]
Document 3: [6,12,2]

I want to query my documents and find those whose array is a subset of my array: [1, 2, 3, 5, 10]. In this case only document 1 should match.
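To pin down the matching rule, here is a minimal Python sketch (plain Python, not Elasticsearch code) that applies the subset check to the three example documents. The names `documents` and `matching_documents` are made up for illustration.

```python
# Example arrays from the post, keyed by a hypothetical document name.
documents = {
    "Document 1": [1, 2, 3],
    "Document 2": [1, 4],
    "Document 3": [6, 12, 2],
}


def matching_documents(docs, query_values):
    """Return the names of documents whose array is a subset of query_values."""
    allowed = set(query_values)
    return [name for name, values in docs.items() if set(values) <= allowed]


print(matching_documents(documents, [1, 2, 3, 5, 10]))  # ['Document 1']
```

Document 2 is rejected because of the 4, and document 3 because of the 6 and 12.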

I've seen some ideas/hacks posted in old threads from 2013, but I hope Elasticsearch has evolved since then and I can solve this in a proper way. The arrays can have 30-50 elements, so enumerating every possible subset won't work.
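To put a number on why enumerating combinations is impractical: an array of n distinct values has 2^n subsets. A quick Python sketch of the arithmetic (the helper name `subset_count` is made up for illustration):

```python
def subset_count(n):
    """Number of distinct subsets (including the empty one) of n distinct values."""
    return 2 ** n


for n in (30, 50):
    print(n, subset_count(n))
# 30 1073741824
# 50 1125899906842624
```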

Thanks again for reading - I hope you have some input.

This is a thing that has come up a few times over the years. If you still have the references to those old threads I think it is worth opening an enhancement request. Maybe we just need some docs or a blog post about the options. Or maybe we need a new query.

I'd go with a script that rejects any document containing an entry that is not in the set. I expect you'll get a performance boost from doing something sneaky like this:

"script": {
  "script": "if (idSet == null) {idSet = ids.toSet()}; return doc.ids.every {idSet.contains(it)}",
  "params": {
    "ids": [1, 2, 3, 5, 10],
    "idSet": null
  }
}

Thank you, Nik90000

I'll see if I can dig up the threads I found on the subject and test out your script tonight!

Hi Nik,

I can't seem to make it work. Can you show it as an example? I can show you my document structure (sorry not sure how I can make it pretty):

{
  "xxx" : {
    "mappings" : {
      "node" : {
        "properties" : {
          "id" : {
            "type" : "double"
          },
          "searchTags" : {
            "type" : "nested",
            "properties" : {
              "id" : {
                "type" : "double"
              },
              "packages" : {
                "type" : "double"
              },
              "valueString" : {
                "type" : "string"
              }
            }
          }
        }
      }
    }
  }
}

The nested "searchTags" objects are the ones that contain the array called "packages".

Hope you can help!

I honestly don't think I'm going to have time anytime in the near future to put together an example. But I can warn you that double might be difficult because of precision. Can you make it work on a subset of the data with strings and then see about other types?
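As a rough illustration of the precision concern, here is a small Python sketch (not Elasticsearch code) of exact membership tests on floating-point values. The names `is_member` and `is_member_with_tolerance` are hypothetical, and whether this bites in practice depends on how the values were produced: whole-number doubles such as 1.0 or 2.0 are represented exactly, but values that come out of arithmetic may not be.

```python
def is_member(value, allowed_values):
    """Exact membership test, as a set lookup would do it."""
    return value in allowed_values


def is_member_with_tolerance(value, allowed_values, tol=1e-9):
    """Membership test that accepts values within tol of an allowed value."""
    return any(abs(value - allowed) <= tol for allowed in allowed_values)


allowed = {0.3, 1.0, 2.0}
computed = 0.1 + 0.2  # not exactly 0.3 in binary floating point

print(is_member(1.0, allowed))                      # True
print(is_member(computed, allowed))                 # False
print(is_member_with_tolerance(computed, allowed))  # True
```

Strings sidestep the issue because they compare exactly, which is one reason to try them first.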

Hi again nik!

It's fine, I got it to work.

For those who read this in the future:

Use nik's idea. Here it is in NEST:

new FilteredQuery()
{
    Query = Query<ESDataPoint>.Filtered(ffff => ffff
        .Filter(d => d
            .Script(qwea => qwea
                .Script("return doc['searchTags.packages'].every { ids.contains(it) }")
                .Params(paramss => paramss
                    .Add("ids", new HashSet<double> { 1.0, 2.0, 3.0 })
                )
            )
        )
    )
}

Remember you are scripting in Groovy - a horrible, horrible, HORRIBLE language. Elasticsearch stores all (?) numbers as doubles, so if your script checks for integers in a double array, Groovy doesn't throw an exception - it doesn't care and simply returns no matches. Make sure you are searching for doubles in a double array.

Groovy is the best.

In Elasticsearch, it is not very transparent what happens with numeric data in scripts. Internally, Groovy uses BigInteger or BigDecimal for very powerful (exact) numeric operations when numbers are dynamically typed (as in the script here). This does not match the Lucene data types (yet), since Lucene only has int, long, float, and double. Elasticsearch also has built-in auto-detection of field types. That is why you see double in Elasticsearch if you don't declare your numeric types yourself.

You can avoid the situation by always using strict typing for your ES fields: declare them beforehand and, of course, use integer/double types in the Groovy script. Or use the Lucene expression language, which has no dynamic typing.
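One way to keep the types consistent on the client side is to convert every query value to the field's type before it is serialized. The Python sketch below assumes a double field and builds a filter body shaped like the script snippet earlier in this thread. The helper name `build_subset_script_filter` is made up, and the body shape is copied from the thread rather than checked against any particular Elasticsearch version.

```python
import json


def build_subset_script_filter(field, ids):
    """Build a script filter body whose "ids" parameter contains only floats.

    Assumption: the target field is mapped as double, so each value is
    converted with float() and serializes as 1.0 rather than 1.
    """
    return {
        "script": {
            "script": "return doc['%s'].every { ids.contains(it) }" % field,
            "params": {"ids": [float(value) for value in ids]},
        }
    }


body = build_subset_script_filter("searchTags.packages", [1, 2, 3])
print(json.dumps(body["script"]["params"]))  # {"ids": [1.0, 2.0, 3.0]}
```

For an integer field the same idea applies with int() instead of float().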

Also, watch out for that contains call being an O(n) operation. I think that JSON
arrays come through as a list by default, and you'd have to use Groovy to turn the
list into a set. That is why I did that funky thing with the null parameter.
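The idea behind the null parameter can be sketched in Python as a lazily built set: the list is converted once on first use, and every later check is a constant-time lookup instead of a scan of the list. The class name `SubsetChecker` is made up for illustration, and whether Elasticsearch really keeps the parameter alive between documents is an assumption of the original trick, not something this sketch verifies.

```python
class SubsetChecker:
    """Checks whether a document's values all belong to a fixed list of ids."""

    def __init__(self, ids):
        self.ids = ids       # arrives as a list, like a JSON array parameter
        self.id_set = None   # built lazily, mirroring the "idSet": null parameter

    def matches(self, doc_values):
        # Convert the list to a set only once, on the first call.
        if self.id_set is None:
            self.id_set = set(self.ids)
        # Each membership test is now O(1) on average instead of O(n).
        return all(value in self.id_set for value in doc_values)


checker = SubsetChecker([1, 2, 3, 5, 10])
print(checker.matches([1, 2, 3]))   # True
print(checker.matches([1, 4]))      # False
print(checker.matches([6, 12, 2]))  # False
```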

You could likely get better performance by writing a Lucene query in a plugin,
but if this gets the job done I'd be happy with it.