Matching by array elements


(Alexey Danilov) #1

I have to construct quite a non-trivial (as it seems to be now) query in Elasticsearch.
Suppose I have a couple of entities, each with an array element, consisting of strings:

1). ['A', 'B']
2). ['A', 'C']
3). ['A', 'E']
4). ['A']

Mappings for array element is as follows (using dynamic templates):

{
"my_array_of_strings": {
"path_match": "stringArray*",
"mapping": {
"type": "string",
"index": "not_analyzed"
}}}

Json representation of entity looks like this:

{
"stringArray":
["A", "B"]
}

Then I have user input:
['A', 'B', 'C'].

What I want to achieve is to find entities which contain only elements specified in input - expected results are:
['A', 'B'], ['A', 'C'], ['A'] but NOT ['A', 'E'] (because 'E' is not present in user input).

Can this scenario be implemented with Elasticsearch?


Querying documents from element in Array
Right tool for the job?
(Nik Everett) #2

There isn't an efficient way to do this but I can think of two OK ways to do it and I'm not sure which one is best:

  1. Index the documents with the string array just like that and search using a bool query (bool filter in 1.x) with each term in the OR clause, probably wrapped in a constant_score query. Then you use a script to recheck all the found documents for your restriction.

  2. Index the documents with an extra field that smooshes together the string array. So it'd be {"stringArray": ["A", "B"], "smooshedStringArray": "A_B"}. Then use an appropriate bool query on those. So if the user input is ["A', "B", "C"] then the query is:

"bool": {
  "should": [
    { "term": {"smooshedStringArray": "A"} },
    { "term": {"smooshedStringArray": "B"} },
    { "term": {"smooshedStringArray": "C"} },
    { "term": {"smooshedStringArray": "A_B"} },
    { "term": {"smooshedStringArray": "A_C"} },
    { "term": {"smooshedStringArray": "B_C"} },
    { "term": {"smooshedStringArray": "A_B_C"} }
  ]
}

The first way is going to store less data and be fast if your query is more selective without the script. The second way is going to be faster if some of the terms are really common but it won't work at all on large arrays.

You might be able to think of a way to combine them if you think harder than I have about the problem.


Greater than based on value from another field
(Alexey Danilov) #3

Thanks a lot for you suggestion!

In my case arrays could consist of up to 40 elements, so option 2). is not really viable. However, 1) - with scripting - might be fine.

However, apart from the solution with using the scripts - which should work
nicely, but will most likely slow down the query considerably in case
when there are many records that match - I have devised another one.
Below I will try to explain its main idea, without code implementation. Any comments are highly appreciated :slight_smile:

One considerable condition that I failed to mention (and which might
have given other users valuable hint) is that arrays consist of
enumerated elements, i.e. there are finite number of such elements in
array. This allows to flatten such array into separate field of an
entity.

Lets say there are 5 possible values: 'A', 'B', 'C', 'D', 'E'. Each
of these values is a boolean field - true if it is empty (i.e. array
version would contain this element ) and false otherwise.
Then each of the entities could be rewritten as follows:

1).
A = true
B = true
C = false
D = false
E = false

2).
A = true
B = false
C = true
D = false
E = false

3).
A = true
B = false
C = false
D = false
E = true

4).
A = true
B = false
C = false
D = false
E = false

With the user input of ['A', 'B', 'C'] all I would need to do is:
a) take all possible values (['A', 'B', 'C', 'D', 'E']) and subtract from them user input -> result will be ['D', 'E'];
b) find records where each of resulting elements is false, i.e. 'D = false AND E = false'.

This would give records 1, 2 and 4, as expected. I am still
experimenting with the code implementation of this approach, but so far
it looks quite promising. It has yet to be tested, but I think this
might perform faster, and be less resource demanding, than using scripts
in query.

To optimize this a little bit further, it might be possible not to
provide fields which will be 'false' at all, and modify the previous
query to 'D = not exists AND E = not exists' - result should be the
same.


(Nik Everett) #4

This works much better! I hadn't thought of this idea but I like it. So long as the total number of possible tokens stays pretty low it should work well. I don't think you need to flatten the document into true/false fields. The array should still work, you just need to use a bool query with must_not for the ones you don't want and must for the ones you do. You should also look at the terms query which finds documents with any of the terms you feed it. It has an optimization that works well for large numbers of terms and it just becomes a bool query with should clauses for small numbers of terms.


(Alexey Danilov) #5

Hmm, I am not sure I get the idea completely - could you please elaborate a little bit (or, ideally, provide a brief example)?

The thing is, I have no means of defining which terms in the input array are needed (must) and which are not (must_not). That is, all input terms are pretty equal, and cannot really be categorized into 'must' and 'must_not's. At least, I have not thought of a possible way to do this.

Btw, I have posted this question on stackoveflow as well and one of the answers I got was also suggesting using scripts. I am quite skeptical with the scripts though, and believe that as a general rule they should be used only as a last resort. Do you think that the solution I provided is better that using scripts - in terms of performance?

The script I was suggested to give a try is:

 {
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "name": [
                  "A",
                  "B",
                  "C"
                ]
              }
            },
            {
              "script": {
                "script": " list_to_check = ['A', 'B', 'C']; do_not_return = true; for(element in doc['name'].values) { if(!list_to_check.contains(element)) {do_not_return = false;}};return do_not_return;"
              }
            }
          ]
        }
      }
    }
  }
}

(Nik Everett) #6

You approach will scale better. Scripts are useful but they have to visit all the documents and load data from doc values (think column store) and that takes time. Queries get to use the index.

I'm going to do this off the cuff without trying it so it might not work but you should get the idea. Say you have a document like: { "array": ["A", "B", "C"] } and the user sent ["A", "B", "C"]. You'd spit out a query like:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [ { "terms": { "array": [ "A", "B", "C" ] } } ],
          "must_not": [ { "terms": { "array": [ "D", "E", "F", "G", "H", "I", "J", "K", "L" ] } } ]
        }
      }
    }
  }
}

(Alexey Danilov) #7

Thanks, it is more clear now.

Am I right in understanding that "must_not" filter combined with array means that "none of the elements of the array should be present" in the document array, while "must" together with array translates to "any of the elements of the array should be present"? Somehow I could not find explanation of "must_not" filter in combination with arrays in documentation.

There is one other requirement which became clear recently - empty array in document should also be a match for any of the user input. Guess this means that "must" clause cannot be used anymore: "must": [ { "terms": { "array": [ "A", "B", "C" ] } } ] - but "must_not" alone should be quite sufficient, I guess. It will just have to operate on a larger document set.

Unless there is a way to match an empty array.

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "missing": {"field": "array"} },
            { "terms": { "array": [ "A", "B", "C" ] } }
           ],
          "must_not": [ { "terms": { "array": [ "D", "E", "F", "G", "H", "I", "J", "K", "L" ] } } ]
        }
      }
    }
  }
}

Search for something that must not contain xxx or xxx or xxx
(Alexey Danilov) #8

Now I have a somewhat similar case, but the solution above does not fit.

Say I have two entities with a string field:

  1. "very long tail"
  2. "not really long tail"

Then I have a search phrase "find me anything with a very long tail". What I want to achieve, is to find an entity which in its string field has only words specified in search query. So running this query has to return entity 1, but not 2.

I do not have any other clues besides search phrase - i.e. neither information how many words should match, nor which words should be a match (obviously). Words are not enumerated - they can be anything, so the trick I employed for the previous use case won't do here.

One solution that comes to mind is to use scripts for this: get contents of string field in each entity, and then manually match them to the search phrase. However, this solution will be too slow for the needs of my application.


(system) #9