Filter by array? Sort by number of items in array


(Andrey Kaprov) #1

Hi all! I've run into a problem:
When I query Elasticsearch with filters, I need to sort the resulting documents by how many of the queried filter values each one matches.
For example, I have some documents with an array field "category":

  1. [1,2,4]
  2. [1,2,5]
  3. [1]

When I filter by category 1, I expect the documents in this order: 3, 1, 2
When I filter by categories 1 and 2, I expect: 1, 2, 3 (document 3 matches only category 1)


(Zachary Tong) #2

The filtering/matching part should be taken care of by regular filtering, but for the sorting behavior I think you may have to resort to a scripted sort: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_script_based_sorting

The behavior is relatively complicated: sort by number of matching categories, and break ties by sorting category ascending (presumably that's why the order is 3,1,2 on the first example). I don't think anything in ES can do that for you natively, but you should be able to work up the correct logic in a script and use that.
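
To make the intended ordering concrete, here is a minimal plain-Python sketch of that rule (not Painless, and not run against a cluster). The `DOCS` mapping and the `rank` helper are made up for illustration; they just mirror the three sample documents from post #1.

```python
# Hypothetical sample data: document number -> "category" array from post #1.
DOCS = {1: [1, 2, 4], 2: [1, 2, 5], 3: [1]}


def rank(docs, wanted):
    """Return the ids of matching docs, most matched categories first,
    with ties broken by the category array in ascending order."""
    wanted = set(wanted)
    # Filtering step: keep documents that share at least one category.
    matching = {i: cats for i, cats in docs.items() if wanted & set(cats)}

    def key(doc_id):
        cats = matching[doc_id]
        matched = len(wanted & set(cats))
        # Negative count so that more matches sort first.
        return (-matched, sorted(cats))

    return sorted(matching, key=key)


print(rank(DOCS, [1]))     # [3, 1, 2]
print(rank(DOCS, [1, 2]))  # [1, 2, 3]
```

Both results agree with the orders expected in post #1, so this looks like a reasonable target for the sort script to reproduce.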


(Andrey Kaprov) #3

Thank you for the reply! I think script-based sorting would be a good choice.
I have already thought about the algorithm; it could be something like this:
ABS(doc.categories.length - request.categories.intersect(doc.categories).length).
But there are a couple of concerns:

  1. Sort scripts will be executed for every filtered document in the search index (2 million and above)
  2. ES has a limit on dynamic script compilations (15/min by default), and I don't know how often this sorted data will be queried
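
A quick way to sanity-check the formula is to evaluate it in plain Python against the three sample documents. This is only a sketch: `DOCS`, `formula`, `matched_then_extra`, and `order` are illustrative names, and ties are assumed to keep the original document order. If I read the formula right, on its own it seems to give 3, 1, 2 for both queries, so the number of matched categories probably has to be the primary sort key, with the formula as a tie-breaker.

```python
# Hypothetical sample data: document number -> "category" array.
DOCS = {1: [1, 2, 4], 2: [1, 2, 5], 3: [1]}


def formula(doc_cats, requested):
    """ABS(doc length - size of the intersection), as written above."""
    matched = len(set(requested) & set(doc_cats))
    return abs(len(doc_cats) - matched)


def matched_then_extra(doc_cats, requested):
    """Assumed variant: matched count first (descending), then the formula."""
    matched = len(set(requested) & set(doc_cats))
    return (-matched, len(doc_cats) - matched)


def order(docs, requested, key):
    # sorted() is stable, so ties keep the original document order.
    return sorted(docs, key=lambda i: key(docs[i], requested))


print(order(DOCS, [1], formula))                # [3, 1, 2]
print(order(DOCS, [1, 2], formula))             # [3, 1, 2]
print(order(DOCS, [1], matched_then_extra))     # [3, 1, 2]
print(order(DOCS, [1, 2], matched_then_extra))  # [1, 2, 3]
```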

(Zachary Tong) #4

The script will be invoked for every matching document (i.e. the docs that match the query and filters). So if all 2m docs match, then yes... the script will be invoked 2m times. Painless is pretty fast so I wouldn't be toooo concerned, but it is something to keep in mind. It's the price that has to be paid for extreme flexibility unfortunately.

The compilation limit only applies to unique compilations. If you parameterize the script correctly so that all dynamically changing parts are provided in the params object, the script will only be compiled once. You can invoke it as many times as you like... it's just the compilation that is rate-limited.
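
As a rough illustration of what "parameterized" means here, the sketch below builds a search body in Python where the script source is a constant and only `params` changes between requests. The Painless source is an untested assumption, as are the `category` field name, the `wanted` param name, and the `build_search_body` helper; it only counts matches and leaves out any tie-breaking.

```python
import json

# Assumed Painless source (untested): counts how many of the requested
# categories appear in the document. It never changes between requests.
SCRIPT_SOURCE = (
    "int n = 0; "
    "for (def c : doc['category']) { "
    "for (def w : params.wanted) { if (c == w) { n++; } } "
    "} "
    "return n;"
)


def build_search_body(wanted):
    """Build a search body that filters by category and sorts by match count."""
    return {
        "query": {"bool": {"filter": [{"terms": {"category": wanted}}]}},
        "sort": [
            {
                "_script": {
                    "type": "number",
                    "order": "desc",
                    "script": {
                        "lang": "painless",
                        "source": SCRIPT_SOURCE,
                        # Only this part varies from request to request.
                        "params": {"wanted": wanted},
                    },
                }
            }
        ],
    }


print(json.dumps(build_search_body([1, 2]), indent=2))
```

If the category values were instead concatenated into the source string, every distinct category list would count as a new compilation.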

There may be a way to approximate your algorithm if you index the size of the category array alongside the array itself. Then you could try to work out a sort value from the query score and the size. Not sure, haven't thought about it too deeply but thought I'd mention. I still think script is likely your best option :)
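
One possible shape for that approximation, sketched in Python and not tested against a cluster: give each requested category its own `constant_score` clause under `should`, so that `_score` should end up as the number of matched categories, then break ties on the indexed array size. The `category_count` field and the `build_score_sort_body` helper are hypothetical; the field would have to be populated at index time.

```python
def build_score_sort_body(wanted):
    """Build a search body that sorts by match count via _score, then by size."""
    return {
        "query": {
            "bool": {
                # The filter decides which documents match; it adds no score.
                "filter": [{"terms": {"category": wanted}}],
                # Each matching clause is meant to add exactly 1.0 to _score.
                "should": [
                    {
                        "constant_score": {
                            "filter": {"term": {"category": c}},
                            "boost": 1.0,
                        }
                    }
                    for c in wanted
                ],
            }
        },
        "sort": [
            {"_score": {"order": "desc"}},
            # Hypothetical field holding the length of the category array.
            {"category_count": {"order": "asc"}},
        ],
    }


body = build_score_sort_body([1, 2])
print(len(body["query"]["bool"]["should"]))  # 2
```

This avoids running a script per document, at the cost of maintaining the extra field.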

