Match all terms in document array


#1

Is there a way to match all terms in a document array? For example, if my search array is ["1","2","3","4","5"]
and my documents have fields like

"arr":["1","3","5"]
"arr":["1","2","7","9"]
"arr":["1","8"]

Then only the first one should be a match, because all the values in the document are present in the search array. I tried to use the script filter (to get the length of the array) and tried using the minimum_should_match parameter, but I can't get it to work. How do I use a variable created by a script as a parameter for minimum_should_match?


(Igor Motov) #2

How many different terms do you expect to see in your index? In your example the list of terms is 1, 2, 3, 4, 5, 7, 8, 9. In your real application, does this list grow over time? Do you have control over the list? How big will it get? How many terms are you going to have in a typical query? Should a document with an empty array match all queries, or does at least one match have to occur?


#3

Thanks for the reply. My search array will contain anywhere between 10 and 100 values (typically 30 to 40). These values will change over time. It's like a list of things a user likes. It will be different for each user. A document with an empty array should match all queries but there probably won't be any documents with empty arrays so it's not that important.


(Igor Motov) #4

I don't think there is really a simple and efficient way to solve the problem the way you have stated it. The basic data structure behind the search engine is the inverted index, which is basically a map from tokens to the documents that contain these tokens. So it's easy to find all documents that contain a certain token, and it's easy to find documents that don't contain a certain token, but it's difficult to find all documents that contain only a subset of tokens. And you cannot change minimum should match on a per-record basis.
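
To make the point about the inverted index concrete, here is a minimal sketch in plain Python (not Elasticsearch code). The names `build_inverted_index`, `contains_1`, `lacks_7` and `subset_matches` are made up for illustration, and the documents are the three from the question. It shows that "contains" and "does not contain" are single lookups, while the subset condition needs every candidate's full token list.

```python
from collections import defaultdict

# Toy documents from the question (doc id -> values of "arr").
docs = {
    1: ["1", "3", "5"],
    2: ["1", "2", "7", "9"],
    3: ["1", "8"],
}


def build_inverted_index(docs):
    """Map each token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


index = build_inverted_index(docs)

# Easy: documents that contain a token are one lookup.
contains_1 = index.get("1", set())

# Easy: documents that lack a token are "everything minus that lookup".
lacks_7 = set(docs) - index.get("7", set())

# Hard: "every token of the document is in the search list" cannot be read
# off the token -> documents map alone; each document's own token list
# has to be inspected.
search = {"1", "2", "3", "4", "5"}
subset_matches = {
    doc_id for doc_id, tokens in docs.items() if set(tokens) <= search
}
```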

I see 4 main approaches that you can use to solve this problem, but they all have certain drawbacks.

  1. Script-based approach. Your search will look for documents that contain any of the terms in your query and then use a script filter to throw away all documents that contain any term that is not in the list of desired terms. You can also combine this query with a query for an empty arr field.

  2. Searching for a complementary set. You search for all documents that don't contain any tokens that are not in your list. Basically, in your example your search would be -arr:6 -arr:7 -arr:8 -arr:9 -arr:10, assuming that you only have 10 different terms in your index.

  3. Combinatoric expansion of the query. Basically, you index arr as a single token of sorted values (the first record would be stored as 1-3-5). And then you expand your query into all possible subsets: 1, 2, 3, 4, 5, 1-2, 1-3, 1-4, 1-5, 2-3, 2-4, 2-5, 3-4, 3-5, 4-5, 1-2-3, 1-2-4, 1-2-5, 1-3-4, 1-3-5, 1-4-5, 2-3-4, 2-3-5, 2-4-5, 3-4-5, 1-2-3-4, 1-2-3-5, 1-2-4-5, 1-3-4-5, 2-3-4-5, 1-2-3-4-5. This is the basic idea, but there are several optimizations that can be applied to reduce the huge combinatoric expansion.

  4. Using percolator. Basically your arrays become search queries and you convert your search query into a document that "searches" the queries.
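
Approaches 2 and 3 can be simulated in a few lines of plain Python to see how they behave on the example data. This is only a sketch of the logic, not Elasticsearch queries: the function names (`complement_match`, `subset_token`, `expand_query`, `expansion_match`) are hypothetical, and the 10-term vocabulary is the same assumption made in approach 2.

```python
from itertools import combinations

# Toy documents from the question (doc id -> values of "arr").
docs = {
    1: ["1", "3", "5"],
    2: ["1", "2", "7", "9"],
    3: ["1", "8"],
}
search = ["1", "2", "3", "4", "5"]

# Assumption: the index only ever holds the terms "1" through "10".
vocabulary = {str(n) for n in range(1, 11)}


def complement_match(docs, search, vocabulary):
    """Approach 2: drop every document holding a term outside the search list."""
    forbidden = set(vocabulary) - set(search)
    return {doc_id for doc_id, arr in docs.items() if not forbidden & set(arr)}


def subset_token(values):
    """Approach 3, index side: one token made of the sorted values."""
    return "-".join(sorted(set(values)))


def expand_query(search):
    """Approach 3, query side: one token per non-empty subset of the search list."""
    values = sorted(set(search))
    return {
        "-".join(combo)
        for size in range(1, len(values) + 1)
        for combo in combinations(values, size)
    }


def expansion_match(docs, search):
    """Match documents whose single token is one of the expanded subsets."""
    expanded = expand_query(search)
    return {doc_id for doc_id, arr in docs.items() if subset_token(arr) in expanded}


complement_result = complement_match(docs, search, vocabulary)
expansion_result = expansion_match(docs, search)
expansion_size = len(expand_query(search))
```

In this sketch a document with an empty array passes the complement check but not the expansion check, since the empty subset is not generated. The expansion has 2^n - 1 entries for n search values: 31 for the five values here, and over a billion for a 30-value list, which is why the optimizations mentioned in approach 3 matter.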

Can you give more information about your problem and what you are actually trying to achieve? There might be other, simpler approaches that could result in an even better user experience.


(Nik Everett) #5

Could you write a query that searched the terms, walked the ORed list,
loaded a count of the number of terms in the array (either a separate field
or some trick with doc value cardinality), and checked that that number of
the filters came back true?

It wouldn't be efficient but probably would be better than scripts. Just
harder.
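
A rough simulation of that counting idea in plain Python, assuming the number of distinct terms per document is stored at index time (the "separate field" variant). `index_docs` and `count_match` are hypothetical names, and this is a sketch of the logic rather than an actual Elasticsearch query.

```python
from collections import Counter, defaultdict

# Toy documents from the question (doc id -> values of "arr").
docs = {
    1: ["1", "3", "5"],
    2: ["1", "2", "7", "9"],
    3: ["1", "8"],
}


def index_docs(docs):
    """Build posting lists plus a per-document count of distinct terms."""
    postings = defaultdict(set)
    term_counts = {}
    for doc_id, arr in docs.items():
        distinct = set(arr)
        term_counts[doc_id] = len(distinct)  # the "separate field"
        for term in distinct:
            postings[term].add(doc_id)
    return postings, term_counts


def count_match(postings, term_counts, search):
    """Walk the ORed terms, count hits per document, compare with the stored count."""
    hits = Counter()
    for term in set(search):
        for doc_id in postings.get(term, ()):
            hits[doc_id] += 1
    # A document matches only if every one of its terms was hit.
    return {doc_id for doc_id, n in hits.items() if n == term_counts[doc_id]}


postings, term_counts = index_docs(docs)
result = count_match(postings, term_counts, ["1", "2", "3", "4", "5"])
```

Documents with an empty array never show up in any posting list, so in this sketch they would need a separate clause if they are supposed to match.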

