I have a list of must and must_not items that I currently combine into one giant query, and I want to know whether this is the best way to approach the problem.
I have 470 must items and 485 must_not items that act as whitelist/blacklist rules for the data. The analytic is built in Spark and the data is housed in Elasticsearch. The query I pass to Spark contains one of the must items followed by all 485 must_not items.
Example of the query:
{
  "query": {
    "bool": {
      "must": { "match": { "tag": "apple" } },
      "must_not": [
        { "match": { "city": "new york" } },
        { "match": { "name": "pizza" } },
        ...
      ]
    }
  }
}
As you can guess, the query itself is rather large and takes around 2 seconds to return results. I submit one such query for each of the must items, so 470 queries in total. The application currently takes around 22 minutes to complete.
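To make the shape of the loop concrete, here is a simplified sketch of how the per-tag query bodies get built (the function name and the placeholder lists are illustrative, not the exact code; the real lists hold 470 tags and 485 rules):

```python
# Illustrative sketch: build one bool query per whitelist tag, each carrying
# the full blacklist. One query per tag means 470 separate round trips.
def build_query(must_tag, must_not_rules):
    """Build the bool query for one must tag against all must_not rules."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"tag": must_tag}},
                "must_not": [
                    {"match": {field: value}} for field, value in must_not_rules
                ],
            }
        }
    }

must_tags = ["apple"]  # illustrative; 470 tags in reality
must_not_rules = [("city", "new york"), ("name", "pizza")]  # 485 rules in reality

queries = [build_query(tag, must_not_rules) for tag in must_tags]
```

Each element of `queries` is what gets passed to Spark as the Elasticsearch query for that tag.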
My question: is this the best way to tackle this problem, or is there a way to make it faster? And is this even a good fit for Elasticsearch at all, given the gigantic, complex query?
I previously attempted to perform Spark joins after passing a single query with just the must_not items, but that takes far longer than the 470 individual Elasticsearch queries. I used a broadcast hash join because the must data is smaller than the resultant data frame.
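For reference, the join-based alternative boils down to the following set logic, sketched here in plain Python rather than Spark (in the actual job the small must list is the broadcast side of a `df.join(broadcast(must_df), "tag")`):

```python
# Plain-Python sketch of the broadcast-join alternative (no Spark here, just
# the logic). The small must list plays the role of the broadcast side that
# Spark would ship to every executor for a broadcast hash join.
def broadcast_join(rows, must_tags):
    """Keep only rows whose tag appears in the (small, broadcast) must list."""
    must_set = set(must_tags)  # broadcast side: small enough to fit in memory
    return [row for row in rows if row["tag"] in must_set]

# Illustrative rows, standing in for the data frame returned by the
# must_not-only Elasticsearch query:
rows = [
    {"tag": "apple", "city": "boston"},
    {"tag": "pear", "city": "chicago"},
]
kept = broadcast_join(rows, ["apple"])  # only the "apple" row survives
```

The join itself is cheap once the data is in Spark; the cost in my case comes from the much larger result set the must_not-only query pulls back first.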
Thank you for the help.