Giant Elasticsearch query


(Mary) #1

I have a list of must and must_not items that I currently put into one giant query, and I want to know whether this is the best way to approach the problem.

Example of the query:

I have 470 must items and 485 must_not items that act as whitelist/blacklist rules for the data. The analytic is built in Spark and the data is housed in Elasticsearch. The query I pass to Spark contains one of the must items followed by all 485 must_not items.

{"query":{ "bool" : { "must" : {"match" : {"tag":"apple"}}, "must_not": [{ "match": { "city": "new york" }},{ "match": { "name": "pizza" }},........... ]}}}
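A query of this shape can be generated from the two rule lists rather than written by hand. The following is a minimal Python sketch, assuming the rules are available as (field, value) pairs. The helper name build_query and the sample rules are hypothetical and not taken from the actual application.

```python
import json

def build_query(must_rule, must_not_rules):
    """Build one bool query: a single must clause plus every must_not clause."""
    field, value = must_rule
    return {
        "query": {
            "bool": {
                "must": {"match": {field: value}},
                "must_not": [{"match": {f: v}} for f, v in must_not_rules],
            }
        }
    }

# Hypothetical sample rules; the real lists hold 470 and 485 entries.
must_rules = [("tag", "apple")]
must_not_rules = [("city", "new york"), ("name", "pizza")]

# Serialize to the JSON string form shown above.
query_json = json.dumps(build_query(must_rules[0], must_not_rules))
print(query_json)
```

The resulting string has the same structure as the query above and could be handed to the Spark connector, for example through its es.query setting.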

As you can guess, the query itself is rather large and takes around 2 seconds to return results. I submit this type of query once for each must item, so 470 queries in total (roughly 16 minutes of query time alone at 2 seconds each). The application currently takes around 22 minutes to complete.

My question: is this the best way to tackle this problem, or is there a way to make it faster? And is this even a good problem for Elasticsearch at all, given the gigantic, complex query?

I previously attempted to perform Spark joins on the data after passing a query with only the must_not items, but that takes far longer than the 470 individual Elasticsearch queries. I used a broadcast hash join because the must data is smaller than the resulting data frame.

Thank you for the help.


(Mary) #2

I decided to combine it all into one large query, which drastically cut the run time because the Spark-to-ES overhead didn't happen 470 times.
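The combined query itself is not shown in this thread. One possible shape, stated here purely as an assumption, is to place all must items in a should list with minimum_should_match set to 1, so that a document matches when it satisfies at least one whitelist rule and none of the blacklist rules. A minimal Python sketch follows. build_combined_query and the sample rules are hypothetical.

```python
import json

def build_combined_query(must_rules, must_not_rules):
    """Build a single bool query covering every whitelist and blacklist rule.

    Assumption: "combined" means any one whitelist rule may match,
    expressed as should clauses with minimum_should_match = 1.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {f: v}} for f, v in must_rules],
                "minimum_should_match": 1,
                "must_not": [{"match": {f: v}} for f, v in must_not_rules],
            }
        }
    }

# Hypothetical sample rules; the real lists hold 470 and 485 entries.
must_rules = [("tag", "apple"), ("tag", "pear")]
must_not_rules = [("city", "new york"), ("name", "pizza")]

combined_json = json.dumps(build_combined_query(must_rules, must_not_rules))
print(combined_json)
```

A single query like this returns the union of all whitelist matches. If the analytic needs to know which must item each document matched, that would have to be recovered afterwards, for example from the returned field values.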

