I want to scan documents that match a list of queries, using Spark with Elasticsearch. I know that Elasticsearch supports multiple queries in one request through the _msearch endpoint, but I don't know how to set up this type of query with the Spark integration (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html).
My second question: can you give some advice on choosing the number of queries in a multi-search request? (I am only concerned with the cases where my query type is search or scan.)
What you need to know is that when you use the Spark connector with Elasticsearch, you are actually moving the data from the Elasticsearch cluster into Spark itself. You can then apply whatever domain-specific transformation or computation you want, but you are outside the scope of Elasticsearch.
Thus, you can't apply a multi-search query to your data within Spark. In order to search the data easily, you'll need to use the DataFrame structure, which lets you run SQL queries.
As for your second question, I'm afraid that depends on your cluster configuration and data size, given that the multi-search query only runs on Elasticsearch itself.
You can benchmark your cluster's performance using a load-testing tool like Tsung or Apache JMeter, and then decide on the number of queries that suits your use case (time-wise or size-wise, for example).
Personally, I use Tsung, as it is very light and easy to configure and use.
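Before setting up a full load test, a rough comparison of batch sizes can be sketched in a few lines of Python. This is only an illustration, not a replacement for Tsung or JMeter: `time_batches` and `send_batch` are hypothetical names, and `send_batch` stands for whatever function submits one group of queries to your cluster and waits for the response.

```python
import time

def time_batches(queries, batch_sizes, send_batch):
    """Return the average seconds per query for each candidate batch size.

    send_batch is a hypothetical callable (an assumption, not a real API):
    it takes one list of queries, submits them together, and blocks until done.
    """
    if not queries:
        raise ValueError("need at least one query to benchmark")
    results = {}
    for size in batch_sizes:
        start = time.perf_counter()
        # Send the whole query list in consecutive groups of `size`.
        for i in range(0, len(queries), size):
            send_batch(queries[i:i + size])
        elapsed = time.perf_counter() - start
        results[size] = elapsed / len(queries)
    return results
```

Running it with a few candidate sizes against a test cluster gives a first idea of where the per-query cost stops improving; the numbers will depend on your own cluster and data.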
EDIT: I have written a blog post on basic load testing of Elasticsearch with Tsung. You can find it here.
The connector doesn't support msearch, simply because it's quite complicated to map the results back to the Hadoop/Spark consumer, and because doing so in parallel is fairly difficult.
What is your use case? What are you trying to achieve?
Thank you for all the responses, and sorry for the late reply.
I agree that it's hard to map the results back to the Spark consumer. That is one of the reasons I stopped pursuing this approach.
I currently work on a project that needs to match a lot of queries (a contextual advertising system). Basically, we need to match a lot of ads (each described by a set of keywords) against the content of each URL that goes into our system. At first, I thought we could use a search engine to match keywords quickly. But the result was that we had to join thousands of RDDs, which became very slow in practice. As I understand it, we can send multiple queries in one search (for example with msearch) to use the system more effectively, so we hoped the execution time would improve.
After several tries, we found that it might be impossible, so we gave up on this approach. By matching the content directly using an appropriate data structure instead, we get more reasonable times than with the search-based idea. However, Elasticsearch can still be useful for small updates.
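The post does not say which data structure was used. As one possible illustration, here is a minimal sketch that assumes an in-memory inverted index from keyword to ad ids, with single-word keywords only. The names `build_keyword_index` and `match_ads` and the sample ads are made up for the example.

```python
import re
from collections import defaultdict

def build_keyword_index(ads):
    """Map each lowercased keyword to the set of ad ids that list it.

    ads: dict of ad id -> list of single-word keywords (an assumed layout).
    """
    index = defaultdict(set)
    for ad_id, keywords in ads.items():
        for kw in keywords:
            index[kw.lower()].add(ad_id)
    return index

def match_ads(text, index):
    """Return the ids of ads that share at least one keyword with the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    matched = set()
    for tok in tokens:
        # .get avoids inserting empty entries into the defaultdict
        matched |= index.get(tok, set())
    return matched

ads = {"ad1": ["Spark", "hadoop"], "ad2": ["shoes"]}
index = build_keyword_index(ads)
print(match_ads("Running Spark on a cluster", index))
```

With this layout, each page costs one pass over its tokens instead of one query per ad. In Spark, an index like this could be shared with the executors and applied to each document, which avoids joining one RDD per query.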
I have the same problem. I created 5,000 queries to ES, and it looks like ES cannot handle these 5,000 queries in Spark.
However, I can do 100, and it is quick.
So, what was your final solution?
Hi Weizhi Li,
You can combine the following approaches:
- Combine some queries at a time, like you said.
- If the number of queries is large, just check by hand whether a document matches any of them.
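For the first approach, here is a rough sketch of splitting a large query list into groups and building one _msearch request body per group. It only builds the newline-delimited JSON text (one header line and one body line per query) and does not send anything. The index name "pages", the field "content", and the group size of 100 (the size reported as quick above) are assumptions for the example.

```python
import json

def chunk(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def msearch_body(index, queries):
    """Build the newline-delimited JSON body for one _msearch request.

    Each query contributes a header line naming the index, followed by a
    line holding the search body. The whole payload ends with a newline.
    """
    lines = []
    for q in queries:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"query": q}))
    return "\n".join(lines) + "\n"

# Hypothetical keyword queries; "content" is an assumed field name.
keywords = ["spark", "hadoop", "shoes"]
queries = [{"match": {"content": kw}} for kw in keywords]
bodies = [msearch_body("pages", group) for group in chunk(queries, 100)]
```

Each entry in `bodies` can then be posted to the _msearch endpoint with whatever HTTP client you already use, and the group size tuned against your own cluster.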
You are right. I now do the bucket query, then save the results to S3 sequence.
The speed is slow.