Performance problem on es-hadoop + Spark

ven2286 · January 20, 2016, 3:45pm

I am new to Elasticsearch. I have 1 crore (10 million) documents in Elasticsearch, and I thought using es-hadoop would give better performance.
I ran one simple query that returns 4000 documents, once directly against Elasticsearch and once through es-hadoop with Spark.
Output:
Elasticsearch: 2.5 seconds
es-hadoop + Spark: 13 seconds

Why is there such a big difference? Is there any way to tune it?

costin · January 20, 2016, 4:47pm

How are you doing the benchmark? By default, ES performs the query and returns only the first 10 docs. With Spark, ES-Hadoop runs the same query, but the results have to be parsed into objects. If you are using count in Spark (RDD or DataFrame), that translates to all the data being pulled into Spark for it to be counted (that's how it works).
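One way to make the Spark side of such a benchmark explicit is to write down the read settings that decide how much data crosses the wire. The following is only a minimal sketch: the helper names, the `status` field, the node address and the scroll size are made up for illustration, and `spark.read` assumes a newer Spark session object (on older versions it would be `sqlContext.read`). The `es.*` keys are es-hadoop configuration settings; check the docs of your version before relying on them.

```python
import json


def es_hadoop_read_options(nodes, query, fields=None, scroll_size=1000):
    """Build the option map for an es-hadoop read (hypothetical helper)."""
    options = {
        "es.nodes": nodes,                    # where to find Elasticsearch
        "es.query": json.dumps(query),        # query DSL pushed down to ES
        "es.scroll.size": str(scroll_size),   # docs fetched per scroll request
    }
    if fields:
        # Only parse the listed fields when reading documents.
        options["es.read.field.include"] = ",".join(fields)
    return options


def load_matching_docs(spark, resource, options):
    """Load the matching docs as a DataFrame.

    Needs pyspark and the es-hadoop jar on the classpath, so it is not
    called in this sketch. Per the reply above, calling count() on the
    result pulls all matching data into Spark.
    """
    return (spark.read.format("org.elasticsearch.spark.sql")
            .options(**options)
            .load(resource))


# Illustrative values only: the field name and node address are assumptions.
options = es_hadoop_read_options(
    nodes="localhost:9200",
    query={"query": {"term": {"status": "active"}}},
    fields=["id", "status"],
)
print(options)
```

Whether a larger scroll size or a narrower field list helps at all depends on the documents and the cluster, so treat these as knobs to measure rather than a fix.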

Also, if you are thinking of using es-hadoop to get better performance than Elasticsearch, then I'm sorry to disappoint, but that is never going to happen, since es-hadoop is still Elasticsearch behind the scenes. In fact, I recommend spending some quality time on the docs, since the architecture and concepts are explained there in detail.

ven2286 · January 20, 2016, 5:13pm

Hi Costin,
Thanks for the clarification.
To answer your question about the benchmarking:
* Even on the Elasticsearch side, I am using a Spring wrapper on top of Elasticsearch. Using a paginated query, I bring back all 4000 documents in one shot, and they are parsed into Java objects. That takes 2.5 seconds.

I want to understand when I should use the es-hadoop plugin.

costin · January 20, 2016, 5:54pm

A paginated query is by definition the opposite of bringing everything in one shot.
As for Spark, if you look behind the scenes you'll notice it starts several other processes and tasks, since by definition it is distributed. This clearly has an overhead as opposed to Spring/Spring Data.
Spark is used for data crunching, as it is a computation framework.
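To illustrate the pagination point, here is a small sketch of what "paginated" versus "one shot" means in terms of search requests, using the `from`/`size` parameters of the Elasticsearch search API. The helper name and the page size of 500 are assumptions for the example; only the 4000-document total comes from this thread.

```python
def page_requests(query, total, page_size):
    """Yield one search request body per page, using from/size paging."""
    for offset in range(0, total, page_size):
        yield {
            "query": query,
            "from": offset,
            # The last page may be smaller than page_size.
            "size": min(page_size, total - offset),
        }


# Paginated: 4000 docs in pages of 500 means several round trips.
pages = list(page_requests({"match_all": {}}, total=4000, page_size=500))

# One shot: a single request asking for all 4000 docs at once.
one_shot = list(page_requests({"match_all": {}}, total=4000, page_size=4000))

print(len(pages), len(one_shot))
```

When timing the plain Elasticsearch side, it helps to know which of the two shapes the Spring wrapper is actually sending.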

As for when es-hadoop should be used, this is again explained in the docs.

ven2286 · January 21, 2016, 6:14am

Hi Costin,
I know that I am putting extra load on Spring. I am in R&D mode, choosing which technology to use for my analytics. That is the reason I am trying both sides. One of my main goals for this task is performance.
Today we have all our data in MySQL. We want to migrate to NoSQL, which will give us better queries, aggregations and ranking. That is the reason I am trying both ends.
So my understanding of the es-hadoop plugin was wrong. Thanks for the clarification.
I saw the es-hadoop plugin one way, but it was implemented for a different purpose.
Can you suggest a good tool for my requirements, if you are interested?

Thanks & Regards,
Venkatesh Sapram.
