Performance problem on es-hadoop + Spark

(venkatesh sapram) #1

I am new to Elasticsearch. I have 1 crore (10 million) documents in Elasticsearch, so for better performance I thought I would use es-hadoop.
I ran one simple query, which returns 4000 documents, directly on Elasticsearch and also through es-hadoop with Spark.
Output:
Elasticsearch - 2.5 seconds
es-hadoop + Spark - 13 seconds

Why is there this much difference? Is there any way to tune it?

(Costin Leau) #2

How are you doing the benchmark? ES performs the query and returns the first 10 docs. With Spark, ES-Hadoop does the same query but the results have to be parsed into objects. If you are using count in Spark (RDD or DataFrame), that translates to all the data being pulled into Spark for it to be counted (that's how it works).
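To make the "first 10 docs" point concrete, here is a minimal sketch (not from the thread) of the two request bodies being compared. It relies on the fact that the Elasticsearch search API returns 10 hits unless `size` is set. The helper names `build_search_body` and `hits_returned` and the `match_all` query are assumptions for illustration; the original query is not shown in the thread.

```python
import json

# Elasticsearch returns 10 hits per search request unless "size" is set.
DEFAULT_SIZE = 10


def build_search_body(query, size=None, offset=0):
    """Build a search request body (hypothetical helper, not an es-hadoop API).

    Leaving size as None mimics a request that relies on the server default.
    """
    body = {"query": query}
    if size is not None:
        body["size"] = size
    if offset:
        body["from"] = offset
    return body


def hits_returned(body, total_matches):
    """Number of documents a single request with this body would bring back."""
    size = body.get("size", DEFAULT_SIZE)
    offset = body.get("from", 0)
    return max(0, min(size, total_matches - offset))


# Placeholder query; the thread does not show the real one.
query = {"match_all": {}}

default_body = build_search_body(query)          # relies on the default size
full_body = build_search_body(query, size=4000)  # asks for every match at once

print(json.dumps(default_body), hits_returned(default_body, 4000))
print(json.dumps(full_body), hits_returned(full_body, 4000))
```

A like-for-like benchmark would time the second form against the Spark job, since a Spark `count` pulls every matching document rather than the first page.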

Also, if you are thinking of using es-hadoop to get better performance than Elasticsearch, then I'm sorry to disappoint, but that is never going to happen, since es-hadoop is still Elasticsearch behind the scenes. In fact, I recommend spending some quality time on the docs, since the architecture and concepts are explained there in detail.

(venkatesh sapram) #3

Hi Costin,
Thanks for the clarification.
To answer your question about the benchmarking:
* Even with plain Elasticsearch, I am using a Spring wrapper on top of Elasticsearch. Using a paginated query, I bring back all 4000 documents in one shot, parsed into Java objects. That happens in 2.5 seconds.

I want to understand when I should use the es-hadoop plugin.

(Costin Leau) #4

A paginated query is by definition the opposite of bringing everything in one shot.
As for Spark, if you look into the background you'll notice it starts several other processes and tasks, as by definition it is distributed. This clearly has an overhead as opposed to Spring/Spring Data.
Spark is used for data crunching, as it is a computation framework.
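The overhead argument can be sketched with a toy cost model: a distributed job pays a fixed startup cost before any work is split across workers, so it only wins once there is enough data to amortise that cost. Everything below is an illustrative assumption: the function names, the ideal linear speedup, and the parameter values are made up, not measured, and are not a breakdown of the 2.5 s and 13 s figures reported above.

```python
def single_process_time(n_docs, per_doc_s):
    """Time for one process to handle every document itself."""
    return n_docs * per_doc_s


def distributed_time(n_docs, per_doc_s, startup_s, workers):
    """Fixed startup cost plus the work split evenly (ideal speedup assumed)."""
    return startup_s + n_docs * per_doc_s / workers


def break_even_docs(per_doc_s, startup_s, workers):
    """Document count above which the distributed job becomes faster."""
    if workers <= 1:
        return float("inf")  # no parallelism, so the overhead is never repaid
    return startup_s / (per_doc_s * (1 - 1 / workers))


# Hypothetical parameters chosen only to show the shape of the trade-off.
PER_DOC_S = 0.0005   # assumed cost to fetch and parse one document
STARTUP_S = 10.0     # assumed cost to start the job and schedule tasks
WORKERS = 4          # assumed number of parallel tasks

for n in (4_000, 1_000_000):
    single = single_process_time(n, PER_DOC_S)
    dist = distributed_time(n, PER_DOC_S, STARTUP_S, WORKERS)
    print(f"{n:>9} docs: single={single:.1f}s distributed={dist:.1f}s")

print("break-even:", round(break_even_docs(PER_DOC_S, STARTUP_S, WORKERS)), "docs")
```

Under these assumed numbers a 4000-document query is far below the break-even point, which matches the direction of the result in the first post, while a much larger scan would favour the distributed job.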

As for when es-hadoop should be used, this is again explained in the docs.

(venkatesh sapram) #5

Hi Costin,
I know that I am putting a heavy load on Spring. I am in R&D mode, choosing which technology to use for my analytics. That is the reason I am trying both sides. One of my main goals for this task is performance.
Today we have all our data in MySQL. We want to migrate to NoSQL, which will give us better queries, aggregations and ranking. That is the reason I am trying both ends.
So my understanding of the es-hadoop plugin was wrong. Thanks for the clarification.
I saw the es-hadoop plugin one way, but it was implemented for a different reason.
Can you suggest a good tool for my requirements, if you are interested?

Venkatesh Sapram.
