Performance problem on es-hive two-table join

bluesea · April 11, 2016, 1:05pm

Hi all:
I use es-hive to join two big tables. Hive pulls all the data from ES, and it is very slow. I used jstack to print the process information:

I found that pulling data from ES over HTTP is very slow. What about the transport client? Is the transport client faster than the REST client when executing a scroll query?
How can I read from the SearchResponse when I use the TransportClient to execute a scroll? I mean that SearchResponse.getHits returns native types (such as Object, int, etc.), but the MapReduce framework needs Writable types.
Can anyone help?
Thanks!

costin · April 13, 2016, 12:48pm

The reference documentation explains the architecture and why REST is used instead of the transport client.
Hive doesn't provide push-down, so even a simple count requires all the data. Whether you use REST or the transport client, this will heavily impact performance.
Double-check your queries or use something like Spark SQL, which does provide push-down.

P.S. This is a common topic; one was just opened yesterday. The "search" functionality is your friend.

bluesea · April 14, 2016, 3:44pm

Thanks.
That is the point: both REST and the transport client require all the data, and this is the performance bottleneck. I have implemented some simple predicate push-down, such as term query and range query, which can filter out part of the data, but sometimes we still read all the data from ES.
Hive gets ES data through scan + scroll. The scan stage is fast; what concerns me most is whether the transport client is faster than the REST client while scrolling. I'm planning to switch the scroll stage from the REST client to the transport client. Do you have any suggestions about it?
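
For readers following along, the term and range push-down mentioned above amounts to sending Elasticsearch a filtered query instead of a match-all, so fewer documents come back during the scroll. Below is a minimal sketch of building such a query body; it is not the code from this thread. The helper name `build_pushdown_query` and the field names `status` and `created` are hypothetical, and the `bool`/`filter` form assumes Elasticsearch 2.x or later. In ES-Hadoop, a body like this could presumably be supplied through the `es.query` setting.

```python
import json


def build_pushdown_query(term_filters, range_filters):
    """Build a query body from simple equality and range predicates.

    term_filters:  {field: exact_value}
    range_filters: {field: {"gte": ..., "lt": ...}}
    (hypothetical helper, for illustration only)
    """
    clauses = []
    for field, value in term_filters.items():
        # Equality predicate, e.g. WHERE status = 'active'
        clauses.append({"term": {field: value}})
    for field, bounds in range_filters.items():
        # Range predicate, e.g. WHERE created >= ... AND created < ...
        clauses.append({"range": {field: bounds}})
    if not clauses:
        # Nothing to push down: fall back to reading everything.
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"filter": clauses}}}


if __name__ == "__main__":
    query = build_pushdown_query(
        {"status": "active"},
        {"created": {"gte": "2016-01-01", "lt": "2016-04-01"}},
    )
    print(json.dumps(query, indent=2))
```

When no predicate can be translated, the sketch falls back to `match_all`, which corresponds to the case described above where all the data is still read from ES.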

costin · April 21, 2016, 6:10am

If using the transport client had been a better idea overall, we would have opted for it in ES-Hadoop. But that's not the case.