Hive queries to read ES data taking too long | Need suggestions to improve


My ES version is 1.3.2
I setup an AWS EMR cluster (emr-4.2.0) with Hive 1.0.0 to provide an SQL-like interface to our data in ES. Since, our Analytics team prefers SQL like queries rather than ES.
I'm using the latest ES-Hadoop connector (elasticsearch-hadoop-2.2.0)

The setup was pretty easy and I'm able to connect to my ES (which is on a remote server) from Hive console.

However, queries that typically take few seconds in ES are taking 2 hours from Hive!
For e.g.
select count(*) from my_table where member_id = 1234;

I'm looking for suggestions to make this faster. Am I doing anything incorrectly?


To add some more information:
the time taken for the above Hive query (select count(*) from my_table where member_id = 1234;) across different shards was:

  1. Shard 0, which has 3.18 GB data, 8102573 docs took 59 mins.
  2. Shard 8, which has 0.02 GB data, 32432 docs took 16 secs.
    Some other shards have 7 GB of data, looks like it'll take 2-3 hrs on those shards.

There's not much that can be done here since Hive does not pushes down the query. That is the count in Hive means actually moving all the data from ES to Hive so it can count it.
ES-Hadoop has no visibility into the query being executed and thus cannot push it down (like in Spark SQL).

P.S. Upgrading your software stack should help - even if not performance wise, it will eliminate a good chunk of bugs.

Thanks for your reply Costin, I am now trying Spark SQL and the pushdown is working great!

I've created 2 more queries: