I'm trying to get an ODBC connection to ES that pushes down as much of the query as possible to ES. My approach is to use the Spark SQL Thrift server together with the es-hadoop Spark SQL connector, but I'm having a hard time getting it to work.
How crazy is this idea?
Is it possible to get this to work?
Are there better ways to get an ODBC connection to ES?
There will be a lot of wrapping, and each call to ES will have to go through multiple layers.
ES doesn't offer any ODBC support per se, so using something like Spark would give you this. Does it work? In theory yes; in practice probably, especially since the Thrift server is involved.
I am using this with Tableau. I had to flatten my data (remove nested JSON), and with Spark's HiveThriftServer2 I can then connect from Tableau. It's not crazy, and it's very easy to set up.
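For reference, a minimal sketch of that kind of setup, assuming Spark 1.x started from spark-shell with the es-hadoop JAR on the classpath; the ES host, index/type, and table names below are placeholders:

```scala
// launched with: spark-shell --jars elasticsearch-hadoop-2.1.1.jar  (version is a placeholder)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// spark-shell's sqlContext is already a HiveContext when Spark is built with Hive support
val hiveContext = sqlContext.asInstanceOf[HiveContext]

// Load a (flattened) index/type as a DataFrame through es-hadoop's Spark SQL data source
val df = hiveContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node-1:9200")   // placeholder ES host
  .load("myindex/mytype")                 // placeholder index/type

// Register it under a name the ODBC/JDBC client (e.g. Tableau) can see
df.registerTempTable("myindex")

// Expose this context over the Thrift (JDBC/ODBC) interface
HiveThriftServer2.startWithContext(hiveContext)
```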
But it can be a bit slow, because Spark has to fetch all the data from ES via HTTP before processing the SQL query. Make sure the Spark workers are located close to the ES nodes, and also check out "CACHE TABLE" in Spark SQL.
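Continuing the sketch above (same placeholder names), caching the registered table keeps Spark from re-reading the whole index from ES for every query:

```scala
// Eagerly materialize the ES-backed table in Spark's cache so
// subsequent queries hit memory instead of ES over HTTP
hiveContext.sql("CACHE TABLE myindex")

// Or cache only a slice of the data under a separate name
// ("year" is a hypothetical column)
hiveContext.sql("CACHE TABLE myindex_recent AS SELECT * FROM myindex WHERE year = 2015")
```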
Another way is to export your ES data to a big JSON file and load that file with Spark. This is the nicest way to play with the data: no nested JSON issues, just pleasure.
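A sketch of that variant, assuming the dump is in the line-delimited JSON format Spark expects (one document per line); the path and table name are placeholders:

```scala
// Each line of the dump must be a single JSON document
val dump = sqlContext.read.json("hdfs:///dumps/myindex.json")

// Expose the snapshot to the Thrift server just like a live ES-backed table
dump.registerTempTable("myindex_snapshot")
dump.printSchema()  // verify the inferred schema looks the way you expect
```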
Don't forget to add the elasticsearch-hadoop JAR to the classpath when launching the Hive Thrift server.
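A minimal sketch of such an invocation, assuming the stock start-thriftserver.sh script and an es-hadoop 2.x JAR; the master URL, JAR path, version, and ES host are placeholders:

```bash
# Ship the connector JAR with the Thrift server and point es-hadoop at ES
# (es.* settings passed via --conf need the spark. prefix)
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master spark://spark-master:7077 \
  --jars /opt/jars/elasticsearch-hadoop-2.1.1.jar \
  --conf spark.es.nodes=es-node-1:9200
```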