I'm trying to get an ODBC connection to ES which pushes down as much as possible to ES. I'm trying to use the spark-sql thrift-server and the es-hadoop spark-sql connection but I'm having a hard time getting it to work.
How crazy is this idea?
Is it possible to get this to work?
Are there better ways to get an ODBC connection to ES?
There will be a lot of wrapping and multiple layers that each call to ES will have to go through.
ES doesn't offer any ODBC support per se so using something like Spark, would give you this. Does it work? In theory yes, in practice probably especially since thrift-server is involved.
Can you do an ODBC dirver for ES like MongoDB does.
It can me simple one: odbc driver take in query json request, and gives out json -> table
where fields of table are json fields with dots if nested
field1 | field2.nested1 | field2.nested2 | field3 etc
if field in document is empty it replaces with null in table
I am using this with Tableau, I had to "flat" my data (remove nested JSON), then with Spark HiveThriftServer2 I can connect from Tableau. It's not crazy and very easy to setup.
But it can be a bit slow, because Spark has to fetch all data from ES via HTTP before processing the SQL query. Be sure that Spark workers are located close to ES nodes, also check "CACHE TABLE" from SparkSQL.
Another way is to export your ES data to a big JSON file, and load this file with Spark. This is the best way to play with, no nested JSON issue, just pleasure
Don't forget to add the elastichsearch-hadoop JAR in the job class path, here how I am running the hive server:
/usr/local/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
- elasticsearch-hadoop.jar is located in /opt/jars in my install.