Since Elasticsearch supports a SQL (JDBC) interface as of v6.5, can Spark (RDD or SQL) read data via JDBC? If it is feasible, how should security be handled on both sides (Elasticsearch and the Spark job)? The environment is currently configured to use user/password over SSL for both Logstash and Kibana.
The JDBC driver can be used by any JVM consumer, including Spark RDD/SQL. Since the driver is for Elasticsearch, it takes care only of its own security details, in particular how to communicate securely with Elasticsearch (more information at https://www.elastic.co/guide/en/elasticsearch/reference/master/sql-jdbc.html).
The Spark side would have to be handled inside the Spark job accordingly.
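For illustration, a minimal sketch of such a job in Scala, assuming hypothetical host, index, and credential names. The `jdbc:es://` URL scheme and the `user`/`password` properties come from the Elasticsearch SQL JDBC docs, though the driver class name has varied across versions (`org.elasticsearch.xpack.sql.jdbc.jdbc.JdbcDriver` in the 6.x line):

```scala
import org.apache.spark.sql.SparkSession

object EsJdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-sql-jdbc-read")
      .getOrCreate()

    // Hypothetical host and index names; the x-pack-sql-jdbc jar must be on
    // the driver and executor classpaths. An https:// URL plus user/password
    // matches the "user/password over SSL" setup on the Elasticsearch side.
    val df = spark.read
      .format("jdbc")
      .option("driver", "org.elasticsearch.xpack.sql.jdbc.jdbc.JdbcDriver")
      .option("url", "jdbc:es://https://es-node.example.com:9200")
      .option("dbtable", "my_index")
      .option("user", "spark_reader")
      .option("password", sys.env.getOrElse("ES_PASSWORD", ""))
      .load()

    df.show(10)
    spark.stop()
  }
}
```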
Do note that the JDBC driver doesn't offer the parallelization of the Es-Spark/Hadoop integration; on the other hand, it provides much richer and more powerful SQL capabilities.
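If parallel reads matter more than SQL expressiveness, the native connector is the usual route. A sketch using the documented es-hadoop settings (host, index, and user names are placeholders):

```scala
// The elasticsearch-spark connector parallelizes reads across index shards;
// the es.net.* options below are es-hadoop's documented security settings.
val esDf = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node.example.com")
  .option("es.port", "9200")
  .option("es.net.ssl", "true")
  .option("es.net.http.auth.user", "spark_reader")
  .option("es.net.http.auth.pass", sys.env.getOrElse("ES_PASSWORD", ""))
  .load("my_index")
```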
Is there a way to retrieve ALL data from an index, bypassing the defaults (page.size, default 1000)? I guess Spark will not take care of pagination in this case, and the total size is unknown to the job too.
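For reference, a sketch of where that setting can be passed, assuming the `page.size` connection property from the ES SQL docs (placeholder host). As documented, it controls the per-request fetch size rather than capping the total: the driver keeps following the server-side cursor until the result set is exhausted.

```scala
// Sketch: page.size appended as a URL option. A full-index query should still
// return all rows; this only changes the batch size per round trip.
val fullIndex = spark.read
  .format("jdbc")
  .option("driver", "org.elasticsearch.xpack.sql.jdbc.jdbc.JdbcDriver")
  .option("url", "jdbc:es://https://es-node.example.com:9200/?page.size=10000")
  .option("dbtable", "my_index")
  .option("user", "spark_reader")
  .option("password", sys.env.getOrElse("ES_PASSWORD", ""))
  .load()
```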