I'm trying to pull data from Elasticsearch using the two commands below. Both returned the same number of records; however, sql_rdd returned every field stored in Elasticsearch, while es_rdd returned only the timestamp, host and message fields specified in the query filter. The query strings are the same. Am I using Spark SQL correctly? How can I make the field filter work for Spark SQL? Thanks a lot!
@zpp Use of anything other than the "query" element in the es.query configuration is not officially supported. In 5.0, the logic for handling the es.query option was unified across all integrations. In 5.2.0 there will be an officially supported avenue for specifying the desired source fields from the request.
Thanks a lot for the information, James.
I assume you're referring to Elasticsearch-Hadoop 5.2.0, which was just released on Jan 31. I did a quick test with this version without upgrading the Elasticsearch cluster (I'm not sure this is OK, but there were no errors). Using the same commands as above, however, both returned all fields rather than the selected ones, even the newAPIHadoopRDD command, which was working with version 2.4.
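For anyone landing on this thread, here is a minimal sketch of how the field selection might be moved out of es.query, following the reply above that only the "query" element is supported there. It assumes the connector version in use supports the es.read.source.filter setting for the RDD path, and that the Spark SQL integration prunes columns when you select them on the DataFrame. The helper names (build_es_conf, read_es_rdd, read_es_dataframe) and the index name in the usage note are hypothetical, not taken from the original commands.

```python
import json


def build_es_conf(nodes, resource, query_body, source_fields=None):
    """Build an ES-Hadoop configuration dict (hypothetical helper)."""
    conf = {
        "es.nodes": nodes,
        "es.resource": resource,
        # Keep only the "query" element; other top-level keys in es.query
        # (for example "_source") are not officially supported.
        "es.query": json.dumps({"query": query_body["query"]}),
    }
    if source_fields:
        # Assumption: this setting is available in the connector version in use.
        conf["es.read.source.filter"] = ",".join(source_fields)
    return conf


def read_es_rdd(sc, conf):
    """RDD path: field selection comes from the conf built above."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )


def read_es_dataframe(sql_context, conf, fields):
    """Spark SQL path: select the columns on the DataFrame instead."""
    # Assumption: the source filter setting is left out here because the
    # Spark SQL integration derives the requested fields from the schema.
    options = {k: v for k, v in conf.items()
               if k not in ("es.read.source.filter", "es.resource")}
    df = (sql_context.read
          .format("org.elasticsearch.spark.sql")
          .options(**options)
          .load(conf["es.resource"]))
    return df.select(*fields)
```

Usage would look something like build_es_conf("localhost", "logs/doc", {"query": {"match_all": {}}}, ["timestamp", "host", "message"]), with the result passed to read_es_rdd(sc, conf) or read_es_dataframe(sqlContext, conf, ["timestamp", "host", "message"]). I have not verified this against a mixed-version setup like the one described above.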