Hi,
I am trying to use Spark (PySpark), and my goal is to query a subset of data from a huge Elasticsearch index that matches a particular condition on a column.
For example, if my columns are name, age, and timestamp,
I want to get only those records where the timestamp matches the current timestamp.
Could someone please advise on how to achieve this?
Hi @Khushboo_Kaul. The best way to get started is probably with spark-sql. One way is to create a temporary table from your index, and then run ordinary SQL queries on it:
from pyspark.sql import SQLContext

# `sc` is the SparkContext provided by the pyspark shell
sqlContext = SQLContext(sc)
# Register the index as a temporary table via the elasticsearch-hadoop connector
# (the connector jar must be on Spark's classpath)
sqlContext.sql("CREATE TEMPORARY TABLE myTable USING org.elasticsearch.spark.sql OPTIONS (resource 'my_index')")
sqlContext.sql("select * from myTable").show()
Once you have that working, you can add a WHERE clause on the timestamp field just as you would in any ordinary SQL query, as shown below.
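A minimal sketch, assuming the name, age, and timestamp columns from your example and Spark SQL's built-in current_timestamp() function:

# Hypothetical filter on the temporary table registered above;
# swap in whatever predicate fits your data
sqlContext.sql("SELECT name, age, timestamp FROM myTable WHERE timestamp = current_timestamp()").show()

Note that an exact equality match against the current timestamp will rarely return anything in practice, so you will usually want to compare against a range (e.g. everything newer than some cutoff) instead.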