Hi Team,
I have been able to follow the instructions to get my ORC data into Elasticsearch, but I am having an issue: my data source does not have a timestamp field, and even after I convert the field to a timestamp, it shows up in Elasticsearch as numeric.
Below are my steps:
- Load pyspark and pass in the Elasticsearch Hadoop JAR:
Downloads/spark/bin/pyspark --jars ~/Downloads/elasticsearch-hadoop-6.3.0.jar
- Create a data frame from a local ORC file:
df = spark.read.format("orc").load("/Users/wtaylor/Downloads/TEST/*")
- Create a Temp Table so I can query my ORC and aggregate:
usage = df.registerTempTable("esexample")
- Run my aggregation SQL against the temp table and cache the results:
aggUrldf = spark.sql(aggSql).cache()
Note: in the SQL, my source date field is in epoch milliseconds, but I convert it to a timestamp:
timestamp(from_unixtime(start_time/1000)) as start_time
- I then write to ES using the following:
aggUrldf.write.format("org.elasticsearch.spark.sql").option("es.nodes.wan.only","true").option("es.nodes", esUrl).mode("Overwrite").option("es.net.http.auth.user",esUser).option("es.net.http.auth.pass",esPassword).save("indexname/doctype")
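For readability, the same write can be sketched with the connector options collected in a dict. This is only a restatement of the call above, not a fix. The helper names build_es_options and write_to_es are hypothetical, and the arguments stand in for esUrl, esUser and esPassword.

```python
def build_es_options(es_url, es_user, es_password):
    """Collect the elasticsearch-hadoop settings used in the write above."""
    return {
        "es.nodes.wan.only": "true",
        "es.nodes": es_url,
        "es.net.http.auth.user": es_user,
        "es.net.http.auth.pass": es_password,
    }


def write_to_es(df, resource, es_url, es_user, es_password):
    """Write a Spark DataFrame to the given index/type (hypothetical helper)."""
    options = build_es_options(es_url, es_user, es_password)
    (
        df.write.format("org.elasticsearch.spark.sql")
        .options(**options)  # keys with dots are passed through as keyword arguments
        .mode("overwrite")
        .save(resource)
    )
```

With the names from the steps above, the call would be write_to_es(aggUrldf, "indexname/doctype", esUrl, esUser, esPassword).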
I verified that my data is in ES, but start_time is stored as a numeric epoch value. See example:
{
"_index": "indexname",
"_type": "test",
"_id": "qFdja2QBCIhbyqjdz7hd",
"_score": 1,
"_source": {
"id": "21590385",
"origination_airport": "KCLT",
"destination_airport": "KSEA",
"start_time": 1530444648000,
"client_ip": "10.34.11.162",
"url": "gateway.icloud.com",
"rx_total_bytes": 828,
"tx_total_bytes": 2578
}
}
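For reference, the start_time value in the document above is still epoch milliseconds. A minimal plain-Python sketch of decoding it, assuming the value is in UTC (the helper name epoch_ms_to_iso is hypothetical):

```python
from datetime import datetime, timezone


def epoch_ms_to_iso(epoch_ms: int) -> str:
    """Convert epoch milliseconds to an ISO 8601 string in UTC."""
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


# The start_time value from the indexed document
print(epoch_ms_to_iso(1530444648000))  # 2018-07-01T11:30:48Z
```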
I was unable to get any combination of the settings from the Elasticsearch for Apache Hadoop configuration page (Configuration | Elasticsearch for Apache Hadoop [8.11] | Elastic) to work.
Any ideas?
Thanks
Wayne