Timestamp field being passed in epoch with Hadoop Library


(Wayne Taylor) #1

Hi Team,

I followed the instructions and can now write my ORC data to Elasticsearch, but my data source has no native timestamp field, and even after I convert the value to a timestamp it shows up in Elasticsearch as a number.

Below are my steps:

  1. Load pyspark and pass in the Elasticsearch Hadoop JAR:
    Downloads/spark/bin/pyspark --jars ~/Downloads/elasticsearch-hadoop-6.3.0.jar
  2. Create a data frame from a local ORC file:
    df = spark.read.format("orc").load("/Users/wtaylor/Downloads/TEST/*")
  3. Create a Temp Table so I can query my ORC and aggregate:
    usage = df.registerTempTable("esexample")
  4. Run my aggregation SQL against the temp table and cache the results:
    aggUrldf = spark.sql(aggSql).cache()

Note: in the SQL, my source date field is in epoch milliseconds, but I convert it to a timestamp:
timestamp(from_unixtime(start_time/1000)) as start_time
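
For reference, here is a minimal plain-Python sketch of the same conversion (epoch milliseconds to a timestamp), independent of Spark. The function name `epoch_ms_to_utc` is hypothetical and only for illustration. It always uses UTC, whereas Spark's from_unixtime uses the session time zone, so the two may differ by an offset.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Divide by 1000 because datetime.fromtimestamp expects seconds.
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# A sample value in epoch milliseconds.
sample_ms = 1530444648000
print(epoch_ms_to_utc(sample_ms).isoformat())
```

This is only a quick way to check what a given epoch value should look like once it is treated as a date.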

  5. I then write to ES using the following:
    aggUrldf.write.format("org.elasticsearch.spark.sql").option("es.nodes.wan.only","true").option("es.nodes", esUrl).mode("Overwrite").option("es.net.http.auth.user",esUser).option("es.net.http.auth.pass",esPassword).save("indexname/doctype")

I verified that my data is in ES, but start_time is stored as a numeric epoch value. See this example:

{
  "_index": "indexname",
  "_type": "test",
  "_id": "qFdja2QBCIhbyqjdz7hd",
  "_score": 1,
  "_source": {
    "id": "21590385",
    "origination_airport": "KCLT",
    "destination_airport": "KSEA",
    "start_time": 1530444648000,
    "client_ip": "10.34.11.162",
    "url": "gateway.icloud.com",
    "rx_total_bytes": 828,
    "tx_total_bytes": 2578
  }
}

I was unable to get any combination of the settings from https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#cfg-multi-writes-format to work.
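
One possible workaround, sketched here under assumptions and not verified against this setup: an Elasticsearch `date` field accepts epoch milliseconds with its default format, so declaring start_time as a date in the index mapping before writing should make the numeric value be treated as a date. The sketch below only builds the request body for creating the index in the 6.x style (mapping nested under the document type). The helper name `build_date_mapping` is hypothetical, and whether the connector keeps a pre-created mapping when writing with mode("Overwrite") is an assumption to check.

```python
import json


def build_date_mapping(doc_type: str, field: str) -> dict:
    """Build an index-creation body that maps `field` as a date.

    Assumes the Elasticsearch 6.x layout, where properties are nested
    under the document type name.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # The default date format also accepts epoch_millis.
                    field: {"type": "date"}
                }
            }
        }
    }


# "doctype" and "start_time" match the names used in this post.
body = build_date_mapping("doctype", "start_time")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the index-creation request (PUT indexname) before running the Spark write.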

Any ideas?

Thanks
Wayne


(Wayne Taylor) #2

After working with the ES team on GitHub, it turns out this is a bug: https://github.com/elastic/elasticsearch-hadoop/issues/1173

