Elasticsearch-Hadoop formatting multi resource writes issue

petroslamb · March 5, 2018, 12:08pm

I am interfacing Elasticsearch with Spark using the Elasticsearch-Hadoop plugin, and I am having difficulty writing a dataframe with a timestamp type column to Elasticsearch.

The problem occurs when I try to write using dynamic/multi resource formatting to create a daily index.

From the relevant documentation I get the impression that this is possible. However, the Python example below fails to run unless I change the column type in my dataframe to date.

from datetime import datetime, timedelta

import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf()
conf.set('spark.jars', 'elasticsearch-spark-20_2.11-6.1.2.jar')
conf.set('es.nodes', '127.0.0.1:9200')
conf.set('es.read.metadata', 'true')
conf.set('es.nodes.wan.only', 'true')

# Create the session used below (the pyspark shell provides `spark` automatically)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

now = datetime.now()
before = now - timedelta(days=1)
after = now + timedelta(days=1)

cols = ['idz', 'name', 'time']
vals = [(0, 'maria', before), (1, 'lolis', after)]
time_df = spark.createDataFrame(vals, cols)

When I try to write, I use the following:

time_df.write.mode('append').format(
    'org.elasticsearch.spark.sql'
).options(
    **{'es.write.operation': 'index' }
).save('xxx-{time|yyyy.MM.dd}/1')

Unfortunately, this raises an error:

.... Caused by: java.lang.IllegalArgumentException: Invalid format:
"2018-03-04 12:36:12.949897" is malformed at " 12:36:12.949897" at
org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)

On the other hand this works perfectly fine if I use dates when I create my dataframe:

cols = ['idz', 'name', 'time']
vals = [(0, 'maria', before.date()), (1, 'lolis', after.date())]
time_df = spark.createDataFrame(vals, cols)

Is it possible to format a dataframe timestamp to be written to daily indexes with this method, without also keeping a date column around? How about monthly indexes?

Pyspark version:
spark version 2.2.1
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_151

Elasticsearch version
number "6.2.2" build_hash "10b1edd"
build_date "2018-02-16T19:01:30.685723Z" build_snapshot false
lucene_version "7.2.1" minimum_wire_compatibility_version "5.6.0"
minimum_index_compatibility_version "5.0.0"

Mirror question on SO:

james.baiera · March 16, 2018, 11:59pm

"2018-03-04 12:36:12.949897" is malformed at " 12:36:12.949897"

Instead of "2018-03-04 12:36:12.949897", try using the format "2018-03-04T12:36:12". ES-Hadoop has to do quite a bit of type-related gymnastics because it supports many integrations that do not have schemas attached, especially when it comes to dates and timestamps. I suspect the timestamp format coming from Python is just different enough that it trips it up.
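As a rough illustration of the suggestion above, the timestamps could be rendered in the "2018-03-04T12:36:12" form before the dataframe is created. This is only a sketch: the helper name `iso_seconds` is hypothetical, and whether ES-Hadoop then accepts the value for the `{time|yyyy.MM.dd}` pattern has not been verified here.

```python
from datetime import datetime, timedelta


def iso_seconds(ts):
    """Render a datetime as 'YYYY-MM-DDTHH:MM:SS' (hypothetical helper).

    Uses a 'T' separator and drops the fractional seconds, matching the
    format proposed in the reply above.
    """
    return ts.strftime('%Y-%m-%dT%H:%M:%S')


now = datetime.now()
before = now - timedelta(days=1)
after = now + timedelta(days=1)

# Same rows as in the question, but with the time column as an ISO-style string
cols = ['idz', 'name', 'time']
vals = [(0, 'maria', iso_seconds(before)), (1, 'lolis', iso_seconds(after))]

# The timestamp from the error message, before and after conversion
sample = datetime(2018, 3, 4, 12, 36, 12, 949897)
print(str(sample))          # 2018-03-04 12:36:12.949897
print(iso_seconds(sample))  # 2018-03-04T12:36:12
```

Note that this turns the column into a string on the Spark side. If the dataframe already exists, a similar conversion can probably be done on the column with `pyspark.sql.functions.date_format` and the pattern `yyyy-MM-dd'T'HH:mm:ss`, though that variant is likewise untested against ES-Hadoop here.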

