[elasticsearch-hadoop] How to specify es.mapping.id value from inside a map?


#1

Hi,

I am using the elasticsearch-hadoop dependency to upload DataFrames to the AWS Elasticsearch service.
I want to specify the _id for every document that is uploaded. However, the document ID resides inside a map within the document itself.
Here is a sample document that I upload to AWS Elasticsearch:
{
  "transactionIds": {
    "purchaseId": "PUID:1234-5678-910"
  },
  "purchaseAttempts": [],
  "timestamp": "2017-12-20T06:39:51.299Z[UTC]",
  "year": 2017,
  "month": 12,
  "day": 20,
  "hour": 6
}

I want purchaseId to be my document's _id.
How can I use the "purchaseId" entry inside the "transactionIds" map as the field for "es.mapping.id"?

Here is the ES configuration that I tried, which fails:
"es.mapping.id", "transactionIds.purchaseId"

I get a runtime exception when I use the above config:

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [DataFrameFieldExtractor for field [[transactionIds, purchaseId]]] cannot extract value from entity [class java.lang.String] | instance [([[null,null,null,null,PUID:934-5439103-6862003,null],WrappedArray(),2017-12-20T06:59:03.900Z[UTC],2017,12,20,6],StructType(StructField(transactionIds,StructType(StructField(transactionId,StringType,true), StructField(marketplaceId,StringType,true), StructField(customerId,StringType,true), StructField(sessionId,StringType,true), StructField(purchaseId,StringType,true), StructField(orderIds,ArrayType(StringType,true),true)),true), StructField(purchaseAttempts,ArrayType(StructType(StructField(attemptId,StringType,true), StructField(attemptEvents,MapType(StringType,StructType(StructField(eventType,StringType,true), StructField(idType,StringType,true), StructField(id,StringType,true), StructField(startTime,LongType,true), StructField(endTime,LongType,true), StructField(attributes,MapType(StringType,StringType,true),true), StructField(paymentOption,StructType(StructField(paymentMethod,StringType,true), StructField(paymentAttrs,MapType(StringType,StringType,true),true)),true)),true),true)),true),true), StructField(timestamp,StringType,true), StructField(year,IntegerType,true), StructField(month,IntegerType,true), StructField(day,IntegerType,true), StructField(hour,IntegerType,true)))]
at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.write(AbstractBulkFactory.java:98)
at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:80)
at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:56)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:159)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:97)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:97)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


#2

Specifying a non-map field such as "timestamp" from the above document as the ID works fine, i.e.:
"es.mapping.id", "timestamp" => works fine
"es.mapping.id", "transactionIds.purchaseId" => throws exception

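One possible workaround, offered as an untested sketch rather than a confirmed fix: since a top-level field such as "timestamp" works for "es.mapping.id", the nested value could be copied to a top-level field before the write, and that field used as the ID. The plain-Python snippet below only illustrates the idea on a single document. The helper promote_nested_id and the top-level field name "purchaseId" are hypothetical names chosen for illustration and are not part of elasticsearch-hadoop. With a Spark DataFrame, the corresponding step would be to add a top-level column derived from the nested field before saving.

```python
from copy import deepcopy


def promote_nested_id(doc, path, target):
    """Return a copy of doc with the value found at the dotted path
    copied to a new top-level key named target (hypothetical helper)."""
    value = doc
    for key in path.split("."):
        value = value[key]  # raises KeyError if any part of the path is missing
    flat = deepcopy(doc)    # leave the caller's document untouched
    flat[target] = value
    return flat


# Sample document from the first post.
doc = {
    "transactionIds": {"purchaseId": "PUID:1234-5678-910"},
    "purchaseAttempts": [],
    "timestamp": "2017-12-20T06:39:51.299Z[UTC]",
    "year": 2017,
    "month": 12,
    "day": 20,
    "hour": 6,
}

flat_doc = promote_nested_id(doc, "transactionIds.purchaseId", "purchaseId")

# Assumption: the connector setting then points at the top-level field,
# the same way "timestamp" was used successfully above.
es_conf = {"es.mapping.id": "purchaseId"}
```

Note that the promoted field would also be stored in the indexed document unless it is dropped by some other means, so this trades a duplicated value for a usable ID.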
