How to upsert an initial value into Elasticsearch using Spark?

Terran_Yiu · September 17, 2015, 1:02am

With HTTP POST, the following script can insert a new field createtime or update lastupdatetime:

curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"doc": {
    "lastupdatetime": "2015-09-16T18:00:00"
},
"upsert": {
    "createtime": "2015-09-16T18:00:00",
    "lastupdatetime": "2015-09-16T18:00:00"
}
}'
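
To make the semantics of this request concrete, here is a minimal in-memory sketch in Python. It is not Elasticsearch code: the `store` dict and the `apply_update` helper are hypothetical stand-ins for an index and the `_update` endpoint, and the second timestamp is made up for illustration.

```python
def apply_update(store, doc_id, doc, upsert):
    """Mimic an update request that carries both "doc" and "upsert".

    store is a hypothetical in-memory stand-in for an index (id -> document).
    """
    if doc_id in store:
        # The document exists: merge the partial "doc" into it.
        store[doc_id].update(doc)
    else:
        # The document is missing: insert the "upsert" body as a new document.
        store[doc_id] = dict(upsert)
    return store[doc_id]


store = {}

# First sighting: nothing stored yet, so the "upsert" body is inserted.
apply_update(
    store, "1",
    doc={"lastupdatetime": "2015-09-16T18:00:00"},
    upsert={"createtime": "2015-09-16T18:00:00",
            "lastupdatetime": "2015-09-16T18:00:00"},
)

# Second sighting: the document exists, so only "doc" is merged in.
apply_update(
    store, "1",
    doc={"lastupdatetime": "2015-09-17T09:00:00"},
    upsert={"createtime": "2015-09-17T09:00:00",
            "lastupdatetime": "2015-09-17T09:00:00"},
)

print(store["1"])
```

After the second call, createtime still holds the first timestamp while lastupdatetime has moved forward.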

But in a Spark script, after setting "es.write.operation": "upsert", I don't know how to insert createtime at all. The official documentation only mentions es.update.script.*... So, can anyone give me an example?

eliasah · September 19, 2015, 10:11am

I'm not entirely sure about that, but what I would try is setting es.mapping.id in the SparkConf to the field that holds the key of the document you want to upsert.

import org.apache.spark.SparkConf

val conf = new SparkConf()
[...]
conf.set("es.write.operation", "upsert")
// You can set the document field/property name containing the document id.
// Replace <id> with the desired field name.
conf.set("es.mapping.id", "<id>")

I haven't tried this but I think that it should work!

Let us know if it works for you!

Terran_Yiu · September 20, 2015, 6:33am

Thank you for your help.

In my case, I want to save information about Android devices from logs into one Elasticsearch type, and set each device's first appearance time as createtime.

If the device appears again, only lastupdatetime should be updated, leaving createtime as it was. The document id is the Android ID: if the id exists, update lastupdatetime; otherwise insert createtime and lastupdatetime. So the settings here are (in Python):

    conf = {
        "es.resource.write": "stats-device/activation",
        "es.nodes": "NODE1:9200",
        "es.write.operation": "upsert",
        "es.mapping.id": "id"
        # ???
    }
 
    rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf
    )

I don't know how to insert a new field if the id does not exist...

eliasah · September 20, 2015, 8:02am

The id doesn't exist where? In Elasticsearch, or in the data you are reading?

Terran_Yiu · September 20, 2015, 8:44am

Well, id is just in the input data.

eliasah · September 20, 2015, 9:02am

Then, theoretically speaking, it should only be available in your data when you perform an upsert.

Terran_Yiu · September 20, 2015, 11:15am

So you mean I can't upsert any new field that is not in my data when I use Spark? OK, I got it... Thank you.

eliasah · September 20, 2015, 1:38pm

I didn't say that. Let's first agree on the definition of an upsert action, which is the following:

If the document does not already exist, the contents of the upsert element will be inserted as a new document. If the document does exist, then the script will be executed instead.

Which in your case will be:

update on id, since you defined es.mapping.id -> id
insert a document with _id equal to id
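
One way to read this is sketched below in plain Python. It rests on an assumption that is not confirmed anywhere in this thread: that with "es.write.operation": "upsert" and no script, the connector uses each RDD document both as the partial update and as the insert body. The `store` dict, the `upsert_same_doc` helper and the sample values are hypothetical.

```python
def upsert_same_doc(store, doc, id_field="id"):
    """Upsert where the same document serves as update and as insert body.

    Assumption (not confirmed in the thread): this mirrors what the connector
    does when no update script is configured.
    """
    doc_id = doc[id_field]  # the field named by es.mapping.id
    if doc_id in store:
        store[doc_id].update(doc)   # update on id
    else:
        store[doc_id] = dict(doc)   # insert a document with _id equal to id
    return store[doc_id]


store = {}
upsert_same_doc(store, {"id": "xxxxx", "lastupdatetime": "2015-09-20"})
upsert_same_doc(store, {"id": "xxxxx", "lastupdatetime": "2015-09-21"})
print(store["xxxxx"])
```

Under that assumption both branches write exactly the same fields, so there is no place to put a value such as createtime that should be set only on insert.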

Terran_Yiu · September 21, 2015, 3:11am

Sorry, I couldn't log in to this site yesterday.

You are right. The requirement in my case is just:

Update lastupdatetime if the id already exists in Elasticsearch;
Insert lastupdatetime and createtime=lastupdatetime when the id does not exist in Elasticsearch.

The source doc is just like this:

{
    'id': 'xxxxx',
    'lastupdatetime': '2015-09-20'
}

The problem is that a new field, createtime, has to be added to the source doc when the id does not exist in ES yet. I don't know how to solve this problem in Spark.

Here is a solution that is not perfect:

add createtime to every source doc (rdd.map(lambda d: dict(d, createtime=d['lastupdatetime'])))
save to es with create and ignore 409
remove the createtime field
save to es again with update

After these steps, I get what I want. But is there a better solution?
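
The four steps can be sketched in plain Python to check that they produce the intended result. This is an in-memory simulation under stated assumptions, not Spark or elasticsearch-hadoop code: `store` stands in for the index, and the helper names and sample documents are hypothetical.

```python
def create_ignoring_conflicts(store, docs):
    """Pass 1: 'create' each doc; an existing id is skipped (the ignored 409)."""
    for d in docs:
        if d["id"] not in store:
            store[d["id"]] = dict(d)


def update_existing(store, docs):
    """Pass 2: 'update' merges each doc into the one already stored."""
    for d in docs:
        if d["id"] in store:
            store[d["id"]].update(d)


def two_pass_write(store, docs):
    # Step 1: add createtime to every source doc.
    with_createtime = [dict(d, createtime=d["lastupdatetime"]) for d in docs]
    # Step 2: create, ignoring conflicts for ids that already exist.
    create_ignoring_conflicts(store, with_createtime)
    # Steps 3 and 4: write the original docs (no createtime) as updates.
    update_existing(store, docs)


store = {
    "old": {"id": "old", "createtime": "2015-09-01", "lastupdatetime": "2015-09-01"},
}
two_pass_write(store, [
    {"id": "old", "lastupdatetime": "2015-09-20"},
    {"id": "new", "lastupdatetime": "2015-09-20"},
])
print(store)
```

The existing device keeps its original createtime and gets a new lastupdatetime, while the new device receives both fields, at the cost of writing every document twice.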

eliasah · September 21, 2015, 3:10pm

What is the structure of your final rdd before writing it to es?

Terran_Yiu · September 22, 2015, 1:51am

As in my last post, I currently write the docs to ES twice.
The first time, using create, the rdd structure is

{
    'id': 'xxxxx',
    'lastupdatetime': '2015-09-20',
    'createtime':'2015-09-20',
}

The second time, using update, the rdd structure is

{
    'id': 'xxxxx',
    'lastupdatetime': '2015-09-20'
}

So if the id already exists, create fails and only lastupdatetime is updated. However, I believe these 2 operations can be combined into 1.

Terran_Yiu · September 24, 2015, 10:43am

In the end, I found that elasticsearch-hadoop doesn't support "create if not exists, else do nothing".

Terran_Yiu · September 24, 2015, 3:28pm

Based on this post, I decided to build the elasticsearch-hadoop jar myself.

costin · September 26, 2015, 2:35pm

Sorry for the late reply.
It seems the update support is not complete. I've seen you already raised an issue (great!) here, so let's use that to track progress.

Enhancing the doc only to do another update is far from the proper solution. And doing bulk requests which create exceptions which later on are ignored is even worse.
This should be properly fixed in one call not 3 plus exceptions.
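
For illustration only, here is a sketch of what a single call could carry if the request were built by hand against Elasticsearch's bulk API rather than through elasticsearch-hadoop. It only constructs the newline-delimited body and sends nothing; the `bulk_upsert_body` helper is hypothetical, and the field names follow the documents shown earlier in the thread.

```python
import json


def bulk_upsert_body(docs, id_field="id"):
    """Build a bulk body with one update action per doc.

    Each action carries a "doc" (applied when the id exists) and an "upsert"
    (inserted when it does not), so createtime is set only on first insert.
    """
    lines = []
    for d in docs:
        action = {"update": {"_id": d[id_field]}}
        payload = {
            "doc": d,
            "upsert": dict(d, createtime=d["lastupdatetime"]),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(payload))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_upsert_body([{"id": "xxxxx", "lastupdatetime": "2015-09-20"}])
print(body)
```

Such a body would be posted to the `_bulk` endpoint of the target index and type (stats-device/activation in the settings above); whether the connector can be configured to emit it is exactly what the issue tracks.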
