What is the way to approach "INSERT ON DUPLICATE KEY UPDATE" in ES?

Good day,

I have the following "article" document:
{
  "article_text": { "type": "string" },
  "found_at_urls": { "type": "string" }
}

I generate _id myself by hashing the article_text.

An article can appear at different URLs, and I would like to store all of
the URLs it appeared at in the "found_at_urls" field (it's usually more
than 2-5 URLs).
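For reference, a minimal sketch of how I derive the id, assuming SHA-1 (any stable hash of the text would do; the field names match the mapping above):

```python
import hashlib

def article_id(article_text):
    """Derive a stable document _id from the article body, so the
    same text always maps to the same id regardless of where it
    was crawled from."""
    return hashlib.sha1(article_text.encode("utf-8")).hexdigest()

doc = {
    "article_text": "Some article body",
    "found_at_urls": ["http://example.com/a"],
}
print(article_id(doc["article_text"]))  # 40-char hex digest
```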

In SQL I would do INSERT... ON DUPLICATE KEY UPDATE...

How can I approach this problem in ElasticSearch? Preferably in bulk mode,
with support in PyES.

Thank you,
Zaar

--

The default behavior in ElasticSearch is to re-insert a document if the id
is already found. This operation is not an update: the previous document
with that id will be deleted and the new one will be inserted
(atomically). See operation type:
http://www.elasticsearch.org/guide/reference/api/index_.html

Sounds like you want the update API:
http://www.elasticsearch.org/guide/reference/api/update.html However, the
bulk API does not allow updates:
https://github.com/elasticsearch/elasticsearch/issues/1985 Since an update
is still an atomic delete-create, there is no benefit over simply
re-indexing the document. Hopefully your workflow allows you to create the
entire document.
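To illustrate, here is a sketch of what an update-with-upsert request body could look like for this case. The script syntax follows the examples in the linked update API docs (adapt it to your version); only the body is built here, the HTTP call itself is omitted:

```python
import json

def upsert_url_body(article_text, url):
    """Build an update-API request body: if the document exists, a
    script appends the new URL; otherwise the 'upsert' document is
    inserted as-is."""
    return {
        # Runs against an existing document with this id.
        "script": "ctx._source.found_at_urls += url",
        "params": {"url": url},
        # Used only when no document with this id exists yet.
        "upsert": {
            "article_text": article_text,
            "found_at_urls": [url],
        },
    }

body = upsert_url_body("Some article body", "http://example.com/a")
print(json.dumps(body, indent=2))
```

Note the script blindly appends, so the same URL seen twice would be stored twice; deduplicating inside the script is also possible but version-dependent.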

Cheers,

Ivan

On Sun, Jan 20, 2013 at 6:43 AM, Zaar Hai haizaar@gmail.com wrote:


--

So if I have 10 million articles to index, and I cannot calculate their
URLs in advance, I need to use the update API (with upserts, obviously),
and there is currently no way to do that in bulk mode. Right?
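Since bulk *indexing* is supported even though bulk *updates* are not, one workaround I can think of is to resolve duplicates client-side before indexing: fetch any existing document for the id, merge the URL lists, and re-index the result through the normal bulk API. A sketch of just the merge step (the get/mget and bulk calls are left out):

```python
def merge_urls(existing_doc, new_doc):
    """Client-side 'upsert': combine an already-indexed document
    (or None if the id is new) with a freshly crawled one, keeping
    the union of their found_at_urls in first-seen order. The result
    is re-indexed with the regular bulk index action."""
    if existing_doc is None:
        return new_doc
    urls = list(existing_doc["found_at_urls"])
    for u in new_doc["found_at_urls"]:
        if u not in urls:
            urls.append(u)
    return {
        "article_text": existing_doc["article_text"],
        "found_at_urls": urls,
    }

old = {"article_text": "t", "found_at_urls": ["http://a"]}
new = {"article_text": "t", "found_at_urls": ["http://a", "http://b"]}
print(merge_urls(old, new)["found_at_urls"])
```

The obvious caveat is a race between the read and the re-index if two workers see the same article concurrently, so it only works cleanly if each article id is handled by one worker at a time.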

Thanks,
Zaar

On Sunday, January 20, 2013 7:31:49 PM UTC+2, Ivan Brusic wrote:
