I have the following "article" document:
{
  "article_text": { "type": "string" },
  "found_at_urls": { "type": "string" }
}
I generate _id myself by hashing the article_text.
An article can appear at several different URLs, and I would like to store all of
the URLs it appeared at in the "found_at_urls" field (it's usually no more than
2-5 URLs).
In SQL I would do INSERT... ON DUPLICATE KEY UPDATE...
How can I approach this problem in Elasticsearch? Preferably in bulk mode and
with support in PyES.
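As a concrete sketch of the scheme described above: the `_id` is derived by hashing `article_text`, so re-indexing the same article always targets the same document. (The hash algorithm and the sample values are illustrative, not from the thread; the field names come from the mapping above.)

```python
import hashlib

def article_id(article_text: str) -> str:
    # Deterministic _id: identical text always hashes to the same id,
    # so the same article indexed twice collides on _id by design.
    return hashlib.sha1(article_text.encode("utf-8")).hexdigest()

doc = {
    "article_text": "Example article body",
    "found_at_urls": ["http://example.com/a"],  # a list; values are illustrative
}
print(article_id(doc["article_text"]))  # 40-char hex string
```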
The default behavior in Elasticsearch is to re-insert a document if the id
is already found. This operation is not an update: the previous document
with that id will be deleted and the new one will be inserted
(atomically). See the index API's operation type (op_type) documentation.
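To make the distinction concrete at the wire level: an "index" action replaces an existing document with the same `_id`, while a "create" action is rejected with a conflict error instead. A minimal sketch of one bulk entry in the API's two-line NDJSON format (index name and values are made up; this builds the raw request body rather than using a specific client):

```python
import json

def bulk_entry(op: str, index: str, doc_id: str, doc: dict) -> str:
    # One bulk entry is two NDJSON lines: action metadata, then the document.
    # op == "index" overwrites an existing _id; op == "create" fails on conflict.
    action = {op: {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"

print(bulk_entry("create", "articles", "d6b0d82c",
                 {"article_text": "Example article body",
                  "found_at_urls": ["http://example.com/a"]}))
```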
So if I have 10 million articles to index, and I cannot know their URLs in
advance, I need to use the update API (with upserts, obviously), and there is
currently no way to do that in bulk mode. Right?
Thanks,
Zaar
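For what it's worth, Elasticsearch releases after this thread do accept update actions (including upserts) in the bulk API, so the append-a-URL pattern can be batched. A sketch of one such bulk entry, assuming the mapping above and a Painless-style script (both are assumptions; this builds the raw NDJSON body rather than going through PyES):

```python
import json

def upsert_entry(index: str, doc_id: str, article_text: str, url: str) -> str:
    # Bulk "update" entry: the script appends the URL if the document exists
    # and it is not already listed; "upsert" supplies the initial document
    # when no document with this _id exists yet.
    action = {"update": {"_index": index, "_id": doc_id}}
    body = {
        "script": {
            "source": ("if (!ctx._source.found_at_urls.contains(params.url)) "
                       "{ ctx._source.found_at_urls.add(params.url) }"),
            "params": {"url": url},
        },
        "upsert": {"article_text": article_text, "found_at_urls": [url]},
    }
    return json.dumps(action) + "\n" + json.dumps(body) + "\n"
```

Each article would contribute one such entry per URL; concatenating the entries gives the body of a single bulk request.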
On Sunday, January 20, 2013 7:31:49 PM UTC+2, Ivan Brusic wrote:
The default behavior in Elasticsearch is to re-insert a document if the id
is already found. This operation is not an update: the previous document
with that id will be deleted and the new one will be inserted
(atomically). See the index API's operation type (op_type) documentation.