My database contains 500,000 documents in a single index, and every week I need to update some fields of every document. I use Scrapy to get the data to update for each document.
Instead of updating the documents one by one, I would like to build a request that updates the first 2000 documents, then the next 2000, and so on, to make this more efficient.
I already do this when creating the documents for the first time, using helpers.bulk(es, self.actions), where self.actions holds the individual actions.
I've read a lot of topics and similar questions but I can't find the answer: if I use 'op_type' : 'update', the fields that I don't want to update are not kept. Furthermore, if I use a script with the update API, I can't update 2000 documents at the same time.
The content to be merged as part of the update needs to be in an object called doc, not _source. It errors (at least here in 6.0 alpha2) if you pass _source as part of an action with _op_type: update. This works OK:
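As a rough illustration of the point about doc, here is a minimal sketch of building partial-update actions for helpers.bulk. The function names (build_update_actions, run_bulk_update), the index name, and the field names are made up for the example; chunk_size is the helpers.bulk keyword that controls how many actions go into each request, so 2000 matches the batch size asked about. The _type key is only added when a type name is passed in, on the assumption that a 6.x cluster still expects one.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def build_update_actions(
    index: str,
    updates: Iterable[Tuple[str, Dict[str, Any]]],
    doc_type: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one partial-update action per (document id, changed fields) pair."""
    for doc_id, changed_fields in updates:
        action = {
            "_op_type": "update",   # partial update instead of the default "index"
            "_index": index,
            "_id": doc_id,
            "doc": changed_fields,  # merged into the stored document; not "_source"
        }
        if doc_type is not None:
            # Assumption: only needed on clusters that still use mapping types.
            action["_type"] = doc_type
        yield action


def run_bulk_update(es, index, updates, doc_type=None, chunk_size=2000):
    """Send the update actions to Elasticsearch, chunk_size actions per request.

    `es` is an already configured elasticsearch.Elasticsearch client.
    """
    from elasticsearch import helpers  # imported here so the builder has no dependency

    return helpers.bulk(
        es,
        build_update_actions(index, updates, doc_type),
        chunk_size=chunk_size,
    )
```

With something like run_bulk_update(es, "my-index", [("1", {"price": 10})]), only the price field of document 1 should change; fields that are not listed under doc are left as they were.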