Partial update / Python

Hello!

My database contains 500,000 documents in a single index, and I need to update some fields of every document every week. I use Scrapy to collect the data to update for each document.

Instead of updating the documents one by one, I would like to build a single request that updates the first 2000 documents, then another for the next 2000, and so on.

I already do this when creating the documents for the first time, using helpers.bulk(es, self.actions), where self.actions is a list of actions like this:

self.actions = [
    {
        '_index': 'myindex',
        '_type': 'mytype',
        '_id': idItem,
        '_source': {
            'title': 'title1',
            'views': 100,
            'likes': 200,
        },
    },
    {
        '_index': 'myindex',
        '_type': 'mytype',
        '_id': idItem2,
        '_source': {
            'title': 'title2',
            'views': 150,
            'likes': 250,
        },
    },
    # ....
]

I've read a lot of topics and similar questions but I can't find the answer: if I use 'op_type': 'update', it doesn't keep the fields that I don't want to update. Furthermore, if I use a script with the update API, I can't update 2000 documents at the same time.

Do you have a solution (in Python)?

Thanks a lot!

The content to be merged as part of the update needs to be in an object called doc, not _source. It errors (at least here in 6.0 alpha2) if you pass _source in an action whose '_op_type' is 'update'. This works OK:

from elasticsearch import helpers
from elasticsearch import Elasticsearch

indexName = "test"
docType = "doc"
es = Elasticsearch()
es.indices.delete(index=indexName, ignore=[400, 404])
indexSettings = {
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 0
	},
	"mappings": {
		docType: {
			"properties": {
				"name": {
					"type": "keyword"
				},
				"count": {
					"type": "integer"
				}
			}
		}
	}
}
es.indices.create(index=indexName, body=indexSettings)
# Index 100 documents (ids 1..100) in a single bulk request
actions = []
for rowNum in range(1, 101):
	action = {
		"_index": indexName,
		"_op_type": "index",
		"_type": docType,
		"_id": rowNum,
		"_source": {
			"name": "mark",
			"count": 1
		}
	}
	actions.append(action)
helpers.bulk(es, actions)

# Now do partial updates to the first 20 docs (ids 1..20).
# Only "count" changes; "name" is left untouched.
actions = []
for rowNum in range(1, 21):
	action = {
		"_index": indexName,
		"_op_type": "update",
		"_type": docType,
		"_id": rowNum,
		"doc": {
			"count": 2
		}
	}
	actions.append(action)
helpers.bulk(es, actions)
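
For the weekly job in the original question, the same idea can be wrapped in a small generator so the update actions are built lazily from the scraped data. This is only a sketch: build_update_actions and the sample scraped list are hypothetical names made up for illustration, and the index/type names are taken from the question.

```python
def build_update_actions(index_name, doc_type, changes):
    """Yield one bulk 'update' action per (doc_id, partial_fields) pair.

    Hypothetical helper: `changes` is any iterable of
    (document id, dict of fields to merge) tuples.
    """
    for doc_id, partial_fields in changes:
        yield {
            "_op_type": "update",
            "_index": index_name,
            "_type": doc_type,
            "_id": doc_id,
            # Fields to merge go under "doc", not "_source";
            # fields not listed here are left as they are.
            "doc": partial_fields,
        }


# Made-up sample data standing in for the scraped values
scraped = [
    ("id1", {"views": 120, "likes": 210}),
    ("id2", {"views": 180}),
]
actions = list(build_update_actions("myindex", "mytype", scraped))
```

The generator can be passed straight to the helper, e.g. helpers.bulk(es, build_update_actions("myindex", "mytype", scraped), chunk_size=2000). The chunk_size argument controls how many actions go into each bulk request (the default is 500), so you should not need to slice the list into groups of 2000 yourself.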

Thanks a lot, it works now!
