Elasticsearch losing data

(Keshav Agarwal) #1

I have a single-node ES cluster with 26 GB RAM, 4 CPUs and 13 GB of heap space. I am loading data into ES from multiple threads (between 5 and 10 at a time), in bulk requests of 5000 documents each. The problem I am facing is that a true/false flag I load is being lost by ES. Every other field exists in the final documents; only that flag is affected. Some documents contain the flag and some don't, and the ratio of documents with the flag to documents without it is about 2:7. The piece of code that sets the flag is a fairly simple if-else block:

    if foo:
        doc['flag'] = True
    else:
        doc['flag'] = False

I have created a mapping for the index and explicitly set the field type to boolean, but I am unable to find a reason for the loss of the field. Any help would be greatly appreciated.

Note:- I am using ES 2.3.3
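
One way to put exact numbers on that ratio is to count the documents that do and do not have the field. The following is only a sketch, assuming the field is literally named flag and that an elasticsearch-py client is available; the helper names are hypothetical. It relies on the exists query wrapped in a bool query, both of which are available in ES 2.x.

```python
def has_field_query(field):
    """Body matching documents where `field` has an indexed value."""
    return {"query": {"exists": {"field": field}}}


def missing_field_query(field):
    """Body matching documents where `field` has no indexed value."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def flag_counts(client, index, field="flag"):
    """Return (with_field, without_field) using the client's count API."""
    with_field = client.count(index=index, body=has_field_query(field))["count"]
    without_field = client.count(index=index, body=missing_field_query(field))["count"]
    return with_field, without_field
```

Note that exists looks at indexed values rather than at _source, so it complements fetching a few of the affected documents and reading their _source directly.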

(David Pilato) #2

Do you mean that the _source document does not contain the Boolean but does contain the other fields?

If so you have an issue in your injector.

(Keshav Agarwal) #3

Yes, that is the problem. I have gone through the injector code and checked for the flag in the list of fields of every document, and that check reports that the flag exists; but once the documents are sent over to ES, the flag doesn't exist in _source.
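
A check like the one described is most useful when it runs on the exact list of actions handed to the bulk helper, immediately before the call. A minimal sketch, assuming the actions have the _id / _source shape used with helpers.bulk and that the field is called duplicate; the function name is hypothetical:

```python
def ids_missing_field(actions, field="duplicate"):
    """Return the _id of every bulk action whose _source lacks `field`."""
    return [
        action.get("_id")
        for action in actions
        if field not in action.get("_source", {})
    ]
```

Calling this on the batch in each thread right before it is sent, and logging anything it returns, would show whether the flag is already missing on the client side.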

(David Pilato) #4

This cannot happen on the Elasticsearch side.
Maybe you are doing updates rather than inserts with bulk, but you are not checking the response?

Anyhow, you need to provide more details, and maybe your injector code.
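
To actually look at the response per document, one option is to tally the per-item results instead of only the overall success count. This is a sketch under assumptions: it expects (ok, item) pairs of the kind yielded by helpers.streaming_bulk in elasticsearch-py, where each item is a one-key dict such as {"index": {"_id": ..., "status": 201}}; the function name is hypothetical.

```python
from collections import Counter


def summarize_bulk_results(results):
    """Tally (ok, item) pairs by operation type and HTTP status.

    Returns the tally and the list of item details that were not ok.
    """
    tally = Counter()
    failures = []
    for ok, item in results:
        # Each item has a single key: the operation type (index, create, update, delete)
        op_type, info = next(iter(item.items()))
        tally[(op_type, info.get("status"))] += 1
        if not ok:
            failures.append(info)
    return tally, failures
```

It could be fed with helpers.streaming_bulk(client, to_send, raise_on_error=False). For index operations, a status of 201 means the document was created and 200 means an existing document was overwritten, so the tally also shows which operation types really went out.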

(Keshav Agarwal) #5

I am overwriting the existing documents. I'm using Python to inject data into ES. This is my injector code:

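import csv
import json
import os

from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")
# open() does not expand "~", so expand it explicitly
entries = csv.DictReader(open(os.path.expanduser("~/entries.csv")))
images_ids = json.load(open(os.path.expanduser("~/image_ids.json")))
to_send = []  # bulk actions are dicts, which cannot be stored in a set

for entry in entries:
    # Flag the entry as a duplicate when its image id is already known
    if entry['image_id'] in images_ids:
        entry['duplicate'] = True
    else:
        entry['duplicate'] = False
    action = {
        "_index": "my_index",
        "_type": "entries",
        "_id": entry['id'],
        '_source': entry
    }
    to_send.append(action)
    # Send a bulk request once 5000 actions have accumulated
    if len(to_send) >= 5000:
        helpers.bulk(client, to_send)
        to_send = []

# Send whatever is left over after the loop
if to_send:
    helpers.bulk(client, to_send)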

As you can see, it is a simple if-else block that sets the duplicate flag. Is this the correct way, or should I use another method to index the documents?

(David Pilato) #6

Do you read the bulk response?

(Keshav Agarwal) #7

I tried printing it and it prints (5000, [])

I'm assuming that is the number of products indexed and the list of errors.

(David Pilato) #8

Is it always failing for the same doc on every run?
Can you reproduce it?
Ideally with a REST script?

(Keshav Agarwal) #9

No, it fails randomly for any number of docs.
By reproducing, do you mean that I should try indexing the documents through the ES bulk API over REST instead of the method I currently use?

(David Pilato) #10

I meant something we can use to reproduce it on our end.

If it fails randomly, can you print the result of the bulk call again after it has failed?
