I have a single-node ES cluster with 26 GB RAM, 4 CPUs and a 13 GB heap. I am loading data into ES using threading, in bulks of 5000 documents; the number of threads running varies from 5 to 10. The problem I am facing is that a true/false flag I set on each document is being lost by ES: every other data field exists in the final documents except that particular flag. Some documents contain the flag and some don't, at a ratio of roughly 2:7. The piece of code that introduces the flag is a fairly simple if/else block:

if foo:
    doc['flag'] = True
else:
    doc['flag'] = False
I have created a mapping for the index and explicitly set the field type to boolean, so I am unable to find a reason for the loss of the field. Any help would be greatly appreciated.
Yes, that is the problem. I have checked the injector code and verified that the flag exists in the list of fields of every document; but when the documents are sent over to ES, the flag doesn't exist in _source.
I am overwriting the existing documents. I'm using Python to inject data into ES; this is my injector code:
import csv
import json
import os

from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")

# open() does not expand "~", so expand the paths explicitly
entries = csv.DictReader(open(os.path.expanduser("~/entries.csv")))
images_ids = json.load(open(os.path.expanduser("~/image_ids.json")))

# dicts are unhashable, so a set() would raise TypeError; use a list
to_send = []
for entry in entries:
    if entry['image_id'] in images_ids:
        entry['duplicate'] = True
    else:
        entry['duplicate'] = False
    action = {
        "_index": "my_index",
        "_type": "entries",
        "_id": entry['id'],
        "_source": entry,
    }
    to_send.append(action)
    if len(to_send) >= 5000:
        helpers.bulk(client, to_send)
        to_send = []

# flush the final partial batch, otherwise the last < 5000 docs are never indexed
if to_send:
    helpers.bulk(client, to_send)
As you can see, it is a simple if/else block that sets the duplicate flag. Is this the correct way, or should I use another method to index the documents?
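As an alternative to batching by hand, a minimal sketch of a generator-based approach: `helpers.bulk` accepts any iterable of actions and chunks the stream itself (its `chunk_size` parameter controls batch size). The names below mirror the injector above; `generate_actions` is a hypothetical helper, not part of the library.

```python
def generate_actions(entries, images_ids):
    """Yield one bulk action per CSV row, tagging the duplicate flag."""
    for entry in entries:
        # same duplicate-flag logic as the if/else in the injector
        entry['duplicate'] = entry['image_id'] in images_ids
        yield {
            "_index": "my_index",
            "_type": "entries",
            "_id": entry['id'],
            "_source": entry,
        }

# Against a live cluster this would be driven like:
# from elasticsearch import Elasticsearch, helpers
# client = Elasticsearch("http://localhost:9200")
# helpers.bulk(client, generate_actions(entries, images_ids), chunk_size=5000)
```

This avoids the need to track a partial batch yourself, since the helper drains the generator to the end.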
No, it fails randomly for any number of docs.
By reproducing do you mean that I should try indexing the documents using the bulk api in ES using REST instead of the method I currently use?
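One way to reproduce outside the injector is to POST the same documents straight to the REST _bulk endpoint. A sketch of how that newline-delimited payload is built, assuming the index/type names from the injector above (`bulk_payload` is a hypothetical helper):

```python
import json

def bulk_payload(docs):
    # The _bulk body is NDJSON: one action line, then the document's
    # source line, repeated per document, ending with a final newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps(
            {"index": {"_index": "my_index", "_type": "entries", "_id": doc["id"]}}
        ))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

The resulting payload can be sent with, e.g., `curl -H 'Content-Type: application/x-ndjson' --data-binary @payload.ndjson http://localhost:9200/_bulk`; if the duplicate flag survives this path, the problem is on the Python side rather than in ES.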