I have a single-node ES cluster with 26 GB RAM, 4 CPUs and a 13 GB heap. I am loading data into ES using threading, in bulks of 5000 documents; the number of threads running varies from 5 to 10. The problem I am facing is that a true/false flag I set on each document is being lost by ES: every other data field exists in the final documents except that particular flag. Some documents contain the flag and some don't, at a ratio of roughly 2:7. The piece of code that introduces the flag is a fairly simple if/else block:

if foo:
    doc['flag'] = True
else:
    doc['flag'] = False
I have created a mapping for the index and explicitly set the field type to boolean, so I am unable to find a reason for the loss of the field. Any help would be greatly appreciated.
Yes, that is the problem. I have checked the injector code and verified that the flag exists in the list of fields of every document; but when the documents are sent over to ES, the flag doesn't exist in _source.
I am overwriting the existing documents. I'm using Python to inject data into ES; this is my injector code:
import csv
import json
import os

from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")

# open() does not expand "~", so expand the paths explicitly
entries = csv.DictReader(open(os.path.expanduser("~/entries.csv")))
images_ids = json.load(open(os.path.expanduser("~/image_ids.json")))

# dicts are unhashable, so a set() would raise TypeError; use a list
to_send = []
for entry in entries:
    if entry['image_id'] in images_ids:
        entry['duplicate'] = True
    else:
        entry['duplicate'] = False
    action = {
        "_index": "my_index",
        "_type": "entries",
        "_id": entry['id'],
        "_source": entry,
    }
    to_send.append(action)
    if len(to_send) >= 5000:
        helpers.bulk(client, to_send)
        to_send = []

# flush the final partial batch, otherwise the last < 5000 docs are never indexed
if to_send:
    helpers.bulk(client, to_send)
As you can see, it is a simple if/else block that sets the duplicate flag. Is this the correct way, or should I use another method to index the documents?
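As an alternative to batching by hand, a minimal sketch of a generator-based approach: `helpers.bulk` accepts any iterable of actions and chunks the stream itself (its `chunk_size` parameter controls batch size). The names below mirror the injector above; `generate_actions` is a hypothetical helper, not part of the library.

```python
def generate_actions(entries, images_ids):
    """Yield one bulk action per CSV row, tagging the duplicate flag."""
    for entry in entries:
        # same duplicate-flag logic as the if/else in the injector
        entry['duplicate'] = entry['image_id'] in images_ids
        yield {
            "_index": "my_index",
            "_type": "entries",
            "_id": entry['id'],
            "_source": entry,
        }

# Against a live cluster this would be driven like:
# from elasticsearch import Elasticsearch, helpers
# client = Elasticsearch("http://localhost:9200")
# helpers.bulk(client, generate_actions(entries, images_ids), chunk_size=5000)
```

This avoids the need to track a partial batch yourself, since the helper drains the generator to the end.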
No, it fails randomly for any number of docs.
By reproducing do you mean that I should try indexing the documents using the bulk api in ES using REST instead of the method I currently use?
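One way to reproduce outside the injector is to POST the same documents straight to the REST _bulk endpoint. A sketch of how that newline-delimited payload is built, assuming the index/type names from the injector above (`bulk_payload` is a hypothetical helper):

```python
import json

def bulk_payload(docs):
    # The _bulk body is NDJSON: one action line, then the document's
    # source line, repeated per document, ending with a final newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps(
            {"index": {"_index": "my_index", "_type": "entries", "_id": doc["id"]}}
        ))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

The resulting payload can be sent with, e.g., `curl -H 'Content-Type: application/x-ndjson' --data-binary @payload.ndjson http://localhost:9200/_bulk`; if the duplicate flag survives this path, the problem is on the Python side rather than in ES.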