Elasticsearch bulk index missing some records


(Dilip Kumar) #1

Hi,
I am using ES 5.6 with the Python client to bulk index log files that arrive frequently.
What do I need to take care of when bulk indexing so that no records are missed? Please help.


(Christian Dahlqvist) #2

Are you looking at the bulk response and checking that all documents were reported as successfully written?


(Dilip Kumar) #3

We are not checking the response. How can we handle it so that all records are indexed successfully?
Is there any parameter or anything else we should take care of? Please suggest.
Our Python client setup: es = Elasticsearch([{'host': '192.168.1.xxx', 'port': 7205, 'timeout': 60},{'host': '192.168.1.yyy', 'port': 7205, 'timeout': 60}])


(Christian Dahlqvist) #4

You need to look at the bulk response and check that all records were successful. If some did fail, you need to handle those errors, e.g. by retrying them.
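
A minimal sketch of that check, assuming the response is the dict returned by the Python client's `es.bulk(...)` call. The helper name `split_bulk_response` is hypothetical, and treating any status of 300 or above as a failure is an assumption:

```python
def split_bulk_response(response):
    """Return (succeeded_ids, failed_items) from a bulk response dict."""
    succeeded, failed = [], []
    for item in response.get("items", []):
        # Each item has a single key naming the action (index, create, update, delete).
        action, result = next(iter(item.items()))
        if "error" in result or result.get("status", 500) >= 300:
            failed.append({
                "action": action,
                "_id": result.get("_id"),
                "status": result.get("status"),
                "error": result.get("error"),
            })
        else:
            succeeded.append(result.get("_id"))
    return succeeded, failed


# Hand-written example response, not real cluster output.
example = {
    "errors": True,
    "items": [
        {"index": {"_id": "a_1", "status": 201, "result": "created"}},
        {"index": {"_id": "a_2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception"}}},
    ],
}
ok_ids, failed = split_bulk_response(example)
print(ok_ids)
print([f["_id"] for f in failed])
```

Items that failed with status 429 (rejected because the bulk queue was full) are the usual candidates for a retry after a short pause. Failures such as mapping errors will most likely fail again if resent unchanged.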


(Dilip Kumar) #5

What is the possibility that we get no error in the response for a record, and that record is still not indexed in ES?


(Christian Dahlqvist) #6

I do not think that should happen. If you get an acknowledgement without error from Elasticsearch the document has been indexed, although it may not yet be searchable unless a refresh has run.


(Dilip Kumar) #7

Yes, you are right, but we are processing a file with approx. 500 records. Most documents were indexed, but a few records are missing, with no error reported. What should we take care of?


(Christian Dahlqvist) #8

Are these documents reported as successfully indexed in the bulk response? How are you checking which documents are missing? Have you run a refresh or waited for one to occur before checking if they have been indexed? Are you allowing Elasticsearch to assign document IDs?


(Dilip Kumar) #9

We have a backup of the processed files. In the same file of 500 records, a few records are not stored in ES.


(Dilip Kumar) #10

We are supplying our own IDs. Our sample docs:
{"index": {"_index": "events", "_type": "cm", "_id": "1132052186974525_65_1"}}
{"DT": "2018-07-04T10:27:15", "RE": "1", "PT": "2018-07-05T09:13:26", "ZI": "1132052186974525", "RT": "65"}
{"index": {"_index": "events", "_type": "cm", "_id": "1132052186974525_65_2"}}
{"DT": "2018-07-04T10:27:16", "RE": "2", "PT": "2018-07-05T09:13:26", "ZI": "1132052186974525", "RT": "65"}


(David Pilato) #11

Have you run a refresh or waited for one to occur before checking if they have been indexed?

Could you answer that?

Could you share the full output of the Bulk Response that your python job is getting?
If it is too big for this forum, upload it as a gist on gist.github.com and share the link here.


(Christian Dahlqvist) #12

Are you sure your document IDs are unique?
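
One way to check this, sketched here under the assumption that the bulk body is available as newline-delimited JSON text like the sample docs above (the function name is hypothetical): count how often each `_index`/`_type`/`_id` combination appears in the action lines. An `index` action with an ID that already exists overwrites the earlier document without reporting an error, so duplicates show up as fewer stored documents than records in the file.

```python
import json
from collections import Counter


def duplicate_ids(bulk_body):
    """Return {(index, type, id): count} for IDs used more than once in a bulk body."""
    lines = [ln for ln in bulk_body.splitlines() if ln.strip()]
    counts = Counter()
    # Assumes only "index"/"create" actions, so lines alternate: action, source.
    for action_line in lines[0::2]:
        action = json.loads(action_line)
        meta = next(iter(action.values()))
        counts[(meta.get("_index"), meta.get("_type"), meta.get("_id"))] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hand-written sample in the same shape as the docs above; the repeated ID is deliberate.
body = "\n".join([
    '{"index": {"_index": "events", "_type": "cm", "_id": "1_65_1"}}',
    '{"RE": "1"}',
    '{"index": {"_index": "events", "_type": "cm", "_id": "1_65_1"}}',
    '{"RE": "2"}',
    '{"index": {"_index": "events", "_type": "cm", "_id": "1_65_2"}}',
    '{"RE": "3"}',
])
print(duplicate_ids(body))
```

In the bulk response, an overwritten document would typically be reported with `'result': 'updated'` and a `_version` above 1 rather than `'created'`.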


(Dilip Kumar) #13

Yes, the doc IDs are unique. The default refresh setting is applied, which is false.


(Christian Dahlqvist) #14

Default refresh time is 1 second. Is that what you are using?


(Dilip Kumar) #15

Yes, we are using the default.


(Christian Dahlqvist) #16

Do you have any non-default settings for Elasticsearch?


(Dilip Kumar) #17

No, we are using the default settings.


(Dilip Kumar) #18

I have observed the following issue in bulk indexing:

  • A record that is included in the bulk data post gets no entry in the response for that record.
  • So my conclusion is that the record is going missing somewhere in ES.

I am sharing a few logs (DEBUG mode):
1. 2018-07-05 17:30:10,716 - root - DEBUG - File: /home/developer/p1_py/logs/eventlogs/event2.log.2_01 -- Response:{'errors': False, 'items': [ ..... {'index': {'_shards': {'total': 2, 'successful': 2, 'failed': 0}, 'created': True, '_index': 'campevents', '_version': 1, 'result': 'created', '_type': 'cmev', 'forced_refresh': True, 'status': 201, '_id': '3160379186122643_72_2'}}
2018-07-05 17:30:01,674 - root - DEBUG - Event data {"index": {"_type": "cmev", "_index": "campevents", "_id": "3160379186122643_72_2"}}:

2. 2018-07-05 17:00:01,757 - root - DEBUG - Event data {"index": {"_index": "campevents", "_id": "1993761866142132_77_10", "_type": "cmev"}}:

In the above logs, "_id": "1993761866142132_77_10" appeared only in the Event data log and did not appear in the Response:{'errors': False output.

Please review the above logs and give us your opinion.
** We also changed the refresh interval of that index to 30 sec and set refresh: False. Please suggest if we are missing anything, so that there is no chance of losing data.
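
To narrow down a mismatch like the one described in this post, here is a sketch that compares the IDs sent in one bulk request with the IDs in the response to that same request. The function name is hypothetical, and it assumes `actions` is the list of action metadata dicts that were sent and `response` is the dict the client returned for that exact call:

```python
def ids_missing_from_response(actions, response):
    """Return IDs that were sent in a bulk request but have no item in its response."""
    sent = [next(iter(a.values())).get("_id") for a in actions]
    returned = {
        next(iter(item.values())).get("_id")
        for item in response.get("items", [])
    }
    return [doc_id for doc_id in sent if doc_id not in returned]


# Hand-written data for illustration only, not taken from the logs above.
actions = [
    {"index": {"_index": "campevents", "_type": "cmev", "_id": "x_1"}},
    {"index": {"_index": "campevents", "_type": "cmev", "_id": "x_2"}},
]
response = {"errors": False, "items": [{"index": {"_id": "x_1", "status": 201}}]}
print(ids_missing_from_response(actions, response))
```

The bulk API returns one item per action, in request order, so a non-empty result would suggest that the logged request and the logged response belong to different calls, or that the record never made it into the request body.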


(system) #19

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.