I am facing this issue: the number of events out from the elasticsearch output plugin (as reported by the Logstash monitoring API) is equal to the number of records in the CSV, which means all records in the CSV are sent to ES. Unfortunately, the count in the ES index is always lower when I send a _count request, even after manually flushing with the _flush API. What could be the issue here?
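For context, these are roughly the checks I am comparing (the index name below is a placeholder):

```
# Logstash node stats: per-plugin event counters, including the
# elasticsearch output's "out" counter
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

# Elasticsearch side: flush, then count the documents in the index
curl -s -X POST 'http://localhost:9200/my-index/_flush'
curl -s 'http://localhost:9200/my-index/_count?pretty'
```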
What do you have in the Logstash logs? Some of the documents could have been rejected for some reason, like mapping errors.
Do you create a mapping for your index or are you using dynamic mapping?
If you are using dynamic mapping, Elasticsearch will map a field according to the first value it receives for that field. If, for example, Elasticsearch maps a field as a date and later documents have strings or numbers in that same field, those later documents will be rejected.
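As a rough illustration of what that looks like (the index and field names below are made up):

```
# First document: "timestamp" gets dynamically mapped as a date
PUT my-index/_doc/1
{ "timestamp": "2024-01-01T00:00:00Z" }

# Later document: the same field holds a plain string, so this
# request is rejected with a mapping (parsing) error
PUT my-index/_doc/2
{ "timestamp": "not a date" }
```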
Hi, I am using a custom mapping for the index. There is nothing in the Logstash logs that would suggest records are being rejected. One strange thing is that this happens only when I index in upsert mode. I have also confirmed that the document_id is definitely distinct for each record in the CSV. Also, I want to know whether the out field in the monitoring API reflects the events successfully sent to ES, or whether it just shows the number of events ready to be sent.
Did you enable the DLQ? That is where rejected documents are recorded.
I have enabled the DLQ in Logstash. Still nothing there: the dead_letter_queue.queue_size_in_bytes parameter always stays at 1 in the monitoring API. I have also set up a DLQ pipeline using the dead_letter_queue input plugin and the file output plugin, which writes to a file that I am tailing, but there is no output there either.
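Roughly, the setup is along these lines (the paths here are placeholders, not my exact ones):

```
# logstash.yml
dead_letter_queue.enable: true

# DLQ pipeline: read rejected events and write them to a file
input {
  dead_letter_queue {
    path           => "/usr/share/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  file {
    path => "/tmp/dlq_events.log"
  }
}
```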
Okay good
The interesting thing is this only happens when you use upsert.
It's easy to test:
Do one run with upsert.
Count your records with _count and get a unique count on _id across the documents.
Then do a run with the normal index operation and let Elasticsearch create its own _id; then do a _count and a unique count on the field you're using as the document ID in the upsert run.
My suspicion is that those counts will differ by one, so you actually have a duplicate _id somewhere.
OR
You're ingesting the header or something like that.
Off-by-one errors are often things like that.
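For concreteness, the two checks would look something like this (the index name and the ID field below are placeholders):

```
# Total documents in the index
GET my-index/_count

# Approximate unique count of the field used as the document id
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_ids": {
      "cardinality": { "field": "your_id_field" }
    }
  }
}
```

Keep in mind the cardinality aggregation is approximate at high cardinalities, so treat small differences with care.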
I have tested this way as well. First I synced the records in insert mode, then ran a query with an aggregation to get the unique count of sno (the field I am using as the doc_id in upsert mode); the count is the same as in the CSV. I have also double-checked that the header is not inserted (since I am using a custom mapping this is not possible anyway). Also, the record count in the CSV and the _count in upsert mode differ by millions, not just by 1, and the number of records indexed before it gets stuck differs on each run.
Apologies, not sure where I got the difference of 1 from.
And just to make sure: when you use upsert mode, records with the same id will overwrite each other.
I am not sure what the issue is. If you would like to share your Logstash conf and some records of your data, perhaps we could take a look.
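If it helps, the upsert part of the elasticsearch output would typically look roughly like this (hosts and index are placeholders; sno is the field you mentioned using as the document id):

```
output {
  elasticsearch {
    hosts         => ["http://localhost:9200"]
    index         => "my-index"
    document_id   => "%{sno}"
    action        => "update"
    doc_as_upsert => true
  }
}
```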
I finally found the bug. It is completely unrelated to Logstash/ES. When exporting the CSV from Postgres I am using LIMIT and OFFSET to paginate; however, I did not use an ORDER BY clause in the query, which resulted in data inconsistencies. I somehow missed the fact that LIMIT and OFFSET do not return results in the same sequence every time unless ORDER BY is used alongside them.
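For anyone hitting the same thing, the fix is simply to make the pagination deterministic (table and column names below are placeholders):

```
-- Without ORDER BY the row order is not guaranteed to be stable across
-- pages, so LIMIT/OFFSET pagination can skip or repeat rows.
SELECT *
FROM records            -- hypothetical table
ORDER BY sno            -- unique key used as the document_id
LIMIT 10000 OFFSET 20000;
```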
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.