Determining the cause of data loss

(oguz) #1

Hi All,

I'm using the jdbc input plugin to transfer data from Postgres to Elasticsearch. My sample data is 6028 records in Postgres, but a different number of records gets transferred to Elasticsearch each time I try. Sometimes all 6028 records get transferred and sometimes fewer.
There are no errors in the Elasticsearch or Logstash logs.
My Logstash config is as follows:

input {
        jdbc {
                jdbc_connection_string => "jdbc:postgresql://"
                jdbc_user => "user"
                jdbc_password => "pass"
                jdbc_driver_library => "/home/postgresql-42.0.0.jar"
                jdbc_driver_class => "org.postgresql.Driver"
                statement => "SELECT * FROM booking_search"
        }
}

filter {
        json {
                source => "booking"
        }
        json {
                source => "join_field"
        }
}

output {
        stdout { codec => json_lines }
        elasticsearch {
                "hosts" => "localhost:9200"
                "index" => "bookings"
                "routing" => "%{booking_id}"
                "document_id" => "%{booking_id}"
        }
}

My booking_ids are unique, and as I wrote above, sometimes all the data is transferred without a problem.
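
For reference, a quick way to double-check that uniqueness (assuming psql access; table and column names as in the config above): since document_id is set to booking_id, duplicate ids would silently overwrite each other and lower the count in Elasticsearch.

# Total rows vs. distinct booking_ids; the two numbers should match.
psql -t -c "SELECT COUNT(*) AS total, COUNT(DISTINCT booking_id) AS distinct_ids FROM booking_search;"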

What can I do to find the cause of the data loss?

(Mark Walkom) #2

That stdout output is kinda pointless with that many records.

Are you sure it is not the plugin tracking state between runs, i.e. picking up from sql_last_value?

(oguz) #3

I looked at the state part now; I am not using the sql_last_value parameter. I also tried with clean_run => true just to be sure, and I still get a different number of records on each try.
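
For completeness, the plugin also persists its state in a metadata file between runs (by default ~/.logstash_jdbc_last_run, unless last_run_metadata_path is set elsewhere), so deleting that file before a run is another way to be sure nothing is carried over:

# Remove the persisted sql_last_value so the next run starts fresh.
rm -f ~/.logstash_jdbc_last_run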

(Mark Walkom) #4

How are you calculating the number of records in the table versus what was processed versus what is in Elasticsearch?

(oguz) #5

For Logstash, I don't know how to check the number of processed records; for the DB and Elasticsearch I'm running a count query.

curl -XGET localhost:9200/bookings/_count?pretty -H 'Content-Type: application/json' -d '{
  "query" : {
    "match_all" : { }
  }
}'

which currently returns:

{
  "count" : 5833,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  }
}
(oguz) #6

I added a file output:

file {
        path => "/var/log/logstash/logstash-%{+YYYY-MM-dd}.log"
        codec => json_lines
}
Then found the booking ids:

curl -XGET "localhost:9200/bookings/_search?pretty&from=0&size=10000" | grep '"_id"' | cut -d '"' -f 1,4 --output-delimiter=',' > booking-ids.txt

Then I found the difference on the Postgres side using these ids.
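
Roughly like this (the \copy export and the file names are just what I used here, adjust as needed):

# Export the ids Postgres knows about, one per line.
psql -c "\copy (SELECT booking_id FROM booking_search ORDER BY booking_id) TO 'pg-ids.txt'"
# booking-ids.txt holds "<whitespace>,<id>" pairs, so keep only the id column and sort both lists.
cut -d ',' -f 2 booking-ids.txt | sort > es-ids.txt
sort pg-ids.txt > pg-ids-sorted.txt
# Ids present in Postgres but missing from Elasticsearch.
comm -23 pg-ids-sorted.txt es-ids.txt > missing-ids.txt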

When I check with an id from the txt file, I can find it in Elasticsearch. And when I use one of the missing ids (from the Postgres diff), I cannot find it in Elasticsearch.

Coming back to the file output: when I look for a missing id inside the Logstash output file, I cannot find it there either. So it seems Logstash never even tried to send it. Any idea why that might happen?
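
The check itself was just a grep against the file output (the id below is a placeholder):

# If a missing id never appears in Logstash's own file output,
# the event was lost before any output ran.
grep '12345' /var/log/logstash/logstash-*.log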

(oguz) #7

If anyone is interested, I gave up on using Logstash and went with the bulk API instead. Now I can transfer my data without any loss. There are other unsolved topics describing the same problem, so this might really be a bug...
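
In case it is useful to someone else, the bulk request format looks roughly like this: two NDJSON lines per record, an action line and the document itself. The id and fields below are placeholders, and older Elasticsearch versions also expect a _type in the action line.

# Build a small bulk body: action metadata, then the document, one pair per record.
cat > bulk.ndjson <<'EOF'
{ "index" : { "_index" : "bookings", "_id" : "12345", "routing" : "12345" } }
{ "booking_id" : 12345, "booking" : { "status" : "confirmed" } }
EOF

# Send it; the newline at the end of the file is required by the bulk API.
curl -s -XPOST 'localhost:9200/bookings/_bulk' \
     -H 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson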

(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.