Determining the cause of data loss

Hi All,

I'm using the JDBC input plugin to transfer data from Postgres to Elasticsearch. My sample data is 6028 records in Postgres. I get a different number of records transferred to Elasticsearch each time I try: sometimes all 6028 records get transferred, sometimes fewer.
There are no errors in the Elasticsearch or Logstash logs.
My Logstash config is below:

input {
        jdbc {
                jdbc_connection_string => "jdbc:postgresql://192.168.0.6:5432/dbname"
                jdbc_user => "user"
                jdbc_password => "pass"
                jdbc_driver_library => "/home/postgresql-42.0.0.jar"
                jdbc_driver_class => "org.postgresql.Driver"
                statement => "SELECT * FROM booking_search"
        }
}

filter {
        json {
                source => "booking"
        }
        json {
                source => "join_field"
        }
}

output {
        stdout { codec => json_lines }
        elasticsearch {
                "hosts" => "localhost:9200"
                "index" => "bookings"
                "routing" => "%{booking_id}"
                "document_id" => "%{booking_id}"
        }
}

My booking_ids are unique, and as I wrote above, sometimes all data is transferred without a problem.

What can I do to find the cause of the data loss?

Checking the stdout output is kind of pointless with that many records.

Are you sure it is not the state tracking described in the Jdbc input plugin | Logstash Reference [8.11] | Elastic docs?

I looked at the state section now; I am not using the sql_last_value parameter. I also tried with clean_run => true just to be sure, and I still get a different number of records on each run.
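
For reference, state tracking would only come into play if the statement itself referenced :sql_last_value, e.g. a sketch like this (last_modified is a hypothetical column, none of this is in my actual config):

input {
        jdbc {
                statement => "SELECT * FROM booking_search WHERE last_modified > :sql_last_value"
                use_column_value => true
                tracking_column => "last_modified"
                tracking_column_type => "timestamp"
                clean_run => true   # resets the stored sql_last_value on each run
        }
}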

How are you calculating the number of records in the table versus what was processed versus what is in Elasticsearch?

For Logstash I don't know how to check the processed records; for the database and Elasticsearch I'm running a count query.

curl -XGET localhost:9200/bookings/_count?pretty -H 'Content-Type: application/json' -d '
{
  "query" : {
    "match_all" : { }
  }
}'

{
  "count" : 5833,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  }
}
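
(As an aside, Logstash itself exposes processed-event counters through its monitoring API; assuming the default API port 9600, something like this shows the pipeline's counters:)

curl -XGET 'localhost:9600/_node/stats/events?pretty'

The response includes "in", "filtered" and "out" counts that can be compared against the table count.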

I added a file output:

file {
        path => "/var/log/logstash/logstash-%{+YYYY-MM-dd}.log"
        codec => json_lines
}
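
Since json_lines writes one event per line, counting lines in that file shows roughly how many events Logstash actually emitted:

wc -l /var/log/logstash/logstash-*.log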

Then I extracted the booking ids from Elasticsearch:

curl -XGET "localhost:9200/bookings/_search?pretty&from=0&size=10000" | grep '"_id"' | cut -d '"' -f 1,4 --output-delimiter=',' > booking-ids.txt

Then I found the difference on Postgres using these ids.
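
Roughly along these lines (a sketch; it assumes booking-ids.txt has been trimmed to one bare id per line and uses the same connection details as the jdbc input above):

psql -h 192.168.0.6 -U user -d dbname -t -A -c "SELECT booking_id FROM booking_search ORDER BY booking_id" > pg-ids.txt
sort booking-ids.txt > es-ids.sorted
sort pg-ids.txt | comm -23 - es-ids.sorted > missing-ids.txt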

When I check with an id from the txt file, I can find it in Elasticsearch. When I use one of the missing ids (from the Postgres diff), I cannot find it in Elasticsearch.

Going back to the file output: when I look for a missing id inside the Logstash output file, I cannot find it there either. So it seems that Logstash never even tried to send it. Any idea why that might happen?
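
(For example, a check like this, with 12345 standing in for one of the missing ids:)

grep '12345' /var/log/logstash/logstash-*.log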

If anyone is interested, I gave up on using Logstash and went with the Bulk API instead. Now I can transfer my data without any loss. There are other unsolved topics with the same problem, so this might really be a bug...
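
For reference, what I mean is the plain Bulk API with an NDJSON body. A minimal sketch, where bookings.ndjson is a hypothetical file and the id and fields are placeholders:

{ "index" : { "_index" : "bookings", "_id" : "123", "routing" : "123" } }
{ "booking_id" : "123" }

One action line plus one source line per document, and the file has to end with a newline. Then send it with:

curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/x-ndjson' --data-binary @bookings.ndjson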

