Logstash duplicating records while reading from Elasticsearch and writing to BigQuery

Hi, everyone.

Every time Logstash runs, it writes all records from my Elasticsearch index to my BigQuery dataset.
The problem is that the records get duplicated on every run.
My goal is for only the records that were updated or inserted in the index to be updated/inserted in BigQuery.

Is that possible?

This is my config file:

    input {
            elasticsearch {
                    hosts => ["https://myclusteraddress:9200/"]
                    index => "myindex*"
                    user => "myusename"
                    password => "mypassword"
                    docinfo => true
            }
    }

    filter {

        mutate {

            join => { "originResponseFiles" => "," }

            rename => {
                "[account][type]"   => "[account][accountType]"
                "[account][agency]" => "[account][accountAgency]"
                "[account][number]" => "[account][accountNumber]"
            }

            remove_field => ["@timestamp", "cardBrand", "@version"]
        }
        ruby {
            code => "
                # copy every field nested under [account] up to the top level,
                # then drop the original [account] object
                event.get('account').each {|k, v|
                    event.set(k, v)
                }
                event.remove('account')
            "
        }
    }

    output {
            #stdout {
            #       codec => rubydebug
            #}
            google_bigquery {
                    project_id => "data-prod-248920"
                    dataset => "sandbox"
                    table_prefix => "retorno_bloqueio_domicilio"
                    batch_size => 1000
                    id => "ES_to_BQ"
                    table_separator => ""
                    csv_schema => "liquidId:STRING,statusDescription:STRING,paymentDate:STRING,rejectionDate:STRING,accountAgency:STRING,accountNumber:STRING,accountType:STRING,accountAccount:STRING,amount:FLOAT,documentNumber:STRING,id:STRING,updatedAt:STRING,originRequestFile:STRING,merchantName:STRING,errors:STRING,expectedDate:STRING,originResponseFiles:STRING,bankName:STRING,status:STRING,type:STRING"
                    json_key_file => "/somepath/somekey.key"
                    error_directory => "logs"
                    date_pattern => ""
                    flush_interval_secs => 30
            }
    }

Thanks!

That's the way the Elasticsearch input works: every run re-reads everything the query matches. You will need to figure out a way to create a unique ID for each event, and then overwrite them on the BigQuery side.
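
For example, something along these lines would carry the Elasticsearch document _id through to the BigQuery row. This is only a sketch: es_doc_id is an illustrative field name that you would also need to add to csv_schema, and depending on the version of the elasticsearch input the metadata path is [@metadata][_id] or [@metadata][input][elasticsearch][_id]. As far as I know the google_bigquery output only inserts rows, so the actual overwrite/dedup would still have to happen inside BigQuery afterwards:

    filter {
        mutate {
            # docinfo => true exposes the source document's _id in the event
            # metadata; copy it into a regular field so it reaches BigQuery.
            # es_doc_id is an illustrative name, not something the plugin requires.
            add_field => { "es_doc_id" => "%{[@metadata][_id]}" }
        }
    }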

I do have a unique id (the "id" column). I even have a column with the date of the last update.

I can work some logic on that.
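
For instance, something like this might work as the input (just a sketch: I'm assuming "updatedAt" is a date field in the index, and the one-hour look-back window has to match the schedule):

    input {
        elasticsearch {
            hosts => ["https://myclusteraddress:9200/"]
            index => "myindex*"
            user => "myusename"
            password => "mypassword"
            docinfo => true
            # run hourly and only pull documents changed since the last run;
            # the updatedAt field name and the 1h window are assumptions
            schedule => "0 * * * *"
            query => '{ "query": { "range": { "updatedAt": { "gte": "now-1h" } } } }'
        }
    }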

Do you know how I can make the BigQuery output plugin just update a document based on its document id?

Thanks.

I don't, sorry.
