Logstash adding duplicate rows for every run

Hello World,
We implemented a solution to push Microsoft SQL Server data, in the form of simple rows, to Logstash, which indexes the data into Elasticsearch. The data (SQL rows and columns) is being pulled successfully using a simple jdbc input plugin, and Logstash is indexing it into Elasticsearch. But every time we push the same data again, it adds duplicate records with the same content but with a different "_id". So every time we pump the data from our client to Logstash it keeps duplicating the records, and it's becoming messier. We tried setting up the uuid filter (https://www.elastic.co/guide/en/logstash/current/plugins-filters-uuid.html) to fix this duplication issue, but that didn't solve it. It keeps duplicating the data with new "_id" and new "@uuid" values.
We tried many combinations of the uuid setup; none of them worked.
One blogger said that if we can get a column in SQL named "_id" then Logstash won't create this unique identifier with random values (had to try this).
Any input is appreciated here.

Thanks
Vikram Y

The elasticsearch output's document_id option can be used to set the id of the document in ES. If you pick a document id that can remain constant each time you fetch the data from the database then Logstash has a chance to update the existing documents instead of creating new ones. If the source database table has a primary key then that's what you'd use. Otherwise you'll have to find some other set of columns that won't change as the rows of the source table are updated.
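For example, a minimal sketch of such an output, assuming the source table has a primary key column named "id" (a placeholder; substitute your own key column and hosts):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex"
    # Reuse the primary key as the document id so re-running the query
    # updates the existing documents instead of creating new ones.
    document_id => "%{id}"
  }
}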

Thanks Magnusbaeck for the reply here.
However, I was under the impression that setting up the UUID means setting the document_id, and that didn't work for me.
Is there a way of setting that specific "document_id" value within the forwarder configuration?

Please do let me know.

Thanks
Vikram Y

Is there a way of setting that specific "document_id" value within the forwarder configuration?

I'm not sure exactly what you mean. The document id is set in the elasticsearch output. That string could be based on one or more other fields, whose values can be set earlier in the pipeline. Perhaps you can use an example to explain.

Sorry Magnusbaeck if my statement was not clear.
We are not giving the output straight to Elasticsearch here. We are using redis/logstash (server level) to stack, queue & index the data before it reaches Elasticsearch.
Please see the syntax below:
input {
  jdbc {
    jdbc_driver_library => "C:\Program Files\sqljdbc_6.0\enu\sqljdbc42.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://sqlserver/instance:1433;"
    jdbc_user => "username"
    jdbc_password => "password"
    statement_filepath => "storedproc file path"
  }
}

filter {
  mutate {
    add_field => { "index_prefix" => "indexprefixname" }
  }
}

output {
  redis { host => "redisurl" data_type => "list" key => "logstash" }
  stdout { codec => "rubydebug" }
}

At this point, can I add a "filter" that uses a unique value from the database output as a primary key for the document_id, so that the data won't get duplicated?

Thanks
Vikram Y

Unless you're deleting the primary key fields you can just reference them when you set the document_id option later on, but if you prefer you can set a temporary field in the file above and reference that field instead.

File with jdbc input plugin:

filter {
  mutate {
    add_field => {
      "[@metadata][document_id]" => "%{primary_key_1}%{primary_key_2}"
    }
  }
}

File with elasticsearch output:

output {
  elasticsearch {
    ...
    document_id => "%{[@metadata][document_id]}"
  }
}

Again, if you don't delete the primary key fields the configuration above is equivalent to just this:

output {
  elasticsearch {
    ...
    document_id => "%{primary_key_1}%{primary_key_2}"
  }
}

Sure, we are not deleting the primary key fields here.
In this case, the output is a bunch of SQL rows and columns. Let's say I have a column named "startinterval" and I want to use the value from this particular column as the primary key for the document_id. How can I reference the column value of "startinterval" from each row and assign it as the "primary_key" here?

I tried to give the column name (startinterval) as the document_id, and it's only taking it as a static value; the output is only one row with the "_id" set to that static value. I figured that's not the right way to reference the data here.

output {
  elasticsearch {
    ...
    document_id => "%{startinterval}"
  }
}

Thanks
Vikram Y

Actually Magnusbaeck, adding the metadata field worked for me.
Let me run a few more tests and I will get back to you.
Thanks a lot for the help here.
Appreciate it.

Thanks
Vikram Y

Yup, it worked.
Thanks again for your replies, sir.

Thanks
Vikram Y

Hi Magnusbaeck, I want to revisit this old topic again. As I showed before, the output is passed to Redis, Redis queues it for Logstash, and Logstash indexes the data and puts it into the Elasticsearch indexes. In this scenario I tried to add the document_id field with a primary key (a unique value from the database). When I did that it created blank indexes (no data at all). So it seems the redis/logstash setup didn't like me passing the document_id value manually from the forwarder; it seems it decides what the value should be. So I figured that I need to set this document_id not within the forwarder but rather in the redis/logstash (indexer) configuration.
I need your input on this issue we are facing.

However, your suggestion to pump the data straight to Elasticsearch with the document_id in the output worked for us. But we are using redis/logstash in between for creating time-based indexes to keep everything standard.

Please do let me know.

Thanks
Vikram Yerneni

Hi Magnusbaeck, we fixed the issue by adding an if statement in the output section of Logstash and putting the document_id portion there.
Here is the link; I believe you are the one who suggested this over there: http://stackoverflow.com/questions/30226940/logstash-output-to-elasticsearch-with-document-id-what-to-do-when-i-dont-have
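For reference, a rough sketch of what that conditional output can look like on the indexer, assuming a field named "startinterval" carries the unique key and "index_prefix" holds the index name (placeholders, not our exact config):

output {
  if [startinterval] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "%{index_prefix}-%{+YYYY.MM.dd}"
      # Use the unique column value as the document id so re-runs
      # update existing documents instead of duplicating them.
      document_id => "%{startinterval}"
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "%{index_prefix}-%{+YYYY.MM.dd}"
    }
  }
}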

Thanks man..
