How to avoid elasticsearch duplicate documents

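How do I avoid duplicate documents in Elasticsearch?

The document count of the Elasticsearch index (20,010,253) doesn’t match the line count of the log files (13,411,790). The index holds about 6.6 million more documents than there are log lines.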

Documentation (File input plugin):
"File rotation is detected and handled by this input, regardless of whether the file is rotated via a rename or a copy operation."

NiFi:

A real-time NiFi pipeline copies logs from the NiFi server to the ELK server.
NiFi has rolling log files.
Copying the files takes less than one minute.

Log line count on the ELK server:

wc -l /mnt/elk/logstash/data/from/nifi/dev/logs/nifi/*.log
  13,411,790 total 

Elasticsearch index document count:

curl -XGET 'ip:9200/_cat/indices?v&pretty'
docs.count = 20,010,253 

logstash input conf file:

cat /mnt/elk/logstash/input_conf_files/test_4.conf
input {
  file {
    path => "/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/*.log"
    type => "test_4"
    sincedb_path => "/mnt/elk/logstash/scripts/sincedb/test_4"
  }
}
filter {
  if [type] == "test_4" {
    grok {
      match => {
        "message" => "%{DATE:date} %{TIME:time} %{WORD:EventType} %{GREEDYDATA:EventText}"
      }
    }
  }
}
output {
  if [type] == "test_4" {
    elasticsearch {
      hosts => "ip:9200"
      index => "test_4"
    }
  }
  else {
    stdout {
      codec => rubydebug
    }
  }
}

Here is one example of a duplicate.

The log files contain a single matching entry:

grep -r "2018-02-02 11:31:36,978 ERROR" /mnt/elk/logstash/data/from/nifi/dev/logs/nifi/*.log

/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/nifi-app_2018-02-02_11.0.log:2018-02-02 11:31:36,978 ERROR [Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}

There are four entries in Elasticsearch. One entry has the path "nifi-app_2018-02-02_11.0.log"; the other three have the path "nifi-app.log". nifi-app.log is the rolling log file. I have left out the fourth entry because of the forum's post size limit ("Body is limited to 7000 characters; you entered 7948").

curl -XGET '10.19.19.33:9200/from_nifi_dev_logs_nifi_4/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "query": " (date:\"18-02-02\") AND (time:\"11:31:36,978\")  AND (EventType:\"ERROR\") "
        }
    }
}
'

{
  "_index" : "test_4",
  "_type" : "test_4",
  "_id" : "IMQcWGEBOC31Kjf9gyWS",
  "_score" : 18.249443,
  "_source" : {
    "date" : "18-02-02",
    "path" : "/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/nifi-app_2018-02-02_11.0.log",
    "@timestamp" : "2018-02-02T20:01:59.159Z",
    "EventType" : "ERROR",
    "EventText" : "[Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}",
    "@version" : "1",
    "host" : "hostname",
    "time" : "11:31:36,978",
    "message" : "2018-02-02 11:31:36,978 ERROR [Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}",
    "type" : "test_4"
  }
},
{
  "_index" : "test_4",
  "_type" : "test_4",
  "_id" : "CMEFWGEBOC31Kjf9ZD-n",
  "_score" : 18.249443,
  "_source" : {
    "date" : "18-02-02",
    "path" : "/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/nifi-app.log",
    "@timestamp" : "2018-02-02T19:36:43.919Z",
    "EventType" : "ERROR",
    "EventText" : "[Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}",
    "@version" : "1",
    "host" : "hostname",
    "time" : "11:31:36,978",
    "message" : "2018-02-02 11:31:36,978 ERROR [Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}",
    "type" : "test_4"
  }
},
{
  "_index" : "test_4",
  "_type" : "test_4",
  "_id" : "8cAAWGEBOC31Kjf90X7K",
  "_score" : 17.824947,
  "_source" : {
    "date" : "18-02-02",
    "path" : "/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/nifi-app.log",
    "@timestamp" : "2018-02-02T19:31:44.177Z",
    "EventType" : "ERROR",
    "EventText" : "[Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}",
    "@version" : "1",
    "host" : "hostname",
    "time" : "11:31:36,978",
    "message" : "2018-02-02 11:31:36,978 ERROR [Timer-Driven Process Thread-7] o.a.n.p.a.storage.PutAzureBlobStorage PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288] failed to process due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutAzureBlobStorage[id=117f16f0-113c-1fcd-6a48-d9d99d3cd288]: java.io.IOException; rolling back session: {}",
    "type" : "test_4"
  }
},
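One way to look for a pattern in duplicates like these is to group the search hits by their `message` and list where and when each copy was ingested. The following is only a sketch: the `hits` list is a hand-copied, shortened version of the three hits above (the message text is truncated for readability), and `duplicates_by_message` is a hypothetical helper name, not part of any Elasticsearch client.

```python
from collections import defaultdict

LOG_DIR = "/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/"
# Message text shortened for readability; the real hits carry the full line.
MESSAGE = "2018-02-02 11:31:36,978 ERROR [Timer-Driven Process Thread-7]"

# Hand-copied subset of the fields from the three hits shown above.
hits = [
    {"_id": "IMQcWGEBOC31Kjf9gyWS",
     "_source": {"path": LOG_DIR + "nifi-app_2018-02-02_11.0.log",
                 "@timestamp": "2018-02-02T20:01:59.159Z",
                 "message": MESSAGE}},
    {"_id": "CMEFWGEBOC31Kjf9ZD-n",
     "_source": {"path": LOG_DIR + "nifi-app.log",
                 "@timestamp": "2018-02-02T19:36:43.919Z",
                 "message": MESSAGE}},
    {"_id": "8cAAWGEBOC31Kjf90X7K",
     "_source": {"path": LOG_DIR + "nifi-app.log",
                 "@timestamp": "2018-02-02T19:31:44.177Z",
                 "message": MESSAGE}},
]


def duplicates_by_message(hits):
    """Map each message seen more than once to its sorted (timestamp, path) pairs."""
    grouped = defaultdict(list)
    for hit in hits:
        source = hit["_source"]
        grouped[source["message"]].append((source["@timestamp"], source["path"]))
    return {msg: sorted(seen) for msg, seen in grouped.items() if len(seen) > 1}


for message, seen in duplicates_by_message(hits).items():
    print(message)
    for timestamp, path in seen:
        print("  ", timestamp, path)
```

Since the config has no date filter, `@timestamp` is the time Logstash processed the event. Sorted that way, the same line was ingested from nifi-app.log at 19:31:44 and 19:36:43, and again from the rotated file at 20:01:59.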

Have you perhaps indexed the same lines multiple times?

@magnusbaeck, thanks, I have edited the post.

Okay, but do you have duplication? Which lines? Is there a pattern? Judging solely by your configuration and description of what's going on it's not clear why you'd have duplication. How long does the copying of the log files take? How are the files copied?

@magnusbaeck ,

Thanks.

I have edited the post.

Do you have duplication?
Yes.

Which lines?
I have added one example to the original post.

Is there a pattern?
There are four entries in Elasticsearch. One entry has the path "nifi-app_2018-02-02_11.0.log"; the other three have the path "nifi-app.log". nifi-app.log is the rolling log file.

How long does the copying of the log files take?
It takes less than one minute.

How are the files copied?
A real-time NiFi pipeline copies the logs from the NiFi server to the ELK server.

What probably happens here is that Logstash sees half-copied log files and starts to process them. I suggest you copy the files to whatever.log.new and rename it to whatever.log once the copy operation has completed.
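The copy-then-rename idea can be sketched in a few lines of Python. This is an illustration of the pattern only, not how NiFi itself copies files; `copy_then_rename` is a hypothetical helper, and the demo uses a scratch directory rather than the real log paths. The staging name ends in `.new`, so a `*.log` glob like the one in the input config would not match it while the copy is in progress.

```python
import os
import shutil
import tempfile


def copy_then_rename(src: str, dest: str) -> None:
    """Copy src to dest + ".new", then rename it to dest in a single step."""
    staging = dest + ".new"        # not matched by a "*.log" glob
    shutil.copyfile(src, staging)  # the slow part happens under the staging name
    os.replace(staging, dest)      # atomic when both names are on one filesystem


if __name__ == "__main__":
    # Hypothetical demo in a scratch directory, not the real log paths.
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "source.log")
        dest = os.path.join(workdir, "whatever.log")
        with open(src, "w") as handle:
            handle.write("2018-02-02 11:31:36,978 ERROR example line\n")
        copy_then_rename(src, dest)
        print(sorted(os.listdir(workdir)))
```

The staging file sits next to the destination because the rename is only atomic within one filesystem. Note that each rename gives the destination a new inode, and the file input's sincedb tracks files by inode, so this probably suits completed (rotated) files better than a file that is copied again and again while it is still growing.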
