Reindex data is multiplying docs

Hello community!
I am working with Logstash to reindex some data, and I have a problem: when I run logstash.service, the number of documents in the new index grows without limit. The original index has 3000 documents and I expect around 12000 in the new one, but the count never stops increasing, as if it were infinite. But when I run only

/usr/share/logstash/bin/logstash --path.settings /etc/logstash/ -f /etc/logstash/conf.d/logstash.conf

it works fine. This is my Logstash setup:

input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "testingService"
    size => 1
    scroll => "5m"
    docinfo => true
  }
}
filter {
  ruby {
    code => '
      array2 = []
      env1 = event.get("[Spain][Testing][status]")
      env2 = event.get("[USA][Testing][status]")
      env3 = event.get("[Brazil][Testing][status]")
      env4 = event.get("[London][Testing][status]")
      array2 << {"status" => env1}
      array2 << {"status" => env2}
      array2 << {"status" => env3}
      array2 << {"status" => env4}
      event.to_hash.keys.each { |k|
        if !(k.start_with?("@", "localTime"))
          event.remove(k)
        end
      }
      event.set("[Login]", array2)
    '
  }
  split { field => "[Login]" }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "reindex-data"
  }
  stdout {
    codec => rubydebug
  }
}
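For context, the ruby + split combination above is what multiplies each source document by four. A standalone sketch in plain Ruby (outside Logstash; the sample data is invented) of what the filter builds before split runs:

```ruby
# One source document holds a status per region under [Region][Testing][status].
# The ruby filter collects them into an array, and split then emits one event
# per array element, so 3000 source documents yield 12000 output events.
source_doc = {
  "Spain"  => { "Testing" => { "status" => "passed" } },
  "USA"    => { "Testing" => { "status" => "failed" } },
  "Brazil" => { "Testing" => { "status" => "passed" } },
  "London" => { "Testing" => { "status" => "passed" } },
}

login = ["Spain", "USA", "Brazil", "London"].map do |instance|
  { "status" => source_doc[instance]["Testing"]["status"] }
end

# split { field => '[Login]' } would turn this one document into 4 events.
puts login.length  # => 4
```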

Thanks for your time

It may not be infinite; rather, the whole set of documents gets indexed again every time the Logstash pipeline runs.

Every time the elasticsearch output plugin sends a new document to the reindex-data index, Elasticsearch generates a unique _id for it. There is no automatic duplicate tracking.

To avoid duplication, you have to either query only the source documents that have not been indexed yet, or use the fingerprint filter plugin and set the fingerprint as the id of the output document, so that identical documents can be identified.
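A hedged sketch of that second approach (the source field here is only an example; pick whatever fields identify a document in your data): derive a stable id from the document content and hand it to the elasticsearch output, so re-running the pipeline overwrites existing documents instead of appending new ones.

```
filter {
  fingerprint {
    # Fields that uniquely identify a document; adjust to your data.
    source => ["localTime"]
    target => "[@metadata][generated_id]"
    method => "SHA1"
    concatenate_sources => true
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "reindex-data"
    # Same fingerprint => same _id => overwrite instead of duplicate.
    document_id => "%{[@metadata][generated_id]}"
  }
}
```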

I was trying the fingerprint filter, but now I only get 1 hit because I always have the same id.

fingerprint {
    source => ["status"]
    target => "[@metadata][generated_id]"
    concatenate_sources => true
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
    document_id => "%{[@metadata][fingerprint]}"
  }
}

How can I assign an id to each split event?

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
    document_id => "%{[@metadata][generated_id]}"
  }
}

Sorry, I did not see my error. I tried again with generated_id, but I get the same result, only one hit.

Please share the output using the stdout output plugin with the rubydebug codec.

This is all I get on rubydebug:

{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}

Because status is identical across events, the fingerprints are identical. The fingerprint source should only be identical for events that you want to deduplicate. You need some more fields.
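A simplified illustration of that point in plain Ruby (the fingerprint plugin's exact hashing differs in details, but with method SHA1 the principle is the same): equal source values always produce equal fingerprints, so an extra distinguishing field is needed to separate the events.

```ruby
require "digest"

# Every event with status "passed" hashes to the same fingerprint...
a = Digest::SHA1.hexdigest("passed")
b = Digest::SHA1.hexdigest("passed")

# ...while adding a distinguishing field (here the instance) changes it.
c = Digest::SHA1.hexdigest("passed|Spain")

puts a == b  # true  -> all "passed" events would share one _id
puts a == c  # false -> extra source fields separate the events
```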

OK, I added another field in the ruby code, like this:

array2 << {"status" => env1, "instance" => "Spain"}
array2 << {"status" => env2, "instance" => "USA"}
array2 << {"status" => env3, "instance" => "Brazil"}
array2 << {"status" => env4, "instance" => "London"}

and in the fingerprint I changed the source to:

fingerprint {
    source => ["instance"]
    target => "[@metadata][generated_id]"
    concatenate_sources => true
}

but again I only get 1 hit :frowning:
This is the rubydebug output:

{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "Spain"
    },
      "@version" => "1"
}
{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "USA"
    },
      "@version" => "1"
}
{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "Brazil"
    },
      "@version" => "1"
}
{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "London"
    },
      "@version" => "1"
}

Is the source option in the fingerprint correct :thinking:?

Turn on metadata and check that [@metadata][generated_id] is different for different messages.

output {
  stdout { codec => rubydebug {metadata => true } }
}

And if you use "instance" as the source, you'll get only 4 documents: one each for Spain, USA, Brazil, and London. Is that your intention? It is up to you to decide which messages are the same and which are different.

Oh yes! I get the same generated_id in all documents:

{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "Spain",
           "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "USA",
           "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "Brazil",
           "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "London",
           "status" => "passed"
    }
}

I intend to get 4 different hits for each hit in the original index: one hit for Spain, another for USA... etc.

FIRST, you have to decide what information you want to deduplicate documents on.

Then set that information as the source of the fingerprint.

source can be multiple fields.

Something like the following could be the solution.

fingerprint {
    source => ["localTime", "instance"]
    target => "[@metadata][generated_id]"
    concatenate_sources => true
}

Try source => [ "[Login][instancia]" ]


@Badger @Tomo_M
Thanks for your help, I solved it by merging both suggestions.
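For future readers: merging both suggestions presumably means a multi-field source (per Tomo_M) combined with the post-split field path (per Badger). A hypothetical reconstruction of the final filter and output (field names taken from the rubydebug output above):

```
fingerprint {
    # localTime distinguishes source documents,
    # [Login][instancia] distinguishes the split events.
    source => ["localTime", "[Login][instancia]"]
    target => "[@metadata][generated_id]"
    concatenate_sources => true
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
    document_id => "%{[@metadata][generated_id]}"
  }
}
```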


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.