Reindex data is multiplying docs

cris · February 11, 2022, 12:58am

Hello community!
I am using Logstash to reindex some data. When I run logstash.service, the number of documents in the new index keeps growing. The original index has 3000 documents and I expect about 12000 in the new one, but the count never stops increasing. However, when I run only

/usr/share/logstash/bin/logstash --path.settings /etc/logstash/ -f /etc/logstash/conf.d/logstash.conf

it works fine. This is my Logstash setup:

input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "testingService"
    size => 1
    scroll => "5m"
    docinfo => true
  }
}
filter {
  ruby {
    code => '
      # Collect the status of each environment into one array
      array2 = []
      env1 = event.get("[Spain][Testing][status]")
      env2 = event.get("[USA][Testing][status]")
      env3 = event.get("[Brazil][Testing][status]")
      env4 = event.get("[London][Testing][status]")
      array2 << {"status": env1}
      array2 << {"status": env2}
      array2 << {"status": env3}
      array2 << {"status": env4}
      # Drop every field except those starting with "@" or "timelocal"
      event.to_hash.keys.each { |k|
        if !(k.start_with?("@", "timelocal")) then
          event.remove(k)
        end
      }
      event.set("[Login]", array2)
    '
  }
  # Emit one event per element of [Login]
  split { field => '[Login]' }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "reindex-data"
  }
  stdout {
    codec => rubydebug
  }

}

Thanks for your time

Tomo_M · February 11, 2022, 7:33am

It may not be infinite. Rather, each run of the Logstash pipeline indexes all of the documents again.

Every time the elasticsearch output plugin sends a document to the reindex-data index, a new unique _id is generated for it. There is no automatic tracking of what has already been indexed.

To avoid duplication, you have to either query only the source documents that have not been indexed yet, or use the fingerprint filter plugin and set the fingerprint as the id of the output document so that identical documents get the same id.
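The effect of a fingerprint-based id can be sketched in plain Ruby. This is only a toy model, not Logstash or Elasticsearch code: the index is assumed to be a Hash keyed by _id, and `index_doc` and `run_pipeline` are hypothetical helpers made up for illustration.

```ruby
require "digest"
require "securerandom"

# Toy model of an index: a Hash keyed by document _id (assumption for illustration).
def index_doc(index, doc, id: nil)
  # With no explicit id, a fresh random id stands in for an auto-generated _id.
  id ||= SecureRandom.uuid
  index[id] = doc
end

# One "pipeline run" sends every source document to the target index.
def run_pipeline(source, target, use_fingerprint:)
  source.each do |doc|
    # A deterministic id derived from the document content, or none at all.
    id = use_fingerprint ? Digest::SHA1.hexdigest(doc.sort.to_s) : nil
    index_doc(target, doc, id: id)
  end
end

source = [
  { "localTime" => "t1", "status" => "passed" },
  { "localTime" => "t2", "status" => "failed" }
]

auto_ids = {}
fingerprinted = {}
3.times do
  run_pipeline(source, auto_ids, use_fingerprint: false)
  run_pipeline(source, fingerprinted, use_fingerprint: true)
end

puts auto_ids.size       # 6: every run adds new documents
puts fingerprinted.size  # 2: reruns overwrite the same ids
```

With random ids, three runs leave three copies of each document. With content-derived ids, each rerun overwrites the earlier copy.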

cris · February 15, 2022, 3:01am

I tried the fingerprint filter, but now I only get 1 hit because I always have the same id.

fingerprint {
        source => ["status"]
        target => "[@metadata][generated_id]"
        concatenate_sources => true
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
   document_id => "%{[@metadata][fingerprint]}"

How can I assign an id to each split event?

Tomo_M · February 15, 2022, 3:11am

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
   document_id => "%{[@metadata][generated_id]}"

cris · February 15, 2022, 3:28am

Sorry, I did not see my error. I have now tried with generated_id, but I get the same result: only one hit.

Tomo_M · February 15, 2022, 3:41am

Please share the output using stdout output plugin with rubydebug codec.

cris · February 15, 2022, 4:41am

This is all I get from rubydebug:

{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T13:28:12.947-06:00",
    "@timestamp" => 2022-02-15T04:39:01.548Z,
      "@version" => "1",
         "Login" => {
        "status" => "passed"
    }
}

Tomo_M · February 15, 2022, 5:10am

Because status is identical, the fingerprints are identical. The fingerprint source has to be made up of the fields that distinguish the events you want to keep apart. You need some more fields.
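A minimal Ruby sketch of this point follows. It is a simplified model rather than the fingerprint plugin itself: `fingerprint` is a hypothetical helper that hashes the chosen values with SHA1, and the extra "instance" field is an assumed example of a distinguishing field.

```ruby
require "digest"

# Four events shaped like the rubydebug output: same localTime, same status.
events = Array.new(4) do
  { "localTime" => "2022-02-13T13:28:12.947-06:00",
    "Login" => { "status" => "passed" } }
end

# Hypothetical helper: hash the values found at the given key paths.
def fingerprint(event, *paths)
  values = paths.map { |path| event.dig(*path).to_s }
  Digest::SHA1.hexdigest(values.join("|"))
end

by_status = events.map { |e| fingerprint(e, ["Login", "status"]) }
puts by_status.uniq.size  # 1: identical input gives identical hashes

# Add a field whose value differs per event (assumed for illustration).
%w[Spain USA Brazil London].each_with_index do |name, i|
  events[i]["Login"]["instance"] = name
end

by_both = events.map do |e|
  fingerprint(e, ["Login", "status"], ["Login", "instance"])
end
puts by_both.uniq.size    # 4: one distinct hash per event
```

When every event hashes to the same value, each write to the index replaces the previous one, which is why only one hit remains.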

cris · February 15, 2022, 6:31am

OK, I added another field in the ruby code, like this:

array2 << {"status": env1, "instance": "Spain"}
array2 << {"status": env2, "instance": "USA"}
array2 << {"status": env3, "instance": "Brazil"}
array2 << {"status": env4, "instance": "London"}

and in the fingerprint I changed the source to

fingerprint {
        source => ["instance"]
        target => "[@metadata][generated_id]"
        concatenate_sources => true
  }

but again I get only 1 hit.
This is the rubydebug output:

{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "Spain"
    },
      "@version" => "1"
}
{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "USA"
    },
      "@version" => "1"
}
{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "Brazil"
    },
      "@version" => "1"
}
{
    "@timestamp" => 2022-02-15T06:28:14.339Z,
     "localTime" => "2022-02-14T21:07:40.329-06:00",
         "Login" => {
           "status" => "passed",
        "instance" => "London"
    },
      "@version" => "1"
}

Is the source option in fingerprint correct?

Tomo_M · February 15, 2022, 7:09am

Turn on metadata and check whether [@metadata][generated_id] is different for different messages.

output {
  stdout { codec => rubydebug {metadata => true } }
}

And if you use "instance" as the only source, you'll get just 4 documents: one each for Spain, USA, Brazil, and London. Is that your intention? It is up to you to decide which messages count as the same and which as different.

cris · February 15, 2022, 7:25am

Oh yes! I get the same generated_id in all documents:

}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "Spain",
           "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "USA",
           "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "Brazil",
           "status" => "passed"
    }
}
{
     "localTime" => "2022-02-13T07:38:34.758-06:00",
    "@timestamp" => 2022-02-15T07:16:16.887Z,
     "@metadata" => {
              "_index" => "testingService",
                 "_id" => "8ENO834BWwKzUDWhfhep",
               "_type" => "_doc",
        "generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
    },
      "@version" => "1",
         "Login" => {
        "instancia" => "London",
           "status" => "passed"
    }
}

I want to get 4 different hits for each hit in the original index: one hit for Spain, another for USA, etc.

Tomo_M · February 15, 2022, 3:45pm

First, you have to decide which information identifies a document for deduplication.

Then set those fields as the source of the fingerprint.

source can list multiple fields.

Something like the following could be the solution.

fingerprint {
        source => ["localTime","instance"]
        target => "[@metadata][generated_id]"
        concatenate_sources => true
  }

Badger · February 15, 2022, 5:14pm

Try source => [ "[Login][instancia]" ]

cris · February 16, 2022, 12:21am

@Badger @Tomo_M
Thanks for your help. I solved it by merging both suggestions.
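The final configuration is not shown in the thread. Presumably the merge means a source such as ["localTime", "[Login][instancia]"]. The Ruby sketch below is a simplified model of why that combination would work, not the plugin's actual implementation: `lookup` and `fingerprint` are hypothetical helpers, and "t1" and "t2" are placeholder localTime values.

```ruby
require "digest"

# Two source documents, each split into four per-instance events
# (event shape taken from the rubydebug output in the thread).
events = %w[t1 t2].flat_map do |time|
  %w[Spain USA Brazil London].map do |name|
    { "localTime" => time,
      "Login" => { "instancia" => name, "status" => "passed" } }
  end
end

# Hypothetical helper: resolve a reference such as "[Login][instancia]"
# or a bare top-level name such as "localTime".
def lookup(event, ref)
  keys = ref.start_with?("[") ? ref.scan(/\[([^\]]+)\]/).flatten : [ref]
  event.dig(*keys)
end

# Hypothetical helper: hash the referenced values together.
def fingerprint(event, refs)
  Digest::SHA1.hexdigest(refs.map { |r| "#{r}=#{lookup(event, r)}" }.join("|"))
end

top_level = events.map { |e| fingerprint(e, ["instance"]) }
nested    = events.map { |e| fingerprint(e, ["[Login][instancia]"]) }
merged    = events.map { |e| fingerprint(e, ["localTime", "[Login][instancia]"]) }

puts top_level.uniq.size  # 1: no such top-level field, so nothing differs
puts nested.uniq.size     # 4: one id per instance, shared across documents
puts merged.uniq.size     # 8: one id per instance per source document
```

In this model, a source that names a field missing at the top level contributes nothing that differs between events, the nested reference alone separates the instances, and adding localTime also separates the source documents.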

system · March 16, 2022, 12:21am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
