cris
February 11, 2022, 12:58am
1
Hello community!
I am using Logstash to reindex some data. When I run logstash.service, the number of documents in the new index grows far beyond what I expect. The original index has 3000 documents and I expect about 12000 in the new one, but the count never stops increasing; it looks infinite. However, when I run only
/usr/share/logstash/bin/logstash --path.settings /etc/logstash/ -f /etc/logstash/conf.d/logstash.conf
it works fine. This is my Logstash setup:
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "testingService"
    size => 1
    scroll => "5m"
    docinfo => true
  }
}
filter {
  ruby {
    code => '
      array2 = []
      env1 = event.get("[Spain][Testing][status]")
      env2 = event.get("[USA][Testing][status]")
      env3 = event.get("[Brazil][Testing][status]")
      env4 = event.get("[London][Testing][status]")
      array2 << {"status": env1}
      array2 << {"status": env2}
      array2 << {"status": env3}
      array2 << {"status": env4}
      event.to_hash.keys.each { |k|
        if !(k.start_with?("@", "timelocal")) then
          event.remove(k)
        end
      }
      event.set("[Login]", array2)
    '
  }
  split { field => '[Login]' }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "reindex-data"
  }
  stdout {
    codec => rubydebug
  }
}
Thanks for your time
Tomo_M
(Tomohiro Mitani)
February 11, 2022, 7:33am
2
It may not be infinite; rather, all the documents are indexed again each time the Logstash pipeline runs.
Every time the elasticsearch output plugin sends a new document to the reindex-data index of Elasticsearch, a unique _id for the document is generated. There is no automatic tracking function.
To avoid duplication, you have to either query only the source documents that have not been indexed yet, or use the fingerprint filter plugin and set the fingerprint as the id of the output document so that identical documents are recognized as the same.
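The difference between auto-generated and content-derived ids can be sketched in plain Ruby. This is only an illustration under stated assumptions: the Hash stands in for an Elasticsearch index, `reindex`, `content_id`, and `sample_docs` are hypothetical helpers, and SHA1 over `localTime` and `status` is an arbitrary choice for the example, not the plugin's actual behavior.

```ruby
require "digest"
require "securerandom"

# Hypothetical stand-in for an Elasticsearch index: a Hash keyed by _id.
# Writing to an existing _id overwrites that document instead of adding one.
def reindex(index, docs, &id_for)
  docs.each { |doc| index[id_for.call(doc)] = doc }
  index
end

# Id derived from the document content (field choice is illustrative).
def content_id(doc)
  Digest::SHA1.hexdigest("#{doc['localTime']}|#{doc['status']}")
end

def sample_docs
  [
    { "localTime" => "2022-02-13T13:28:12.947-06:00", "status" => "passed" },
    { "localTime" => "2022-02-14T21:07:40.329-06:00", "status" => "passed" }
  ]
end

# Run the same "pipeline" twice against each index.
auto = {}
stable = {}
2.times do
  reindex(auto, sample_docs) { |_doc| SecureRandom.uuid } # new id on every run
  reindex(stable, sample_docs) { |doc| content_id(doc) }  # same id on every run
end

puts "auto-generated ids: #{auto.size} documents"
puts "content-derived ids: #{stable.size} documents"
```

With random ids every run adds documents, while with content-derived ids a repeated run overwrites what is already there, so the count stays at the number of distinct source documents.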
cris
February 15, 2022, 3:01am
3
I tried the fingerprint filter, but now I only get 1 hit because the id is always the same.
  fingerprint {
    source => ["status"]
    target => "[@metadata][generated_id]"
    concatenate_sources => true
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
    document_id => "%{[@metadata][fingerprint]}"
How can I assign a separate id to each split event?
Tomo_M
(Tomohiro Mitani)
February 15, 2022, 3:11am
4
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "prueba-reindex17"
    document_id => "%{[@metadata][generated_id]}"
cris
February 15, 2022, 3:28am
5
Sorry, I did not see my error. I have now tried with generated_id, but I get the same result: only one hit.
Tomo_M
(Tomohiro Mitani)
February 15, 2022, 3:41am
6
Please share the output of the stdout output plugin with the rubydebug codec.
cris
February 15, 2022, 4:41am
7
This is all I get from rubydebug:
{
"localTime" => "2022-02-13T13:28:12.947-06:00",
"@timestamp" => 2022-02-15T04:39:01.548Z,
"@version" => "1",
"Login" => {
"status" => "passed"
}
}
{
"localTime" => "2022-02-13T13:28:12.947-06:00",
"@timestamp" => 2022-02-15T04:39:01.548Z,
"@version" => "1",
"Login" => {
"status" => "passed"
}
}
{
"localTime" => "2022-02-13T13:28:12.947-06:00",
"@timestamp" => 2022-02-15T04:39:01.548Z,
"@version" => "1",
"Login" => {
"status" => "passed"
}
}
{
"localTime" => "2022-02-13T13:28:12.947-06:00",
"@timestamp" => 2022-02-15T04:39:01.548Z,
"@version" => "1",
"Login" => {
"status" => "passed"
}
}
Tomo_M
(Tomohiro Mitani)
February 15, 2022, 5:10am
8
Because status is identical in every event, the fingerprints are identical too. The fingerprint source should be identical only for the events that you want to deduplicate, so you need to include some more fields.
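A small Ruby sketch of that point: hashing the same input always gives the same digest, so four events with the same status collapse into one id. The `fingerprint` and `unique_ids` helpers are hypothetical, and plain SHA1 over the joined values is an assumption for illustration, not the filter's exact input string.

```ruby
require "digest"

# Illustrative fingerprint: SHA1 over the values of the chosen source keys.
# (An assumption for this sketch, not the plugin's exact algorithm.)
def fingerprint(event, sources)
  Digest::SHA1.hexdigest(sources.map { |key| event[key].to_s }.join("|"))
end

# Distinct ids produced by a list of events for the given sources.
def unique_ids(events, sources)
  events.map { |event| fingerprint(event, sources) }.uniq
end

# Four split events that all carry the same status, as in the rubydebug output.
events = Array.new(4) { { "status" => "passed" } }
puts "unique ids: #{unique_ids(events, ['status']).size}"
```

Since the id decides which document gets written, one unique id means each event overwrites the previous one and only a single hit remains.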
cris
February 15, 2022, 6:31am
9
OK, I added another field in the ruby code, like this:
array2 << {"status": env4, "instance": "Spain"}
array2 << {"status": env4, "instance": "USA"}
array2 << {"status": env4, "instance": "Brazil"}
array2 << {"status": env4, "instance": "London"}
and in the fingerprint filter I changed the source to:
fingerprint {
  source => ["instance"]
  target => "[@metadata][generated_id]"
  concatenate_sources => true
}
but again I get only 1 hit.
This is the rubydebug output:
{
"@timestamp" => 2022-02-15T06:28:14.339Z,
"localTime" => "2022-02-14T21:07:40.329-06:00",
"Login" => {
"status" => "passed",
"instance" => "Spain"
},
"@version" => "1"
}
{
"@timestamp" => 2022-02-15T06:28:14.339Z,
"localTime" => "2022-02-14T21:07:40.329-06:00",
"Login" => {
"status" => "passed",
"instance" => "USA"
},
"@version" => "1"
}
{
"@timestamp" => 2022-02-15T06:28:14.339Z,
"localTime" => "2022-02-14T21:07:40.329-06:00",
"Login" => {
"status" => "passed",
"instance" => "Brazil"
},
"@version" => "1"
}
{
"@timestamp" => 2022-02-15T06:28:14.339Z,
"localTime" => "2022-02-14T21:07:40.329-06:00",
"Login" => {
"status" => "passed",
"instance" => "London"
},
"@version" => "1"
}
Is the source option in the fingerprint filter correct?
Tomo_M
(Tomohiro Mitani)
February 15, 2022, 7:09am
10
Turn on metadata and check whether [@metadata][generated_id] is different for different messages.
output {
  stdout { codec => rubydebug { metadata => true } }
}
Also, if you use "instance" as the source, you'll get only 4 documents in total: one each for Spain, USA, Brazil, and London. Is that your intention? It is up to you to decide which messages count as the same and which as different.
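How the choice of source fields sets the final document count can be sketched in Ruby. Assumptions in this sketch: events are flat hashes rather than nested under Login, `localTime` is unique per source document, and the `fingerprint`, `split_events`, and `document_count` helpers are hypothetical.

```ruby
require "digest"

# Illustrative fingerprint over selected keys (not the plugin's exact algorithm).
def fingerprint(event, sources)
  Digest::SHA1.hexdigest(sources.map { |key| "#{key}=#{event[key]}" }.join("|"))
end

# Build 4 events per source document, one per instance, as the split filter would.
def split_events(local_times)
  instances = %w[Spain USA Brazil London]
  local_times.product(instances).map do |time, instance|
    { "localTime" => time, "instance" => instance, "status" => "passed" }
  end
end

# Number of documents left in the index when the fingerprint is used as the id.
def document_count(events, sources)
  events.map { |event| fingerprint(event, sources) }.uniq.size
end

events = split_events(["2022-02-13T07:38:34.758-06:00",
                       "2022-02-13T13:28:12.947-06:00",
                       "2022-02-14T21:07:40.329-06:00"])

puts "source = instance only:        #{document_count(events, ['instance'])}"
puts "source = localTime + instance: #{document_count(events, %w[localTime instance])}"
```

With only the instance as source, every source document maps onto the same 4 ids. Adding a field that distinguishes source documents keeps 4 ids per source document.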
cris
February 15, 2022, 7:25am
11
Oh yes! I get the same generated_id in all documents:
}
{
"localTime" => "2022-02-13T07:38:34.758-06:00",
"@timestamp" => 2022-02-15T07:16:16.887Z,
"@metadata" => {
"_index" => "testingService",
"_id" => "8ENO834BWwKzUDWhfhep",
"_type" => "_doc",
"generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
},
"@version" => "1",
"Login" => {
"instancia" => "Spain",
"status" => "passed"
}
}
{
"localTime" => "2022-02-13T07:38:34.758-06:00",
"@timestamp" => 2022-02-15T07:16:16.887Z,
"@metadata" => {
"_index" => "testingService",
"_id" => "8ENO834BWwKzUDWhfhep",
"_type" => "_doc",
"generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
},
"@version" => "1",
"Login" => {
"instancia" => "USA",
"status" => "passed"
}
}
{
"localTime" => "2022-02-13T07:38:34.758-06:00",
"@timestamp" => 2022-02-15T07:16:16.887Z,
"@metadata" => {
"_index" => "testingService",
"_id" => "8ENO834BWwKzUDWhfhep",
"_type" => "_doc",
"generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
},
"@version" => "1",
"Login" => {
"instancia" => "Brazil",
"status" => "passed"
}
}
{
"localTime" => "2022-02-13T07:38:34.758-06:00",
"@timestamp" => 2022-02-15T07:16:16.887Z,
"@metadata" => {
"_index" => "testingService",
"_id" => "8ENO834BWwKzUDWhfhep",
"_type" => "_doc",
"generated_id" => "7b20fc900c1341ab65ed59a65fcb535c33059315"
},
"@version" => "1",
"Login" => {
"instancia" => "London",
"status" => "passed"
}
}
I intend to get 4 different hits for each hit in the original index: one hit for Spain, another for USA, and so on.
Tomo_M
(Tomohiro Mitani)
February 15, 2022, 3:45pm
12
First, you have to decide which information you want to deduplicate documents on.
Then set that information as the source of the fingerprint filter. source can be multiple fields.
Something like the following could be the solution:
fingerprint {
  source => ["localTime","instance"]
  target => "[@metadata][generated_id]"
  concatenate_sources => true
}
Badger
February 15, 2022, 5:14pm
13
Try source => [ "[Login][instancia]" ]
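Why the nested reference matters can be sketched in Ruby. In the rubydebug output the value lives under Login, so a top-level name such as "instance" resolves to nothing on every event and all events hash to the same value. The `get_field` resolver and the "|name|value" concatenation below are assumptions made for illustration; they approximate, but are not taken from, the fingerprint filter's implementation.

```ruby
require "digest"

# Minimal resolver for Logstash-style field references such as
# "[Login][instancia]"; a bare name like "instance" is a top-level key.
def get_field(event, ref)
  path = ref.start_with?("[") ? ref.scan(/\[([^\]]+)\]/).flatten : [ref]
  path.reduce(event) { |node, key| node.is_a?(Hash) ? node[key] : nil }
end

# Concatenate "|name|value" per source and hash the result.
# (Assumed behavior for this sketch; a missing field contributes an empty value.)
def fingerprint(event, sources)
  joined = sources.sort.map { |ref| "|#{ref}|#{get_field(event, ref)}" }.join + "|"
  Digest::SHA1.hexdigest(joined)
end

# Four split events shaped like the rubydebug output in the thread.
def split_events
  %w[Spain USA Brazil London].map do |name|
    { "localTime" => "2022-02-13T07:38:34.758-06:00",
      "Login" => { "instancia" => name, "status" => "passed" } }
  end
end

top_level = split_events.map { |e| fingerprint(e, ["instance"]) }.uniq.size
nested    = split_events.map { |e| fingerprint(e, ["[Login][instancia]"]) }.uniq.size
puts "top-level reference: #{top_level} unique id(s)"
puts "nested reference:    #{nested} unique id(s)"
```

Combining a per-document field with the nested reference, for example `["localTime", "[Login][instancia]"]`, would then give a distinct id per source document and instance, assuming localTime differs between source documents.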
cris
February 16, 2022, 12:21am
14
Tomo_M:
"localTime"
@Badger @Tomo_M
Thanks for your help. I solved it by merging both suggestions.