Logstash parsing same contents again and again

Hi All,
I am having a weird problem: Logstash is parsing and indexing the same contents of a file over and over again after a certain time period. The sincedb file is not being created, and if I let Logstash run for days the number of entries keeps multiplying.

Please help, what could be the problem?

Regards
Tahir

Do I understand it correctly: are you using Logstash to read the file?

I currently use Filebeat only and avoid duplicate entries by creating a SHA-256 hash from the original message (with the fingerprint filter) and using it as the unique document_id.
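
Something like this (a minimal sketch, not my exact config; the key value is a placeholder you pick yourself, and depending on the fingerprint plugin version it may be optional):

filter {
  fingerprint {
    source => "message"                    # hash the original message
    method => "SHA256"
    key => "change-me"                     # placeholder HMAC key
    target => "[@metadata][fingerprint]"   # keep the hash out of the indexed document
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    document_id => "%{[@metadata][fingerprint]}"  # identical events overwrite instead of duplicating
  }
}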

What does your config look like?

Here is my config file:

input {
  file {
    path => "/home/xsalllowed/Desktop/xxxxx/data/a/y.part"
    start_position => "beginning"
    sincedb_path => "/var/null"
    max_open_files => 400
    sincedb_write_interval => 2
  }
}
filter {
  csv {
    separator => ":"
    columns => ["email","password"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "breachcompilation"
  }
  stdout {}
}

sincedb_path => "/var/null"

What's this supposed to mean? Set it to a reasonable path that is writable, or leave it unset.
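
For example (the exact path is just an illustration; any location the Logstash user can write to will do):

sincedb_path => "/var/lib/logstash/sincedb_y_part"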

It doesn't matter which value I set for sincedb_path, or whether I leave it unset: I end up getting the same data again after 7 minutes.

I just saw something more in my logs. There is a new error that appeared when I tried to feed it large files for checking. Is there any specific reason for this ERROR:

[2018-03-31T22:51:30,044][INFO ][logstash.agent ] Pipelines running {:count=>1, :pipelines=>["main"]}
[2018-03-31T22:51:36,533][ERROR][org.logstash.Logstash ] java.lang.OutOfMemoryError: Java heap space
[2018-03-31T22:51:51,666][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/usr/share/logstash/modules/fb_apache/configuration"}
[2018-03-31T22:51:51,674][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/usr/share/logstash/modules/netflow/configuration"}
[2018-03-31T22:51:51,936][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-03-31T22:51:52,063][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.2.3"}
[2018-03-31T22:51:52,165][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2018-03-31T22:51:52,825][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-03-31T22:51:53,088][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2018-03-31T22:51:53,090][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
[2018-03-31T22:51:53,196][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"}
[2018-03-31T22:51:53,241][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2018-03-31T22:51:53,241][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the type event field won't be used to determine the document _type {:es_version=>6}
[2018-03-31T22:51:53,243][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2018-03-31T22:51:53,247][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-03-31T22:51:53,255][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["http://localhost:9200"]}
[2018-03-31T22:51:53,447][INFO ][logstash.pipeline ] Pipeline started succesfully {:pipeline_id=>"main", :thread=>"#<Thread:0x16ccb605@/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:246 sleep>"}
[2018-03-31T22:51:53,466][INFO ][logstash.agent ] Pipelines running {:count=>1, :pipelines=>["main"]}
[2018-03-31T22:52:00,334][ERROR][org.logstash.Logstash ] java.lang.OutOfMemoryError: Java heap space
[2018-03-31T22:52:18,770][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/usr/share/logstash/modules/fb_apache/configuration"}
[2018-03-31T22:52:18,777][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/usr/share/logstash/modules/netflow/configuration"}
[2018-03-31T22:52:19,173][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-03-31T22:52:19,288][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.2.3"}
[2018-03-31T22:52:19,375][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2018-03-31T22:52:20,076][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-03-31T22:52:20,291][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2018-03-31T22:52:20,294][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
[2018-03-31T22:52:20,400][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"}
[2018-03-31T22:52:20,445][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2018-03-31T22:52:20,447][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the type event field won't be used to determine the document _type {:es_version=>6}
[2018-03-31T22:52:20,454][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2018-03-31T22:52:20,458][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-03-31T22:52:20,472][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["http://localhost:9200"]}
[2018-03-31T22:52:20,829][INFO ][logstash.pipeline ] Pipeline started succesfully {:pipeline_id=>"main", :thread=>"#<Thread:0x7cfe2529@/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:246 sleep>"}

What did you name this config file?

The Logstash JVM (process) ran out of memory. This can happen when it is working on large data and doesn't have enough heap to process it. Increase the heap size in the <logstash_install_dir>/config/jvm.options file; e.g. -Xmx2g sets the maximum to 2 GB. Setting -Xms (initial heap size) and -Xmx (maximum heap size) to the same value is generally recommended.
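
For example, in jvm.options (2 GB is just a starting point; size it to your data and machine):

# <logstash_install_dir>/config/jvm.options
-Xms2g   # initial heap size
-Xmx2g   # maximum heap size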

I presume you are using the file input?

The file input uses the sincedb to keep track of what it has read, so that it can avoid reprocessing the same messages over and over again.

What does your configuration look like for the file input? Does the user your process runs as have read and write access to the place where it is trying to keep its sincedb?
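
A quick way to check (assuming a package install running as the logstash user; with recent file input versions the default sincedb lives under path.data, typically /var/lib/logstash, while older versions put a hidden .sincedb_* file in the user's home directory):

sudo -u logstash ls -la /var/lib/logstash/plugins/inputs/file/
sudo -u logstash touch /var/lib/logstash/sincedb_test   # fails if the directory isn't writable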

Are the files Logstash needs to read from on a network volume or other secondary mount? The sincedb uses the actual inode reference for tracking position (not just the path), so remounting the partition or rewriting the files (even if with identical contents) can cause the file paths to point to new inodes, which prevents Logstash from reliably remembering position.
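
You can check for this by watching the inode (the first column below); if it changes after a remount or a rewrite, Logstash sees what looks like a new file:

ls -li /home/xsalllowed/Desktop/xxxxx/data/a/y.part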

My guess is that your JVM is running out of heap before Logstash has had the opportunity to write the sincedb file and record how much it has processed. Fix the JVM problem and the rest will probably be fine.

Thanks a lot, Atira.
Your solution worked, but after some time it started re-indexing the same data over and over again. I wonder what the issue could be. Since I have not set the sincedb_path, I can't check and confirm whether it is being written or not.
