Consuming 3x 2GB text logs

So, I have these huge files. I tried the file input in read mode, but nothing happened. Then I tried piping the content in (like I learned from @dadoonet over at https://www.elastic.co/pt/blog/enriching-your-postal-addresses-with-the-elastic-stack-part-1), but since the files are so huge, I run out of memory before the pipe can even start Logstash.

Any ideas?

Hi @joaociocca

Can you please post the error messages, if any, and format them with the formatter (</>) button?

Also post your Logstash config (formatted using </> as well).

Large files should not really be a problem... Sometimes I just cat off the first 1000 rows or so and get it working before I try the big file.
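Something like this, for example (the file names here are just placeholders for your actual logs; the `seq` line fakes a big file so the snippet is self-contained):

```shell
# 'huge.log' stands in for one of the real 2 GB files; here we fake one.
seq 1 2000 > huge.log
# Take just the first 1000 lines to test the pipeline against.
head -n 1000 huge.log > sample.log
```

Once the pipeline works on `sample.log`, point it at the real file.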

Logstash can take a while to start...

I left it running in read mode for almost an hour, with nothing showing up on the console past Logstash's startup lines.
When I tried the pipe, after some five minutes the process had already eaten up twice my available RAM.

I've just got home. Tomorrow when I get back I'll post the config, but it's a simple file-input, CSV-filter, elasticsearch-output conf.

When building it, I ran some 10 lines through with the rubydebug output, and it was fine.
((edit))

Here's the simplified conf, with comments removed:

input {
  file {
    path => "(correct file path with escaped \ where needed)"
    mode => "read"
  }
}
filter {
  if [message] =~ "^#" {
    drop {}
  }
  mutate {
    gsub => ["message", "-", ""]
  }
  csv {
    columns => ["c-ip","cs-username","c-agent","sc-authenticated","date","time","s-computername","cs-referred","r-host","r-ip","r-port","time-taken","sc-bytes","cs-bytes","cs-protocol","s-operation","cs-uri","cs-mime-type","s-object-source","sc-status","rule","FilterInfo","sc-network","error-info","action","GMT Time","AuthenticationServer","ThreatName","UrlCategory","MalwareInspectionContentDeliveryMethod","MalwareInspectionDuration","internal-service-info","NIS application protocol","UrlCategorizationReason","SessionType","UrlDestHost","s-port","SoftBlockAction"]
    separator => "	"
  }
  date {
    match => [ "GMT Time", "YYYYMMdd HH:mm:ss" ]
    timezone => "America/Sao_Paulo"
  }
  if [bytesSent] {
    ruby {
      # Logstash 5+ requires the event.get/event.set API instead of event['field']
      code => "event.set('kilobytesSent', event.get('bytesSent').to_i / 1024.0)"
    }
  }
  if [bytesReceived] {
    ruby {
      code => "event.set('kilobytesReceived', event.get('bytesReceived').to_i / 1024.0)"
    }
  }
  mutate {
    convert => ["bytesSent", "integer"]
    convert => ["bytesReceived", "integer"]
    convert => ["timetaken", "integer"]
    add_field => { "clientHostname" => "%{r-ip}" }
    remove_field => [ "GMT Time" ]
  }
  dns {
    action => "replace"
    reverse => ["clientHostname"]
  }
  useragent {
    source => "useragent"
    prefix => "browser"
  }
}
output {
  elasticsearch {
    hosts => ["http://ip:9200"]
    index => "index-%{+YYYY.MM.dd}"
  }
  # stdout { codec => rubydebug }
}

As I've said, testing with rubydebug stdout on the first 5-20 lines of each file shows the correct result being processed.

I just started it again and here's the log:

Sending Logstash logs to /logstash-6.8.0/logs which is now configured via log4j2.properties
[2019-10-16T17:19:22,398][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2019-10-16T17:19:22,460][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.8.0"}
[2019-10-16T17:19:41,384][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2019-10-16T17:19:42,558][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://ip:9200/]}}
[2019-10-16T17:19:43,149][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://ip:9200/"}
[2019-10-16T17:19:43,320][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2019-10-16T17:19:43,336][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2019-10-16T17:19:43,399][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["http://ip:9200"]}
[2019-10-16T17:19:43,414][INFO ][logstash.outputs.elasticsearch] Using default mapping template
[2019-10-16T17:19:43,617][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2019-10-16T17:19:46,625][INFO ][logstash.inputs.file     ] No sincedb_path set, generating one based on the "path" setting {:sincedb_path=>"/logstash-6.8.0/data/plugins/inputs/file/.sincedb_490c48491a0fc19c7297104da7cfc991", :path=>["double-escaped path"]}
[2019-10-16T17:19:46,719][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x42c201c8 run>"}
[2019-10-16T17:19:46,872][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2019-10-16T17:19:46,872][INFO ][filewatch.observingread  ] START, creating Discoverer, Watch with file and sincedb collections
[2019-10-16T17:19:47,814][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
[2019-10-16T17:20:57,124][WARN ][logstash.runner          ] SIGINT received. Shutting down.
[2019-10-16T17:20:57,327][INFO ][filewatch.observingread  ] QUIT - closing all files and shutting down.
[2019-10-16T17:20:57,749][INFO ][logstash.pipeline        ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x42c201c8 run>"}
[2019-10-16T17:20:57,749][INFO ][logstash.runner          ] Logstash shut down.

And I just noticed the "double-escaped path". Thinking that could be the problem, I changed the path to a non-escaped string and started again. The log looks just the same, except the path => is now "normally" escaped. It started at [2019-10-16T17:22:02,449][WARN, ran up to [2019-10-16T17:22:20,949][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}, and is just sitting there doing nothing.

Procexp's performance graph confirms that Logstash is just sitting there, in stand-by:

((edit))
30 minutes later, still nothing.

Interesting... a couple of thoughts/questions.

Can I assume this is on windows?

So perhaps, to test, just make a very simple read/write conf.

If so, you can use forward slashes in Windows paths.
Also, you can set sincedb_path to NUL.
You might want to clean out your sincedb:
/logstash-6.8.0/data/plugins/inputs/file/.sincedb*
And perhaps let's just use the default tail mode to test.
Then just run it and it should start streaming lines.

input {
  file {
    path => "C:/Users/sbrown/elastic/logstash-6.8.0/NOTICE.TXT"
    start_position => "beginning"
    sincedb_path => "NUL"
  }
}

output {
  stdout { codec => rubydebug }
}

I am also curious whether the file is compressed or has an unusual newline delimiter.

Get that to work, then go back and start setting the other settings/config.


I owe you a beer, @stephenb! =D

Changing backslashes to forward slashes did the trick!

I thought, from the docs I read, that start_position => beginning was the default for mode => read... is mode not needed, then?

((edit))
Sorry, forgot to answer your curiosity! The files aren't compressed, and no weird newline delimiter is being used!

Just over 3 minutes running, 300k documents indexed and counting.


Glad I could help.

That's my standard debug config.

Now you can go back and try read mode. I think you do want to use read mode: it will end when it's finished with the file; otherwise it'll hang at the end looking for new lines.
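A minimal read-mode sketch, with hypothetical paths (one caveat from the plugin docs: in read mode, file_completed_action defaults to delete, so set it to log unless you actually want the source files removed after ingestion):

```conf
input {
  file {
    path => "C:/logs/huge-*.log"            # hypothetical path; forward slashes on Windows
    mode => "read"
    sincedb_path => "NUL"
    file_completed_action => "log"          # default in read mode is "delete"!
    file_completed_log_path => "C:/logs/read-completed.log"
  }
}
```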

Hmm, the default setting for start_position seems to be end.
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-start_position


https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-mode

mode
Value can be either tail or read.
The default value is tail.
What mode do you want the file input to operate in. Tail a few files or read many content-complete files. Read mode now supports gzip file processing. If "read" is specified then the following other settings are ignored:

start_position (files are always read from the beginning)
close_older (files are automatically closed when EOF is reached)

start_position is ignored in read mode... maybe that's where I got the idea that it was the same as beginning for read mode.


Ahh, makes sense... I missed that. Good to know, thanks!


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.