Consuming 3x 2GB text logs

So, I have these huge files. I tried using the file input in read mode, but nothing happened. Then I tried piping the content in (like I learned from @dadoonet here: https://www.elastic.co/pt/blog/enriching-your-postal-addresses-with-the-elastic-stack-part-1), but since the files are so huge I run out of memory even before the pipe can start Logstash.

Any ideas?

Hi @joaociocca

Can you please post the error messages, if any, and format them with the </> formatter button?

Please post your Logstash config as well (formatted using </>).

Large files should not really be a problem... Sometimes I just cat off the first 1000 rows or so and get it working before I try the big file.
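For example, something like this throwaway conf, fed a sample from the shell (the file names here are just placeholders; keep your real filters in the middle):

input {
  stdin { }                        # reads whatever you pipe in
}
output {
  stdout { codec => rubydebug }    # prints the parsed events so you can eyeball them
}

and then something along the lines of head -n 1000 yourbig.log | bin/logstash -f sample-test.conf to feed it the first thousand lines.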

Logstash can take a while to start...

I left it running in read mode for almost an hour, with nothing showing up on the console past Logstash's startup lines.
When I tried the pipe, after some five minutes the process had already eaten up twice my available RAM.

I've just got home; tomorrow, when I get back, I'll post the config, but it's a simple file input, CSV filter, Elasticsearch output conf.

When building the conf, I ran some 10 lines through with rubydebug output, and it was fine.
((edit))

here's the conf, simplified and with the comments stripped:

input {  
  file {
    path => "(correct file path with escaped \ where needed)"
    mode => ["read"]
  }
}
filter {  
  if [message] =~ "^#" {
    drop {}
  }
  mutate {
    gsub => ["message","-",""]
  }
  csv {
    columns => ["c-ip","cs-username","c-agent","sc-authenticated","date","time","s-computername","cs-referred","r-host","r-ip","r-port","time-taken","sc-bytes","cs-bytes","cs-protocol","s-operation","cs-uri","cs-mime-type","s-object-source","sc-status","rule","FilterInfo","sc-network","error-info","action","GMT Time","AuthenticationServer","ThreatName","UrlCategory","MalwareInspectionContentDeliveryMethod","MalwareInspectionDuration","internal-service-info","NIS application protocol","UrlCategorizationReason","SessionType","UrlDestHost","s-port","SoftBlockAction"]
    separator => "	"
  }
  date {
    match => [ "GMT Time", "YYYYMMdd HH:mm:ss" ]
    timezone => "America/Sao_Paulo"
  }
  if [bytesSent] {
    ruby {
      code => "event['kilobytesSent'] = event['bytesSent'].to_i / 1024.0"
    }
  }
  if [bytesReceived] {
    ruby {
      code => "event['kilobytesReceived'] = event['bytesReceived'].to_i / 1024.0"
    }
  }
  mutate {
    convert => ["bytesSent", "integer"]
    convert => ["bytesReceived", "integer"]
    convert => ["timetaken", "integer"]
    add_field => { "clientHostname" => "%{r-ip}" }
    remove_field => [ "GMT Time"]
  }
  dns {
    action => "replace"
    reverse => ["clientHostname"]
  }
  useragent {
    source => "useragent"
    prefix => "browser"
  }
}
output {  
  elasticsearch{
    hosts => ["http://ip:9200"]
    index => "index-%{+YYYY.MM.dd}"
  }
  # stdout {codec => rubydebug}
}

As I said, testing with rubydebug output on stdout with the first 5-20 lines of each file shows the correct results being processed.

I just started it again and here's the log:

Sending Logstash logs to /logstash-6.8.0/logs which is now configured via log4j2.properties
[2019-10-16T17:19:22,398][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2019-10-16T17:19:22,460][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.8.0"}
[2019-10-16T17:19:41,384][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2019-10-16T17:19:42,558][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://ip:9200/]}}
[2019-10-16T17:19:43,149][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://ip:9200/"}
[2019-10-16T17:19:43,320][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2019-10-16T17:19:43,336][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2019-10-16T17:19:43,399][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["http://ip:9200"]}
[2019-10-16T17:19:43,414][INFO ][logstash.outputs.elasticsearch] Using default mapping template
[2019-10-16T17:19:43,617][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2019-10-16T17:19:46,625][INFO ][logstash.inputs.file     ] No sincedb_path set, generating one based on the "path" setting {:sincedb_path=>"/logstash-6.8.0/data/plugins/inputs/file/.sincedb_490c48491a0fc19c7297104da7cfc991", :path=>["double-escaped path"]}
[2019-10-16T17:19:46,719][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x42c201c8 run>"}
[2019-10-16T17:19:46,872][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2019-10-16T17:19:46,872][INFO ][filewatch.observingread  ] START, creating Discoverer, Watch with file and sincedb collections
[2019-10-16T17:19:47,814][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
[2019-10-16T17:20:57,124][WARN ][logstash.runner          ] SIGINT received. Shutting down.
[2019-10-16T17:20:57,327][INFO ][filewatch.observingread  ] QUIT - closing all files and shutting down.
[2019-10-16T17:20:57,749][INFO ][logstash.pipeline        ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x42c201c8 run>"}
[2019-10-16T17:20:57,749][INFO ][logstash.runner          ] Logstash shut down.

And I just noticed the "double escaped path" in that log line. Thinking that this could be the problem, I changed the path to a non-escaped string and started again. The log looks just the same, except for the path => now being "normally" escaped. It started at [2019-10-16T17:22:02,449][WARN, ran up to [2019-10-16T17:22:20,949][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}, and is just sitting there doing nothing.
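(For clarity, by escaped vs. non-escaped I mean something like the following; these are illustrative paths only, not my real ones:)

path => "C:\\logs\\proxy\\*.log"   # backslashes escaped in the conf; the startup log shows them doubled
path => "C:\logs\proxy\*.log"      # plain backslashes; the log shows them "normally" escaped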

Process Explorer's performance graph confirms that Logstash is just sitting there, in standby.

((edit))
30 minutes later, still nothing.

Interesting... a couple of thoughts / questions:

Can I assume this is on Windows?

So, to test, perhaps just make a very simple read / write conf.

If so, you can use forward slashes in the Windows path.
Also, you can set sincedb_path to NUL.
You might want to clean out your sincedb files first:
/logstash-6.8.0/data/plugins/inputs/file/.sincedb*
And perhaps let's just use the default tail mode to test.
Then just run it and it should start streaming lines.

input {
  file {
    path => "C:/Users/sbrown/elastic/logstash-6.8.0/NOTICE.TXT"
    start_position => "beginning"
    sincedb_path => "NUL"
  }
}

output {
  stdout { codec => rubydebug }
}

I'm also curious whether the file is compressed or has an unusual newline delimiter.

Get that working, then go back and start adding the other settings / config.

I owe you a beer, @stephenb! =D

Changing the backslashes to forward slashes did the trick!

I thought, from the docs I read, that start_position => beginning was the default for mode => read... is mode not needed, then?

((edit))
Sorry, I forgot to answer your question! The files aren't compressed, and no weird newline delimiter is being used!

Just over 3 minutes running, 300k documents indexed and counting.

Glad I could help.

That's my standard debug config.

Now you can go back and try read mode. I think you do want read mode: it will finish when it reaches the end of the file, whereas tail mode will hang at the end waiting for new lines.
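Something like this, just as a rough sketch (the path is an example, swap in your own, and keep your filter / output sections as they are):

input {
  file {
    path => "C:/logs/proxy/*.log"   # example path only; note the forward slashes
    mode => "read"
    sincedb_path => "NUL"           # keep test runs repeatable; don't persist read positions
    # worth checking file_completed_action too; if I remember right, read mode
    # deletes each source file once it has finished with it unless you change that
  }
}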

Hmm, the default setting for start_position seems to be end:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-start_position

mode
Value can be either tail or read.
The default value is tail.
What mode do you want the file input to operate in. Tail a few files or read many content-complete files. Read mode now supports gzip file processing. If "read" is specified then the following other settings are ignored:

start_position (files are always read from the beginning)
close_older (files are automatically closed when EOF is reached)

start_position is ignored in read mode... maybe that's where I got the idea that it was the same as beginning for read mode.

Ahh, makes sense... I missed that. Good to know, thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.