Logstash 8.6 low performance

Hi there
I have a server with Linux Ubuntu 20.04 and ELK 8.6.
I noticed that the ingestion process became slow, and I have not changed any parameters.
This is the conf file for the index.

input {
        file {
                path => "/opt/trabaja/csv/trx_hours*.csv"
                start_position => "beginning"
                sincedb_path=> "NULL"
                mode => "read"
                file_completed_action => "delete"
                file_sort_by => "path"
                exit_after_read => true
             }
        }
filter {
               csv { separator  => ";"
              columns => ["nodo","fecha","usuario","serviceid","sesid","base","sp","trn","ssn","tiempo","tiempo_sp","observacion"]}

mutate  { remove_field => [ "message", "@version","host","[log][file][path]" ] }
mutate { convert => [ "fecha", "string" ]  }
mutate { convert => [ "usuario", "string" ]  }
mutate { convert => [ "serviceid", "string" ]  }
mutate { convert => [ "sesid", "string" ]  }
mutate { convert => [ "base", "string" ]  }
mutate { convert => [ "sp", "string" ]  }
mutate { convert => [ "trn", "float" ]  }
mutate { convert => [ "ssn", "float" ]  }
mutate { convert => [ "tiempo", "float" ]  }
mutate { convert => [ "tiempo_sp", "float" ]  }
mutate { convert => [ "observacion", "string" ]  }
mutate { add_field => { "fecha_dia" => "%{fecha}" } }
date {
match => [ "fecha_dia", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
timezone => "America/Argentina/Buenos_Aires"
target => "@timestamp"
}
mutate  { remove_field => [ "fecha_dia" ]}
}

output{
    elasticsearch {  hosts => ["localhost:9200"]
                     index => "trx_hours_new"
                     user => "elastic"
                     password => "Accusys123*"
                     retry_on_conflict => 0 }
                     stdout { }
    }

Logstash jvm.options heap is set to 8 GB.
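For reference, the heap lines in jvm.options look roughly like this (a sketch of my settings):

# Logstash jvm.options: initial and maximum heap set to the same value
-Xms8g
-Xmx8g
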
Any help is appreciated.
Thanks in advance.

Since you are using Linux, set sincedb_path => "/dev/null". The value sincedb_path => "NUL" is for Windows. Your .conf looks simple, so processing should be fast.
Can you provide more details:

  • How slow is it, i.e. how many messages/events are processed per minute?
  • Does the /opt/trabaja/csv/ directory contain a lot of files?
  • What are the file sizes?
  • You are deleting files after reading, am I right?
  • Why do you need stdout { }? It's for debugging and consumes resources.
  • Any particular reason for retry_on_conflict => 0?
  • Have you checked with the API or in Kibana Stack Monitoring which plugin consumes the most resources?
  • How slow is it, i.e. how many messages/events are processed per minute?
    60k per minute.
  • Does the /opt/trabaja/csv/ directory contain a lot of files?
    Yes, with different sizes; the smallest is 8 MB and the biggest is 1.7 GB.
  • You are deleting files after reading, am I right?
    Yes, the files are deleted.
  • Why do you need stdout { }? It's for debugging and consumes resources.
    I will take it out.
  • Any particular reason for retry_on_conflict => 0?
    None, I will delete that line.
  • Have you checked with the API or in Kibana Stack Monitoring which plugin consumes the most resources?
    Could you give more hints about Kibana? Which plugin do I have to use for monitoring?

Thanks for your time and recommendations.


Following your recommendations, this is the .conf file:

input {
        file {
                path => "/opt/trabaja/csv/trx_hours*.csv"
                start_position => "beginning"
                sincedb_path=> "NULL"
                mode => "read"
                file_completed_action => "delete"
                file_sort_by => "path"
                exit_after_read => true
             }
        }
filter {
               csv { separator  => ";"
              columns => ["nodo","fecha","usuario","serviceid","sesid","base","sp","trn","ssn","tiempo","tiempo_sp","observacion"]}

mutate  { remove_field => [ "message", "@version","host","[log][file][path]" ] }
mutate { convert => [ "fecha", "string" ]  }
mutate { convert => [ "usuario", "string" ]  }
mutate { convert => [ "serviceid", "string" ]  }
mutate { convert => [ "sesid", "string" ]  }
mutate { convert => [ "base", "string" ]  }
mutate { convert => [ "sp", "string" ]  }
mutate { convert => [ "trn", "float" ]  }
mutate { convert => [ "ssn", "float" ]  }
mutate { convert => [ "tiempo", "float" ]  }
mutate { convert => [ "tiempo_sp", "float" ]  }
mutate { convert => [ "observacion", "string" ]  }
mutate { add_field => { "fecha_dia" => "%{fecha}" } }
date {
match => [ "fecha_dia", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
timezone => "America/Argentina/Buenos_Aires"
target => "@timestamp"
}
mutate  { remove_field => [ "fecha_dia" ]}
}

output{
    elasticsearch {  hosts => ["localhost:9200"]
                     index => "trx_hours_new"
                     user => "elastic"
                     password => "Accusys123*"
                     action =>"index"       }
    }

Now it is indexing 388,000 lines per minute.
I really appreciate you sharing your great experience.
Thanks

Out of more than a hundred replies, this is the first time someone has clearly responded to my questions.
Dear Elastic team members, can you please provide a gift to him? A t-shirt, mug, pen, anything?

Back to the topic. The main cause of slow processing is the debug output. Your .conf is not complex. I have removed the lines that convert fields to string (string is already the default type, so they are not needed AFAIK), removed a few other lines, and removed fecha_dia, since you only need it for the date conversion. Also try the dissect plugin; maaaaaybe you will get a little bit more performance. If you don't need the event field, you can remove it.
In the file plugin, you should use:

  • Windows: sincedb_path => "NUL"
  • Linux: sincedb_path => "/dev/null", in your case.
    You can also set a real path and file in sincedb_path to track which files have been processed, as sketched below. It's up to you.
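
For example, a persistent sincedb could look like this (the path is only an illustration, use whatever location suits you):

# hypothetical persistent sincedb file, kept across Logstash restarts
sincedb_path => "/var/lib/logstash/sincedb_trx_hours"

With /dev/null nothing is persisted, which is what the config below uses:
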
input {
  file {
                path => "/opt/trabaja/csv/trx_hours*.csv"
                start_position => "beginning"
                sincedb_path => "/dev/null"
                mode => "read"
                file_completed_action => "delete"
                file_sort_by => "path"
                exit_after_read => true
  }
}
		
filter {
  csv { separator  => ";"
       columns => ["nodo","fecha","usuario","serviceid","sesid","base","sp","trn","ssn","tiempo","tiempo_sp","observacion"]
  }
  
  # dissect {
  #   mapping => { "message" => "%{nodo};%{fecha};%{usuario};%{serviceid};%{sesid};%{base};%{sp};%{trn};%{ssn};%{tiempo};%{tiempo_sp};%{observacion}" }
  # }
   
mutate  { remove_field => [ "message", "host","log", "event" ] }

mutate { convert => [ "trn", "float" ]  }
mutate { convert => [ "ssn", "float" ]  }
mutate { convert => [ "tiempo", "float" ]  }
mutate { convert => [ "tiempo_sp", "float" ]  }

  date {
	match => [ "fecha", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
	timezone => "America/Argentina/Buenos_Aires"
	# target => "@timestamp" # not needed, @timestamp is the default destination field.
  }
}

If you would like to use only the CSV plugin, you can do the conversion inside csv:

csv { 
  separator  => ";"
  convert => {
          "trn" => "float"
          "ssn" => "float"
          "tiempo" => "float"
          "tiempo_sp" => "float" 
  }
  columns => [... 
}

LS can provide processing info; you can find more details in the Logstash monitoring documentation.
For LS < 8.x you can use the monitoring integrated into LS; add this to logstash.yml:

xpack.monitoring.enabled: true
xpack.monitoring.collection.interval: 5s
xpack.monitoring.collection.pipeline.details.enabled: true
xpack.monitoring.elasticsearch.hosts: "localhost:9200"

For LS 8.x you should use Metricbeat with the logstash module to see the info in Kibana (a sketch of the module config is below). With Stack Monitoring enabled, even without Metricbeat you should see some metrics in Kibana. Also, while LS is running you can use curl http://localhost:9600/_node/stats/pipelines?pretty to see pipeline performance in JSON format.
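
As a sketch, the Metricbeat logstash module config (modules.d/logstash-xpack.yml) could look like this, assuming Metricbeat runs on the same host and the LS HTTP API listens on the default port 9600:

# modules.d/logstash-xpack.yml (sketch; adjust hosts and credentials to your setup)
- module: logstash
  xpack.enabled: true
  period: 10s
  hosts: ["localhost:9600"]

Enable the module (metricbeat modules enable logstash-xpack, or rename the .disabled file) and the data should appear under Stack Monitoring in Kibana.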

@Rios
Finally, this is the .conf file:

input {
  file {
                path => "/opt/trabaja/csv/trx_hours*.csv"
                start_position => "beginning"
                sincedb_path => "NULL"
                mode => "read"
                file_completed_action => "delete"
                file_sort_by => "last_modified"
                exit_after_read => true
  }
}
filter {
  csv {
    separator => ";"
    convert => {
      "trn" => "float"
      "ssn" => "float"
      "tiempo" => "float"
      "tiempo_sp" => "float"
    }
    columns => ["nodo","fecha","usuario","serviceid","sesid","base","sp","trn","ssn","tiempo","tiempo_sp","observacion"]
  }

mutate  { remove_field => [ "message", "@version","host","[log][file][path]" ] }

date {
   match => [ "fecha_dia", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
   timezone => "America/Argentina/Buenos_Aires"
   target => "@timestamp"
}
mutate  { remove_field => [ "fecha_dia" ]}
}
output{
    elasticsearch {  hosts => ["localhost:9200"]
                     index => "trx_hours_new"
                     user => "elastic"
                     password => "Accusys123*"
                     action =>"index"       }
    }

Now it is indexing 800K lines per minute.
I could not take away this section:

date {
   match => [ "fecha_dia", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
   timezone => "America/Argentina/Buenos_Aires"
   target => "@timestamp"
}
mutate  { remove_field => [ "fecha_dia" ]}
}

Because otherwise @timestamp takes the actual date at the time it is running.
Nevertheless, it was a very good help.
Thanks a lot!!
P.S. I am waiting for my gift.

I cannot see where fecha was copied to fecha_dia in this .conf.

Also, you can set it without copying:

date {
   match => [ "fecha", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
   timezone => "America/Argentina/Buenos_Aires"
   target => "@timestamp"
}

@Rios
Yes, I missed that line in the code, but it was there.

Thank you very much for all the suggestions.

@Rios
You were right, there was no need to create the field fecha_dia to get the right datetime.

date {
   match => [ "fecha", "yyyy-MM-dd HH:mm:ss.SSS" ,"ISO8601"]
   timezone => "America/Argentina/Buenos_Aires"
   target => "@timestamp"
}

I have already changed the whole .conf.
Best regards.
