Logstash performance issues

Hello,

I am getting really frustrated with a Logstash problem.

I have a total of 2,850,000 messages every 15 minutes (3 different document types), and on one of the types I have a 2-hour delay.

2,850,000 messages every 15 minutes = 190,000 messages per minute

I have a dedicated Logstash server (24 CPUs, 24 GB RAM), a 3-node Elasticsearch cluster (each node with 12 CPUs and 16 GB RAM), and a dedicated Kibana server.

Logstash configuration:

pipeline.workers: 24
pipeline.output.workers: 24
pipeline.batch.size: 250
pipeline.batch.delay: 5

I am starting to think about looking for a better ETL tool, because this performance does not seem normal (I also have 3 shards and 1 replica per type).

You will probably get better performance by running multiple instances of Logstash.

This also depends on what your config is, what version of the stack you are on, your OS and even your JVM.

What does your Logstash configuration look like? What does your data look like? How much indexing throughput is your Elasticsearch cluster able to handle?

You are lamenting Logstash's performance without showing us how you have it configured. 190,000 messages per minute is only 3,167 messages per second. Unless they're extremely complex, and/or have heavy-duty enrichment going on, this should be achievable with that single server. There may be some constraints on I/O, but without knowing how you've configured anything, there's nothing we can do to assist you.


My Logstash conf looks like this:

input {
  file {
    path => "/data/logstash/data/edr/P_ASK_*.txt"
    type => "edr"
    max_open_files => 30000
  }
}
filter {
  if [type] == "edr" {
    csv {
      columns => [ "Date","CTE","CT","SubsId","MastN","MDN","MCC","ID","ModFlag","ParentalControlFlag","RuleBase","FairUseFlag","Flows","Label","Notification","TotalOctets","Grantectets","AP","QoS-Rule","Usamit","Januomer" ]
    }
    if [message] =~ "\bCTE\b" {
      drop { }
    }
    mutate {
      remove_field => [ "message", "host", "path" ]
    }
    date {
      match => [ "Date" , "UNIX" ]
      remove_field => ["Date"]
    }
  }
}
output {
  if [type] == "edr" {
    elasticsearch {
      hosts => ["opm1zels01.com:9200","opm1zels02.com:9200","opm1zels03.com:9200"]
      index => "ed-%{+YYYY.MM.dd}"
    }
  }
}

If this is your complete configuration, and there are no other configuration files or inputs anywhere else in your pipeline, then you don't need either of the if [type] == "edr" { lines. Those conditionals would be checking every line, but every line would already be of type edr because of the type => "edr" line in your file input block.

Those conditionals will each add a small amount of latency for your messages.

This regular expression reads each full line to find "\bCTE\b", which is much more expensive in terms of processing time than looking for the value CTE in an individual field. You're already breaking the CSV down into individual fields in the csv filter. Why check the entire message if the value will only be in a given field? This could be slowing things down dramatically.

If everything is truly CSV, then you could replace this with the dissect filter and get a non-trivial performance and throughput boost.
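
As a rough illustration, a dissect mapping for this data might start out like the following (a sketch only, showing just the first few of your columns; the remaining columns would continue in the same comma-separated pattern):

dissect {
  mapping => {
    # each %{...} captures everything up to the next comma, with no regular expressions involved
    "message" => "%{Date},%{CTE},%{CT},%{SubsId},%{rest}"
  }
}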

These are a few quick observations, in no particular order of relative or expected performance gain.


Great, thank you.

In fact, I have a header line in each file. I don't quite see how to do what you're talking about.
So, maybe something like this:

if [CTE] = "CTENAME" { drop {} }
csv {
      columns => [ "Date","CTE","CT","SubsId","MastN","MDN","MCC","ID","ModFlag","ParentalControlFlag","RuleBase","FairUseFlag","Flows","Label","Notification","TotalOctets","Grantectets","AP","QoS-Rule","Usamit","Januomer" ]
    }

?

I don't know how to do that :frowning: and I'm not sure whether it's installed (ELK 5.5).

Because it's a conditional, it should be:

if [CTE] == "CTENAME" { drop {} }

5.5 should have the dissect filter installed by default. The linked blog post in my earlier answer shows how to start configuring the dissect filter.

OK, thank you @theuntergeek.
My config now looks like this:

input {
  file {
    path => "/data/logstash/data/edr/P_ASK_*.txt"
    max_open_files => 30000
  }
}
filter {
    csv {
      columns => [ "Date","CTE","CT","SubsId","MastN","MDN","MCC","ID","ModFlag","Parental","RuleBase","Faig","Flows","Label","Notification","TotalOctets","Grantectets","AP","QoS-Rule","Usamit","Januomer" ]
    }

    if [CTE] == "CTENAME" { drop {} }

    mutate {
      remove_field => [ "message", "host", "path" ]
    }
    date {
      match => [ "Date" , "UNIX" ]
      remove_field => ["Date"]
    }
}
output {
    elasticsearch {
      hosts => ["opm1zels01.com:9200","opm1zels02.com:9200","opm1zels03.com:9200"]
      index => "ed-%{+YYYY.MM.dd}"
    }
}

One last thing: I have to put max_open_files => 30000 in the input, because without it I get error messages about the max open files limit being reached (even though ulimit is set to unlimited...).

So, is it good like this, @theuntergeek:

dissect {
      mapping => {
        "message" => "%{Date},%{CTE},%{CT},%{SubsId},%{MastDN},%{MDN},%{MCC},%{ID},%{ModFlag},%{ParentalControl},%{RBe},%{FFlag},%{Status},%{RedLabel},%{NotificatD},%{UsedT},%{TotalOctets},%{AP},%{List},%{Usmit},%{Janomer}"
      }
    }

What is the difference between %{priority} and %{?priority}?

And what is the difference between %{CTE} and %{+CTE}? (I think the + is to append multiple fields into one, no?)

Is it a problem if I define a date field (for example: CTE1,dd/MM/yyyy HH:mm:ss,CT12) like this: %{CTE},%{Date},%{CT}?

With regard to the dissect filter, since your delimiter is always a comma, you shouldn't have to worry about using the ? or + modifiers in your field names.
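
Since you asked, here is a rough sketch of what the two prefixes do (the field names and the space-delimited format here are made up purely for illustration):

dissect {
  mapping => {
    # %{?skipme}   : the token is matched but no field is stored for it
    # %{+FullName} : the match is appended to the FullName field captured earlier
    "message" => "%{FullName} %{+FullName} %{?skipme} %{City}"
  }
}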

You may want to ingest fewer files at once. I understand that you have many files you want to read in, but this message indicates you may be taxing Logstash by trying to open too many files at once. Try limiting the scope of your glob/wildcard and see if that helps.
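
One way to do that (a sketch, using an assumed narrower filename pattern and the close_older option of the file input) would be:

input {
  file {
    # a narrower glob so fewer files are watched at once (the date suffix here is an assumption about your file naming)
    path => "/data/logstash/data/edr/P_ASK_201709*.txt"
    # close files that have not changed for an hour so the number of open handles stays bounded
    close_older => 3600
  }
}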

I see, the dissect filter has its own remove_field option, so it's better than using a separate mutate.

So I can change

mutate {
      remove_field => [ "message", "host", "path" ]
    }

to

dissect {
      mapping => {
        "message" => "%{xxxxx}...."
      }
    remove_field => [ "message", "host", "path" ]
}

Right? A 44% difference in throughput is amazing.

Events per second:
Dissect           16396
Mutate (rename)   29248

EDIT: With dissect I get some error messages like:

 Dissector mapping, key not found in event
 Dissector mapping, key not found in event
 Dissector mapping, key not found in event
 Dissector mapping, key not found in event
 Dissector mapping, key not found in event

I know why: I have one conf file per type of data (3 currently):

  1. logstash-ed.conf
  2. logstash-ca.conf
  3. logstash-ms.conf

By removing the if [type] == "xx" conditionals, I believe Logstash tries to apply the dissect filter to the events from every conf file. I have not yet changed the others.

I did warn about that in the beginning. You have such a powerful server that you might want to consider a separate pipeline for each configuration file. In 5.x, that means a separate instance of Logstash for each (not necessarily multiple installs, just one install with 3 different configurations). In 6.0, you'll be able to define multiple pipelines within one instance.
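
For reference, in 6.0 that multiple-pipeline setup will be defined in pipelines.yml, roughly like this (the config paths here are assumptions based on your file names):

- pipeline.id: ed
  path.config: "/etc/logstash/conf.d/logstash-ed.conf"
- pipeline.id: ca
  path.config: "/etc/logstash/conf.d/logstash-ca.conf"
- pipeline.id: ms
  path.config: "/etc/logstash/conf.d/logstash-ms.conf"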

Or you could go back to the conditionals the way you had it before.

OK @theuntergeek! But is it bad to have 3 conf files instead of a single one containing all 3 with conditional ifs?
What do you think about that ?

EDIT: we agree that there is no need for a conditional on type in the output section, since I send everything to Elasticsearch in any case?

If they are not guarded by conditionals, then you will export each line to Elasticsearch 3 times.

OK @theuntergeek, thank you very much for your help! You are a good man and you are my sensei now.

Logstash merges the files into one when it reads them in. You would still need conditionals or separate instances of Logstash.

@Beuhlet_Reseau
One thing to note about using Dissect instead of CSV:
Dissect does not check for a comma inside a quoted section the way the CSV filter does.
e.g. a message line like this

Adam Andrews, Beth Bell, "Cliff, Clive", Dave Dent 

and a dissect like this:

%{name_1}, %{name_2}, %{name_3}, %{name_4}, %{others}

will give (not what is expected)

name_1: Adam Andrews
name_2: Beth Bell
name_3: "Cliff
name_4: Clive"
others: Dave Dent

PROTIP 1: Always include an others or rest field at the end. Then check that this field is always empty; if it's not, then your data has changed in some way. Output those events to a file, send an email, or put them in Redis (see the sketch below).

PROTIP 2: Use a named skip field if you know you don't need that data, e.g. %{?host}.
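
Here is a rough sketch combining both tips, with made-up column names and an arbitrary file path for the unexpected events (the conditional belongs in the output section):

filter {
  dissect {
    mapping => {
      # %{?host} is matched but not kept; %{rest} catches anything beyond the expected columns
      "message" => "%{Date},%{CTE},%{?host},%{rest}"
    }
  }
}
output {
  if [rest] != "" {
    # the line had more fields than expected - keep it somewhere you can inspect it
    file { path => "/tmp/dissect_unexpected.log" }
  } else {
    elasticsearch { hosts => ["opm1zels01.com:9200"] }
  }
}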

Hello,

Excuse me, but I don't understand :blush:. What's the aim? Sometimes my fields are empty; is that a problem?

OK, so is it better to use %{?onething}, or to use this:

dissect {
      mapping => {
        "message" => "...|%{onething}"
      }
      remove_field => [ "onething" ]
}