How can Logstash get the total number of lines in a file (count)?


(alicia) #1

Hi,

I have 3 CSV files on Windows: Error.csv stores the error logs, success.csv stores all success events, and totalEvent.csv stores all incoming events (= error logs + success events).

I want a graph in Kibana that shows the total number of success events among the incoming events, without showing the details of each success event.
So I just want to get the total number of lines in the success.csv and totalEvent.csv files in Logstash and then send those counts to Kibana (as a graph).

I have not found a Logstash plugin that can do this (get the total number of lines in a file). Any ideas?

Thanks very much.


(Paris Mermigkas) #2

If you just want to visualize the number of events of each type in Kibana, you don't really need Logstash to calculate that information.

You can easily get that number with a terms aggregation in Kibana (the terms being the event type), which returns the document count for each type.

There's also the Count API of Elasticsearch, but I do not know whether it works with Kibana.

If, however, you still need to calculate those values in Logstash, you can probably use the metrics filter, which produces separate events at regular intervals listing the total count of events passed, as well as rates and so on.


(alicia) #3

Hi @paz,

Thanks for your reply.

The terms aggregation and the Count API work on data that is already indexed in Elasticsearch.
This means that all the data in success.csv and totalEvent.csv must be indexed (we have a lot of lines in these files),
but I don't want Elasticsearch to keep this data in memory, so I am looking for a way to send only the line count of each file to Kibana.

I tested the metrics filter, but it's not exactly what I want: once the files (success.csv and totalEvent.csv) are created, no new lines are added, so we do not need to check them periodically.

Is there no other way to calculate this (the line count of a file) in Logstash?

Thanks.


(Paris Mermigkas) #4

So you basically read those files only once every set period of time, like 1 hour or 1 day, and you only need to count the number of lines read?

To my knowledge there is no easy out-of-the-box way, because Logstash can't understand the notion of "reached the end of file" or similar (it's basically designed to handle a kind of constant stream of events).

One way would still be to use the metrics filter with a long enough flush_interval, so that only 1 event is emitted after the file is processed.
Other solutions could involve the aggregate filter or some custom Ruby code, but neither would be more straightforward than the metrics avenue.

Basically the problem is that there is no (easy) way to signal to Logstash's pipeline that the file has reached its end so that it can take some arbitrary action. One would have to rely on ballpark estimates of when processing ends.


(alicia) #5

Hi @paz,

Thanks for your reply.

Yes, the success file is generated every 20 minutes in a dedicated directory like /date-time/success.csv (because we have a job that launches every 20 minutes),
so we have a Logstash config that reads all the CSV files in that directory (/date-time/).

I tested the metrics filter; it's easier to use than the others.

This means that I will have to use the metrics filter with an appropriate interval. Can we make the metrics filter fire only once (because the success file is unique and final, no new lines are added)?

Thanks.


(Paris Mermigkas) #6

Yes, you can use a single time. A possible problem is that, because clear_interval and flush_interval start counting from the moment you start Logstash and not from the moment you start ingesting a new file, there might be cases where metrics are created in the middle of a file.

I don't know exactly what your data look like, but you could use some form of timestamp identifier to separate the metrics of different logs, like so:

filter {
  metrics {
    meter => "%{time}_%{field}_count"  # <-- 'time' is an appropriate timestamp (e.g. date of the CSV). 'field' is whatever field holds the event type (success, error, etc.)
    add_tag => "metrics"
    flush_interval => 1300  # <-- in seconds, 20 minutes + however long it takes to process one CSV file.
  }
}
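
A minimal sketch for checking what the filter emits before sending anything to Elasticsearch. The metrics filter documentation describes each meter as a nested field on the flushed event, with the running total under count next to the rate_1m, rate_5m and rate_15m values. Printing only the tagged events to the console shows the exact field names produced by the "%{time}_%{field}_count" pattern. The stdout output is added here purely as a debugging aid and is an assumption, not part of the setup above.

```
output {
  # Assumption: temporary debugging output, not the final Elasticsearch output.
  # Print only the events generated by the metrics filter (tagged "metrics" above).
  if "metrics" in [tags] {
    stdout { codec => rubydebug }
  }
}
```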

Also, ideally, when you insert events into Elasticsearch you can supply your own document ID, but it's slightly more complicated, so try that only if the above fails.


(alicia) #7

Hi @paz,

Thanks.

I'll try this.
When I use the metrics filter on the success file, is every line of data loaded and kept in memory in ES?
We could have some large success files.

Thank you.


(Paris Mermigkas) #8

No, the actual data are not retained in memory, just a hash map that keeps track of the counters. So if you have few but large files, it should actually be more lightweight than having many but small files.

And you can discard all the actual log events just before your Logstash output by using a suitable condition, something like:

filter {
  if 'metrics' not in [tags] {
    drop {}
  }
}

output {
  ...
}

This way, only the metric events will end up in Elasticsearch.


(alicia) #9

Hi @paz,

Thanks for your reply.

I tested the metrics filter with flush_interval set to 20 minutes on a file with 15 log lines, and sent the result to Kibana.

I get count = 15 from the metrics filter, but the event is sent to ES every 20 minutes for this success file, so every 20 minutes I get a new event (from the metrics filter) in the Kibana console. How can I make the metrics filter send just one event to ES (or stop after emitting one event), so that I have just one line in the Kibana console?

Thanks.


(Paris Mermigkas) #10

Yeah, that's kind of the expected behavior. A new metric event will be emitted every 20 minutes as long as Logstash is running.

One way to work around this is to provide your own document ID in the Elasticsearch output. That way, each time a new event is sent, it will just overwrite the old one instead of creating another, leaving you with a single event. Like so:

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "test"
        document_id => "123456789"
    }
}

#11

How about an exec input that runs a line count against the file every 1200 seconds?
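
A rough sketch of that idea, assuming a fixed, hypothetical file path (C:\data\success.csv) and the Windows type and find commands to print the number of lines. The command and interval options belong to the exec input; the path, the "linecount" tag and the line_count field name are made up for illustration, and the quoting of the command may need adjusting depending on how Logstash launches the process on your machine. Since the real files land in a new date-time directory on every run, the command would also have to locate the latest file, which is not shown here.

```
input {
  exec {
    # Hypothetical path. "find /c /v" with an empty search string counts all lines.
    command => 'cmd /c type C:\data\success.csv | find /c /v ""'
    interval => 1200          # in seconds (20 minutes)
    tags => ["linecount"]     # hypothetical tag used to select these events below
  }
}

filter {
  if "linecount" in [tags] {
    # The command output arrives in the "message" field; trim whitespace first.
    mutate {
      strip => ["message"]
    }
    # Within one mutate block, rename is applied before convert.
    mutate {
      rename => { "message" => "line_count" }
      convert => { "line_count" => "integer" }
    }
  }
}
```

Each run produces one small event carrying only the count, so the lines of the CSV file themselves never pass through the pipeline or reach Elasticsearch.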

