Logstash: CSV filter pattern-based field name detection from header row

Summary: my current understanding is that, in a CSV file -> FileBeat -> Logstash pipeline, the Logstash CSV filter cannot reliably detect field names from the column headings within the CSV; a workaround or alternative is needed.

I have read the following documentation/discussions, amongst others:

Large CSV files with variable header information

We have multiple large CSV files containing stats generated by the applications we are monitoring. For example, a file records various stats for each process at intervals. Actual stats could include per-process user/system CPU time, sleep time, runqueue time etc. for the timeslice under consideration.

For the purposes of this discussion, let the letters A through E represent the stats. The list of processes varies from application to application, so that the CSV files produced will have variable column schemata.

Therefore, an example file recording 5 stats (A-E) for two processes (PROCESS1 and PROCESS2) might look like this:

DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS1-C,PROCESS1-D,PROCESS1-E,PROCESS2-A,PROCESS2-B,PROCESS2-C,PROCESS2-D,PROCESS2-E
20190904,00:00:30,4,2,13,1,0,3,8,2,1,0
20190904,00:01:00,6,1,11,0,0,3,10,6,3,0

etc.

The actual CSV files may have hundreds of columns, which is also a problem, but outside the scope of this discussion.

Limitations of a mature enterprise environment

While the structure of the CSV file is not ideal, please note that:

  • it is not possible to modify the generating application.
  • it is not possible to infer the name/order of the columns from the CSV name/path, but only from the header contained within the file.

CSV filter autodetect_column_names: a fairly blunt tool

When autodetect_column_names is enabled, the csv filter treats the first event it receives after a Logstash restart as the header row. This works well only if that first event actually is the CSV header; otherwise, data values are used as field names, which is clearly undesirable.

For example, if the first document sent by FileBeat after a Logstash restart is a data row, Logstash generates a field called 20190904.
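For reference, a minimal sketch of the filter configuration in question (the separator shown is simply the default):

filter {
  csv {
    # Treat the values of the first event seen after a (re)start as the
    # field names for all subsequent events.
    autodetect_column_names => true
    separator => ","
  }
}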

Headers changing within CSV file(!)

A further complication in our situation is that headers may change within the file; e.g. if a new process is added intraday, you might see this:

DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS1-C,PROCESS1-D,PROCESS1-E,PROCESS2-A,PROCESS2-B,PROCESS2-C,PROCESS2-D,PROCESS2-E
20190904,00:00:30,4,2,13,1,0,3,8,2,1,0
20190904,00:01:00,6,1,11,0,0,3,10,6,3,0
DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS1-C,PROCESS1-D,PROCESS1-E,PROCESS2-A,PROCESS2-B,PROCESS2-C,PROCESS2-D,PROCESS2-E,PROCESS3-A,PROCESS3-B,PROCESS3-C,PROCESS3-D,PROCESS3-E
20190904,00:01:30,6,1,11,0,0,3,10,6,3,0,3,2,8,0,1

etc.

Questions

Logstash: recognize header rows based on a pattern?

Is there any way of causing Logstash to treat rows matching a certain pattern as header information, and everything else as a data row?

At present, this would be the preferred option, since it wouldn't require additional logic/processing on the FileBeat host, and would keep processing in Logstash. This may still present an issue if the first row sent is not a header row.

FileBeat: send header information?

Is there a way of causing FileBeat to send additional header information alongside each document representing a CSV row?

Alternatives?

Is there any alternative solution or workaround not considered here?

Not in a csv filter, but you could implement this in a ruby filter. You would need to preserve event ordering, which requires '--pipeline.workers 1' and (for now) '--java-execution false'
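For illustration, a minimal sketch of what such a ruby filter could look like, assuming header rows can be recognised because they begin with DATE,TIME (as in the samples above) and that each raw line arrives in the message field:

filter {
  ruby {
    # @columns holds the most recently seen header row; with a single
    # pipeline worker this state is shared by all events in file order.
    init => "@columns = nil"
    code => "
      line = event.get('message').to_s
      if line.start_with?('DATE,TIME')
        # Header row: remember the column names and drop the event.
        @columns = line.split(',')
        event.cancel
      elsif @columns
        # Data row: pair each value with the corresponding column name.
        @columns.zip(line.split(',')).each { |name, value| event.set(name, value) }
      else
        # Data row arrived before any header row; mark it for later handling.
        event.set('csv_header_missing', true)
      end
    "
  }
}

This is only a sketch, not tested configuration; it relies on events arriving in file order, hence the single-worker requirement mentioned above.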

Thank you @Badger. No doubt you are right that a Ruby filter would be a viable way of achieving this, although I expect I will have to research/experiment extensively to get to that point.

Legacy Ruby engine

You mentioned that java-execution would need to be set to false. The command line options documentation states:

--java-execution: specify false for this option to revert to the legacy Ruby execution engine instead of the default Java execution engine.

Could you please explain why use of the legacy engine is required? I couldn't see anything in the Ruby plugin documentation which mentions this option or even Java at all, but I may have missed something.

I expect use of that legacy engine may negatively impact performance, but does it also remove/alter functionality? I have read the Java Execution Engine introductory blog post, but was unable to answer my own questions.

Beyond those considerations, I am reluctant to rely on the legacy engine at all; according to the blog post:

the Ruby execution engine is slated for eventual removal

pipeline.workers: 1

I see why setting pipeline.workers to 1 is necessary at present, i.e. to preserve event ordering. I hope that won't cause performance issues once this goes to production, where data volumes may be quite high.

The Java execution engine re-orders events, even with a single pipeline worker. Elastic will fix this; we just do not know when. I cannot imagine that they would remove the old engine before fixing it.


Hello,
This is interesting, but while a naive implementation using a ruby filter might be possible, it would stop working as soon as event ordering can no longer be guaranteed, as @Badger stated, for example with:

  • multiple Filebeat instances
  • a single Logstash instance using multiple pipeline workers (you could limit them to 1, but that would become a bottleneck)
  • multiple Logstash instances

In my opinion, the most reliable solution would be to ensure every event has the field names at the source.
If you cannot modify the producer, the job of converting your "proprietary format" (it's CSV, but headers are introduced within the same file) to a self-contained format (where each event contains both the field names and their values) should be done by Filebeat or by an external application.

Filebeat supports decoding CSV into an array of values using the decode_csv_fields processor, but it is not able to interpret CSV headers. Unfortunately, it doesn't fully satisfy your needs, as you cannot infer the headers from the type of process...

I might suggest:

  • Develop your own codec/beat (see the community beats)
  • Implement a periodic process that takes the raw logs, reads them locally, and writes one self-descriptive JSON event per line (each line containing both the field names and the values). You'll then be able to ingest them via Filebeat (a rough sketch follows below).
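To illustrate the second suggestion, a rough sketch of such a conversion in Ruby (the file names and the DATE-based header detection are assumptions based on the samples above, not a tested implementation):

# Convert a stats CSV that may contain repeated header rows into one
# self-describing JSON object per line, ready to be picked up by Filebeat.
require 'json'

columns = nil
File.open('stats.json', 'w') do |out|
  File.foreach('stats.csv') do |line|
    fields = line.strip.split(',')
    next if fields.empty?
    if fields.first == 'DATE'
      # Header row: remember the current column layout.
      columns = fields
    elsif columns
      # Data row: pair each value with its column name and emit JSON.
      out.puts(columns.zip(fields).to_h.to_json)
    end
  end
end

Each output line then carries its own field names, so neither Filebeat nor Logstash needs to track headers or event ordering.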

I cannot think of any alternative solution at the moment that would work out of the box.
I hope you'll find this information useful.

Cheers,
Luca

Thank you @Luca_Belluccini.

After reading this, I agree that adding the information at the source - in my case, FileBeat and/or a custom Beat - is the way forward here. Otherwise, as you point out, by the time the document reaches Logstash, the required information can no longer be derived.
