Summary: current understanding is that CSV file -> FileBeat -> Logstash -> Logstash CSV filter cannot reliably detect field names from column headings within CSV; workaround or alternative needed.
I have read the following documentation/discussions, amongst others:
- CSV plugin documentation: https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html
- Discuss topic: Reading the heading (1st line) of CSV file through logstash
Large CSV files with variable header information
We have multiple large CSV files containing stats generated by the applications we are monitoring. For example, a file records various stats for each process at intervals. Actual stats could include per-process user/system CPU time, sleep time, runqueue time etc. for the timeslice under consideration.
For the purposes of this discussion, let the letters A through E represent the stats. The list of processes varies from application to application, so that the CSV files produced will have variable column schemata.
Therefore, an example file recording 5 stats (A-E) for two processes (PROCESS1 and PROCESS2) might look like this:
DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS1-C,PROCESS1-D,PROCESS1-E,PROCESS2-A,PROCESS2-B,PROCESS2-C,PROCESS2-D,PROCESS2-E
20190904,00:00:30,4,2,13,1,0,3,8,2,1,0
20190904,00:01:00,6,1,11,0,0,3,10,6,3,0
etc.
The actual CSV files may have hundreds of columns, which is also a problem, but outside the scope of this discussion.
Limitations of a mature enterprise environment
While the structure of the CSV file is not ideal, please note that:
- it is not possible to modify the generating application.
- it is not possible to infer the name/order of the columns from the CSV name/path, but only from the header contained within the file.
CSV filter autodetect_column_names: a fairly blunt tool
When autodetect_column_names is enabled, Logstash treats the first line received after Logstash restart as containing header information. This works well only if the first line ingested is actually the header of the CSV; otherwise, values are used as field names, which is clearly undesirable.
For example, if the first document sent by FileBeat after Logstash restart is a data row, a field called 20190904 is generated by Logstash.
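To make the failure mode concrete, here is a minimal Python sketch. It is only a model of the behaviour described above, not Logstash's actual implementation, and the function name parse_first_line_as_header is made up for this illustration.

```python
def parse_first_line_as_header(lines):
    """Model of 'first line seen supplies the field names' (illustrative only)."""
    columns = None
    events = []
    for line in lines:
        values = line.split(",")
        if columns is None:
            # Whatever arrives first becomes the header, even if it is data.
            columns = values
            continue
        events.append(dict(zip(columns, values)))
    return events


# Two data rows arriving with no header row in front of them.
lines = [
    "20190904,00:00:30,4,2,13",
    "20190904,00:01:00,6,1,11",
]
events = parse_first_line_as_header(lines)
print(events[0])  # the field names are taken from the first data row
```

The first row is consumed as the header, so the second row ends up with fields named 20190904, 00:00:30 and so on.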
Headers changing within CSV file(!)
A further complication in our situation is that headers may change within the file; e.g. if a new process is added intraday, you might see this:
DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS1-C,PROCESS1-D,PROCESS1-E,PROCESS2-A,PROCESS2-B,PROCESS2-C,PROCESS2-D,PROCESS2-E
20190904,00:00:30,4,2,13,1,0,3,8,2,1,0
20190904,00:01:00,6,1,11,0,0,3,10,6,3,0
DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS1-C,PROCESS1-D,PROCESS1-E,PROCESS2-A,PROCESS2-B,PROCESS2-C,PROCESS2-D,PROCESS2-E,PROCESS3-A,PROCESS3-B,PROCESS3-C,PROCESS3-D,PROCESS3-E
20190904,00:01:30,6,1,11,0,0,3,10,6,3,0,3,2,8,0,1
etc.
Questions
Logstash: recognize header rows based on a pattern?
Is there any way of causing Logstash to treat rows matching a certain pattern as header information, and everything else as a data row?
At present, this would be the preferred option, since it wouldn't require additional logic/processing on the FileBeat host, and would keep processing in Logstash. This may still present an issue if the first row sent is not a header row.
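To be clear about the behaviour I am after, here is a rough Python sketch of the logic. It is not a Logstash feature that I know to exist; in Logstash it would presumably have to be written as custom filter logic. It assumes that every header row starts with "DATE,TIME,", that lines from one file arrive in their original order, and that each line carries its source path. The names HeaderAwareParser and HEADER_PATTERN are hypothetical.

```python
import re

# Assumption: every header row, and no data row, starts with "DATE,TIME,".
HEADER_PATTERN = re.compile(r"^DATE,TIME,")


class HeaderAwareParser:
    """Remembers the most recent header seen for each source file."""

    def __init__(self):
        self.columns_by_source = {}

    def parse(self, source, line):
        """Return a dict for a data row, or None if the line yields no event."""
        if HEADER_PATTERN.match(line):
            # A new header replaces the previous one for this file.
            self.columns_by_source[source] = line.split(",")
            return None
        columns = self.columns_by_source.get(source)
        if columns is None:
            return None  # data row seen before any header for this file
        values = line.split(",")
        if len(values) != len(columns):
            return None  # row does not fit the current header
        return dict(zip(columns, values))


parser = HeaderAwareParser()
lines = [
    "DATE,TIME,PROCESS1-A,PROCESS1-B",
    "20190904,00:00:30,4,2",
    "DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS2-A",  # header changes intraday
    "20190904,00:01:00,6,1,3",
]
events = []
for line in lines:
    event = parser.parse("stats.csv", line)
    if event is not None:
        events.append(event)
```

A data row that arrives before any header for its file is simply dropped here, which is exactly the remaining issue mentioned above; a real solution would need to buffer or tag such rows instead.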
FileBeat: send header information?
Is there a way of causing FileBeat to send additional header information alongside each document representing a CSV row?
Alternatives?
Is there any alternative solution or workaround not considered here?
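One fallback I can imagine, though I would rather avoid it because it adds processing on the FileBeat host, is a small pre-processing step that rewrites each CSV row as a JSON line, so that every document carries its own field names. The Python sketch below is only an illustration under the same assumption that header rows start with "DATE,TIME,"; the function name csv_to_json_lines and the header_prefix parameter are hypothetical.

```python
import json


def csv_to_json_lines(lines, header_prefix="DATE,TIME,"):
    """Yield one JSON document per data row, keyed by the latest header seen."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(header_prefix):
            columns = line.split(",")  # remember the current header
            continue
        if columns is None:
            continue  # data before any header is skipped in this sketch
        yield json.dumps(dict(zip(columns, line.split(","))))


sample = [
    "DATE,TIME,PROCESS1-A,PROCESS1-B\n",
    "20190904,00:00:30,4,2\n",
    "DATE,TIME,PROCESS1-A,PROCESS1-B,PROCESS2-A\n",
    "20190904,00:01:00,6,1,3\n",
]
documents = list(csv_to_json_lines(sample))
for document in documents:
    print(document)
```

The resulting file could then be shipped as JSON instead of CSV, which would sidestep header detection in Logstash altogether.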