CSV Logstash filter for big files

Hi Guys,

I have a big CSV file with about 500+ columns. I managed to get them into the LS CSV filter, but those fields/columns will be dynamic, meaning the source may sometimes produce more or fewer columns.

So my question is: do I always need to manually keep those columns up to date in the CSV filter?
Is there any other way to have such big CSV files parsed in LS without manually adding the columns to the filter every time?

Another issue I have: sometimes, when I have a big CSV file (500 columns and 2000 lines, so lots of data), the LS logs show that the field mapping limit (index.mapping.total_fields.limit) of 1000 has been exceeded. I read about it and found several similar cases on here, but I need to understand whether this limit applies to ALL the CSV data fields in the file or just the header fields, i.e. the columns. The way I understand it, it's the number of mappings, which would mean 500 in my case.

Thanks.

Do the CSV files have a header row with all available columns? Also, do the columns stay the same inside the same CSV file?
If that's the case, you can probably use the autodetect_column_names option in the CSV filter to automatically detect the column names.
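For reference, a minimal sketch of what that could look like (the separator and the rest of the pipeline are assumptions, not your actual config):

    filter {
      csv {
        # Take the column names from the first line seen by the filter
        # instead of hard-coding 500+ columns.
        autodetect_column_names => true
        # Don't emit the header line itself as an event.
        skip_header => true
        separator => ","
      }
    }

Note that the detection is based on the first line the filter sees, so it's worth verifying the behaviour if several files with different headers flow through the same pipeline run.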

As for the total fields limit you're hitting on Elasticsearch, I presume it has something to do with how Elasticsearch 5.x+ handles string fields by default.
If the majority of your fields are strings, each of them will be dynamically mapped to 2 separate fields (or rather, a text field plus a keyword sub-field), and as such you may hit the 1,000 mapping limit with only 500+ initial columns. You might need to set a template yourself to avoid this issue.
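To make that concrete, here is a rough template sketch, assuming 5.x-style syntax (on 6.x+ the index pattern key is index_patterns and the _default_ type goes away) and made-up names (csv_logs, csv-*). It maps new strings to a single keyword field and, if you still need the headroom, raises the per-index field limit:

    PUT _template/csv_logs
    {
      "template": "csv-*",
      "settings": {
        "index.mapping.total_fields.limit": 2000
      },
      "mappings": {
        "_default_": {
          "dynamic_templates": [
            {
              "strings_as_keyword": {
                "match_mapping_type": "string",
                "mapping": { "type": "keyword" }
              }
            }
          ]
        }
      }
    }

With only one mapping per string column you'd stay well under the default limit even with 500+ columns, so raising total_fields.limit then becomes optional.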

@paz Thank you for your response, this is very helpful!

To answer your question: yes, my file has a header row.
Each CSV file being generated will have the same structure, but maybe a different number of fields in the header row.

Unfortunately, all my fields are being mapped as strings when LS parses them. This leads me to my second question: will the auto-detected columns get their field types detected as well, or do I still need to do that manually?

I guess the best way is to have a mapping template, but with 500+ headers that may be hard, unless there is some auto-detection for field types or a way to convert them all at once.

autodetect_column_names should work, though I haven't tried it myself with multiple pipeline workers (on the off chance there are concurrency issues).
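If you want to rule that out, one option is to pin the pipeline to a single worker so the header line is guaranteed to be processed first (the config file name below is just a placeholder):

    # In logstash.yml (or per pipeline in pipelines.yml):
    pipeline.workers: 1

    # Or just for a test run on the command line:
    bin/logstash -f your-pipeline.conf -w 1

A single worker costs some throughput, but it takes event ordering out of the equation while you test the autodetection.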

As for type autodetection, I wouldn't rely on possible Logstash or Elasticsearch type coercion capabilities, because it can lead to unexpected errors and issues, especially with a large number of different fields, for example:

  • Trying to perform numeric aggregations in ES (e.g. max or avg), only to find out that the field is mapped as a string.
  • Having a field whose possible values include integer-like strings. In that case, ES might automatically map it as integer if such a value appears in the first document of a new index, and all subsequent documents with non-numeric values will fail with a mapping parser exception.
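If you'd rather pin the types down explicitly on the Logstash side instead of relying on coercion, the csv filter's convert option (or a mutate filter) can do per-column conversion — a small sketch with made-up column names:

    filter {
      csv {
        autodetect_column_names => true
        skip_header => true
        # Explicitly coerce selected columns; the names are examples only.
        convert => {
          "bytes_sent"    => "integer"
          "response_time" => "float"
        }
      }
    }

That still means listing the columns whose type matters, but only those, not all 500+.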

Hard-coding a template might indeed be tedious, but it could save you a lot of potential hassle down the line.
To avoid writing a huge template from scratch, you could also pull the current mapping from an Elasticsearch index, change the type of any field you like, and use that as your template.
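Roughly like this, with placeholder index/template/field names and 5.x-style syntax (the exact mapping structure depends on your ES version and document type):

    # 1) Dump the mapping Elasticsearch already generated from your CSV data:
    GET my-csv-index/_mapping

    # 2) Edit the types you care about in the returned JSON and save it as a
    #    template for future indices (only one made-up field shown here):
    PUT _template/csv_logs
    {
      "template": "csv-*",
      "mappings": {
        "_default_": {
          "properties": {
            "response_time": { "type": "float" }
          }
        }
      }
    }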

@paz thanks again for the information.
