CSV Logstash filter for big files

Hi Guys,

I have a big CSV file with about 500+ columns. I managed to get them into the LS CSV filter, but those fields/columns will be dynamic, meaning the source may sometimes produce more or fewer columns.

So my question is: do I always need to manually keep those columns up to date in the CSV filter?
Is there any other way to have such big CSV files parsed in LS without manually adding the columns to the filter every time?

Another issue I have: sometimes, when I have a big CSV file (500 columns and 2000 lines, so lots of data), the LS logs show that the field mapping limit (index.mapping.total_fields.limit) of 1000 has been exceeded. I read about it and found several similar cases on here, but I need to understand whether this limit applies to ALL the CSV data fields in the file or just the header fields, i.e. the columns. The way I understand it, it's the number of mappings, which would mean 500 in my case.

Thanks.

Do the CSV files have a header row with all available columns? Also, do the columns stay the same inside the same CSV file?
If that's the case, you can probably use the autodetect_column_names option in the CSV filter to automatically detect the column names.
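For reference, a minimal sketch of what that could look like (the separator and the rest of the pipeline are assumptions, not your actual config):

    filter {
      csv {
        # Take the column names from the first line seen by the filter
        # instead of hard-coding 500+ columns.
        autodetect_column_names => true
        # Don't emit the header line itself as an event.
        skip_header => true
        separator => ","
      }
    }

Note that the detection is based on the first line the filter sees, so it's worth verifying the behaviour if several files with different headers flow through the same pipeline run.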

As for the total fields limit you're hitting on Elasticsearch, I presume it has something to do with how Elasticsearch 5.x+ handles string fields by default.
If the majority of your fields are strings, each of them will be dynamically mapped to 2 separate fields (or rather, a text field plus a keyword sub-field), and as such you may hit the 1,000 mapping limit with only 500+ initial columns. You might need to set a template yourself to avoid this issue.
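To make that concrete, here is a rough template sketch, assuming 5.x-style syntax (on 6.x+ the index pattern key is index_patterns and the _default_ type goes away) and made-up names (csv_logs, csv-*). It maps new strings to a single keyword field and, if you still need the headroom, raises the per-index field limit:

    PUT _template/csv_logs
    {
      "template": "csv-*",
      "settings": {
        "index.mapping.total_fields.limit": 2000
      },
      "mappings": {
        "_default_": {
          "dynamic_templates": [
            {
              "strings_as_keyword": {
                "match_mapping_type": "string",
                "mapping": { "type": "keyword" }
              }
            }
          ]
        }
      }
    }

With only one mapping per string column you'd stay well under the default limit even with 500+ columns, so raising total_fields.limit then becomes optional.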

@paz Thank you for your response, this is very helpful!

To answer your question: yes, my file has a header row.
Each CSV file being generated will have the same structure, but maybe a different number of fields in the header row.

Unfortunately, all my fields are being mapped as strings when LS parses them. This leads me to my second question: will the auto-detected columns get their field types detected as well, or do I still need to do that manually?

I guess the best way is to have a mapping template, but with 500+ headers that may be hard, unless there is some auto-detection for field types or a way to convert them all at once.

autodetect_column_names should work, though I haven't tried it myself with multiple pipeline workers (on the off chance there are concurrency issues).
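If you want to rule that out, one option is to pin the pipeline to a single worker so the header line is guaranteed to be processed first (the config file name below is just a placeholder):

    # In logstash.yml (or per pipeline in pipelines.yml):
    pipeline.workers: 1

    # Or just for a test run on the command line:
    bin/logstash -f your-pipeline.conf -w 1

A single worker costs some throughput, but it takes event ordering out of the equation while you test the autodetection.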

As for type autodetection, I wouldn't rely on possible Logstash or Elasticsearch type coercion capabilities, because it can lead to unexpected errors and issues, especially with a large number of different fields, for example:

  • Trying to perform numeric aggregations in ES (e.g. max or avg), only to find out that the field is mapped as a string.
  • Having a field whose possible values include integer-like strings. In that case, ES might automatically map it as integer if such a value appears in the first document of a new index, and all subsequent documents with non-numeric values will fail with a mapping parser exception.
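If you'd rather pin the types down explicitly on the Logstash side instead of relying on coercion, the csv filter's convert option (or a mutate filter) can do per-column conversion — a small sketch with made-up column names:

    filter {
      csv {
        autodetect_column_names => true
        skip_header => true
        # Explicitly coerce selected columns; the names are examples only.
        convert => {
          "bytes_sent"    => "integer"
          "response_time" => "float"
        }
      }
    }

That still means listing the columns whose type matters, but only those, not all 500+.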

Hard-coding a template might indeed be tedious, but it could save you a lot of potential hassle down the line.
To avoid writing a huge template from scratch, you could also pull the current mapping from an Elasticsearch index, change the type of any field you like, and use that as your template.
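Roughly like this, with placeholder index/template/field names and 5.x-style syntax (the exact mapping structure depends on your ES version and document type):

    # 1) Dump the mapping Elasticsearch already generated from your CSV data:
    GET my-csv-index/_mapping

    # 2) Edit the types you care about in the returned JSON and save it as a
    #    template for future indices (only one made-up field shown here):
    PUT _template/csv_logs
    {
      "template": "csv-*",
      "mappings": {
        "_default_": {
          "properties": {
            "response_time": { "type": "float" }
          }
        }
      }
    }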

@paz thanks again for the information.
