Document design for indexing CSV files


Hey, I have a large number of CSV files that I want to index. My first idea for mapping those files is a simple array, where each array element is a list of n strings, one per column, so that each array element is one record of the CSV.
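In Elasticsearch terms, each CSV record would become one document, and this first approach amounts to one mapping per schema with one field per column. A minimal sketch, assuming hypothetical columns colA, colB, colC:

```json
PUT csv_schema_a
{
  "mappings": {
    "properties": {
      "colA": { "type": "keyword" },
      "colB": { "type": "keyword" },
      "colC": { "type": "keyword" }
    }
  }
}
```

With many different schemas this means one such mapping per schema, which is exactly the proliferation problem described below.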
Unfortunately I have a lot of different schemas in my CSVs. This would mean one mapping per schema. While I am not sure whether this approach is bad for performance, it is definitely bad for usability in my use case, where we filter a lot based on the document type (and we don't need such a fine-grained document filter that respects the different schemas).
Another approach I was thinking of is a nested array. The outer array contains the records, as in the first approach, but instead of a list of strings its elements are themselves arrays whose elements are pairs of strings: column_name and column_content. In this case, however, a query on the column names becomes much more complicated.
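This second approach is essentially the key/value (or `nested`) pattern: every record stores its columns as a list of name/value pairs, so all schemas fit one mapping, at the cost of needing nested queries that match name and value together. A small sketch of the document transformation in Python, using hypothetical column names colA and colB:

```python
import csv
import io

def row_to_pairs(row):
    """Turn one CSV record (a dict of column -> value) into the
    key/value representation: a list of name/value pairs."""
    return [{"name": col, "value": val} for col, val in row.items()]

# Hypothetical two-column CSV for illustration.
data = "colA,colB\nfoo,bar\n"
reader = csv.DictReader(io.StringIO(data))
docs = [{"columns": row_to_pairs(row)} for row in reader]
```

Every resulting document has the same shape regardless of the source schema, which is what makes a single mapping possible.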
I also thought of a third approach: one array for the records, but each record contains ALL columns from ALL CSV files, filled either with the real value if the CSV has that column or with a NULL value if it doesn't. But the number of distinct columns is just too big (around 200, whereas the typical CSV has fewer than 10).

So do you have any ideas what mapping would be useful for me?

(Mark Walkom) #2

Are these all in the same place (on a filesystem or whatever), or are they logically separated, or can someone logically separate them, even dynamically?
If it's either of the latter two, you could just have a bunch of different inputs with a different CSV filter in each. But it's still a bunch of work to do.


Someone could separate them; the filenames for the different "categories" follow the same pattern.
I think you are proposing to use Logstash with something like:

filter {
  csv {
    columns => ["colA","colB","colC"]
    separator => ","
  }
}

...for every category.
But this is not what I want. I am looking for a clever mapping to reduce the overall number of mappings.
It would be perfect if a mapping could have optional fields. Then I could have one mapping with 200 fields, of which 197 are optional, and I wouldn't have to store empty strings for CSVs that don't have the respective column. Alternatively, all CSV mappings could share a common name: internally I would have mappings like csv1, csv2, csv3, ..., but I could filter on the mapping csv, which would be an alias for an OR-concatenation of all csv_n mappings.
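Two observations may help here. First, Elasticsearch fields are effectively optional already: a mapping declares fields, but a document that omits a field simply doesn't store anything for it, so "197 optional fields" doesn't require storing empty strings. Second, the "common name over csv1, csv2, csv3" idea corresponds to an index alias, if each schema lives in its own index. A sketch, assuming hypothetical index names:

```json
POST /_aliases
{
  "actions": [
    { "add": { "index": "csv1", "alias": "csv" } },
    { "add": { "index": "csv2", "alias": "csv" } },
    { "add": { "index": "csv3", "alias": "csv" } }
  ]
}
```

Searching against `csv` then queries all the underlying indices at once, which gives exactly the OR-concatenation behavior described above.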


So, does anybody else have an idea?
