Indexing CSV with header in Elasticsearch

Hi all! I'm using Elasticsearch 6.6 and Kibana 6.4 installed on my Google Cloud account.
Here is my issue: I have a folder there where CSV logs arrive from some IoT devices, and with a Logstash pipeline that I created, I ingest these CSVs into an index in my Elasticsearch instance.
The point is that these CSVs are going to change and will come with headers inside, and I want to somehow include this extra information too and correlate it with the body of the corresponding CSV. So, in the next step, when I perform a search with some keywords from the header, I want to be able to get results related to the body.
How can I do this?
Can anyone help me?

Thank you in advance!

Welcome!

Here is a (very old) tutorial which might help: http://david.pilato.fr/blog/2015/04/28/exploring-capitaine-train-dataset/

Note that there's now a CSV import tool in Kibana as well. That might be useful.

Hi dadoonet, and thanks for the quick response!
This is not exactly what I want; my fault, I didn't explain it clearly.
I want to include the header (not the default one, but the headers that the IoT devices produce) of the CSV in the ingesting process. Below, I have attached a sample file so you can see the actual header that I mean.
The header I mean consists of the first 7 lines, each prefixed with "#": "Mode", "Operation", "CameraNo", "CudaStreamCameraNo", "SnapshotCamId", "CenterCamId", "AgronomistDose".

[attachment: image of the sample CSV file showing the header]

I moved your question to #logstash, as the community there can hopefully find a better answer.

Please don't post unformatted code, logs, or configuration as it's very hard to read.

Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting it. This makes it more likely that your question will receive a useful answer.

It would be great if you could update your post to solve this.

Hi @johnkary,

So your file is not a standard CSV file; only the #Body part is.
From what I understand, you want to include the header values in all the rows from the body.

I believe you won't be able to use the CSV plugin the way it is; perhaps you could ingest it as a file and perform grok matches to distinguish whether a line is a header line starting with # or a standard CSV line.

Wouldn't it be easier for you to create a script (using Python, perhaps) to perform some manipulation/normalization on these CSV files before Logstash ingests them?
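
Something along these lines could be a starting point. It's only a rough sketch: the exact file layout, the key/value shape of the "#" lines, and the function name are assumptions based on the sample above, so adjust it to the real format.

```python
import csv

def split_header_and_body(path):
    """Separate the '#'-prefixed header lines from the CSV body rows.

    Assumes header lines look roughly like '#Mode,NORMAL'; any other
    '#' line (e.g. a '#Body' marker) is simply skipped.
    """
    header = {}
    body_rows = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            if row[0].startswith("#"):
                if len(row) > 1:
                    # '#Mode,NORMAL' -> header['Mode'] = 'NORMAL'
                    header[row[0].lstrip("#")] = row[1]
                continue
            body_rows.append(row)
    return header, body_rows
```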

OK, well, suppose I created the Python script and split the CSV into two, one for its body and one for the header. Is the next step then to use two different indexes, one for the header and another one for the body? And if I do that, how can I correlate these two indexes (include the header values for all the rows from the body) in order to get full results?

Why not transform it into a single CSV file with the header information included in all the records?

E.g.

SystemId, Mode, ... , frameID, ...
00:04:4b:df:35:96, NORMAL, ... , 3, ...
00:04:4b:df:35:96, NORMAL, ... , 11, ...
00:04:4b:df:35:96, NORMAL, ... , 19, ...

With that format, you could use the CSV filter plugin.
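
Building on the split sketched earlier, a merge step along these lines could produce that flat format. This is a hypothetical sketch, and the column names in the usage comment are made up for illustration:

```python
import csv

def write_flat_csv(header, body_rows, body_columns, out_path):
    """Repeat the header values on every body row and write a single flat CSV,
    matching the 'SystemId, Mode, ..., frameID, ...' shape shown above."""
    header_keys = list(header)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # one combined column row, then one line per body record
        writer.writerow(header_keys + body_columns)
        for row in body_rows:
            writer.writerow([header[k] for k in header_keys] + row)

# Hypothetical usage:
# header, rows = split_header_and_body("device_log.csv")
# write_flat_csv(header, rows, ["frameID", "timestamp"], "device_log_flat.csv")
```

The flattened file can then go through the existing Logstash pipeline with the CSV filter, since every line is now a plain CSV record.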

I have already done this and it works fine for now, but I'm potentially going to have trouble with memory, as the actual CSVs that will come will have thousands or millions of records, and the header is going to have some extra fields, such as 'local_path', which is of 'longtext' type. So, if I put this header into all the records, I will have serious redundancy of information.

I have also seen approaches like the parent-child model, nested-object mapping, application-side joins, and data denormalization, but I am a bit confused about which solution best fits my case.
The crucial question here is: does any of the above four approaches eliminate the data redundancy that occurs with @oranieri's suggestion above? (A rough sketch of the parent-child option is included below.)
Any ideas?
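
For reference, this is roughly what the parent-child (join field) option could look like in Elasticsearch 6.x with the Python client. It's an untested sketch; the index name, field names, and document IDs are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a reachable default/local cluster

# Hypothetical index with a join field: one "header" parent per CSV file,
# many "row" children holding the body records.
es.indices.create(index="iot-logs", body={
    "mappings": {
        "_doc": {
            "properties": {
                "log_relation": {"type": "join", "relations": {"header": "row"}}
            }
        }
    }
})

# Parent document: the header values, stored once per file.
es.index(index="iot-logs", doc_type="_doc", id="file-1-header", body={
    "Mode": "NORMAL",
    # other header fields (local_path, AgronomistDose, ...) would go here
    "log_relation": {"name": "header"},
})

# Child document: one body row, routed to the same shard as its parent.
es.index(index="iot-logs", doc_type="_doc", routing="file-1-header", body={
    "frameID": 3,
    "log_relation": {"name": "row", "parent": "file-1-header"},
})

# Searching by a header keyword and getting the matching body rows back:
res = es.search(index="iot-logs", body={
    "query": {
        "has_parent": {
            "parent_type": "header",
            "query": {"match": {"Mode": "NORMAL"}},
        }
    }
})
```

With this model the header is stored only once per file, so the redundancy goes away, but join fields add memory and query-time overhead, and parents and children must live on the same shard (hence the routing).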

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.