Best practices to index a preprocessed CSV

Hi searching folks,

I'm new to Elasticsearch.
I have the following need: index tabular data (CSV) as whole documents: 1 CSV dataset = 1 document in ES.

As my datasets are quite big, I'm considering indexing a precomputed synthesis of the data we already have, roughly:

  • the list of columns,
  • for each column: its values (strings), and the frequency of each value in the column.
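To illustrate, the synthesis described above could be precomputed like this (a minimal sketch in plain Python; the sample column and values are made up):

```python
import csv
from collections import Counter
from io import StringIO

def summarize_csv(fileobj):
    """Build the per-column synthesis: for each column, the distinct
    values and how often each one occurs."""
    reader = csv.DictReader(fileobj)
    counters = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            counters[name][value] += 1
    # Shape it as [{column, values: [{value, occurrences}, ...]}, ...]
    return [
        {
            "column": name,
            "values": [
                {"value": v, "occurrences": n}
                for v, n in counter.most_common()
            ],
        }
        for name, counter in counters.items()
    ]

# Tiny hypothetical dataset standing in for a 10M-row file
sample = StringIO("foo,other\nbar,x\nbar,y\nbaz,x\n")
synthesis = summarize_csv(sample)
```

The real precompute would run in Spark over the full files, but the output shape is the same.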

1st question:
I guess I need to tweak the TF/IDF computation in ES: the term frequency should not be what ES counts in the document, but a weight I supply.
What's the right way to do that?
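For instance (a hedged sketch, assuming ES 7+ and that I've read the docs correctly), the `rank_features` field type looks like it could carry precomputed weights instead of counted frequencies; the index, field, and term names here are all made up:

```python
# Sketch of a mapping using the rank_features field type, which stores
# per-term weights supplied at index time instead of counted frequencies.
# All names are hypothetical.
mapping = {
    "mappings": {
        "properties": {
            "dataset_id": {"type": "keyword"},
            "term_weights": {"type": "rank_features"},
        }
    }
}

# One document per dataset: each distinct term mapped to its occurrence count.
doc = {
    "dataset_id": "dataset-42",
    "term_weights": {"bar": 5_000_000, "baz": 1_000_000},
}

# A query that scores by the stored weight rather than classic TF:
query = {"query": {"rank_feature": {"field": "term_weights.bar"}}}
```

But I don't know if that is the idiomatic answer, or whether a custom similarity is the better route.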

2nd question:
In the search results, I want to know which columns the matched terms belong to (for highlighting).
How can I achieve this?
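For example, if each column were indexed as its own field, the highlight section of the response would already be keyed by field name (a sketch; the field names are invented):

```python
# Sketch: one field per column (e.g. "columns.foo", "columns.baz"), so the
# highlight block of each hit tells you which column a term matched in.
# All names are hypothetical.
search_body = {
    "query": {
        "multi_match": {
            "query": "bar",
            "fields": ["columns.*"],
        }
    },
    "highlight": {
        "fields": {"columns.*": {}},
    },
}

# Each hit's "highlight" dict in the response is keyed by field name,
# e.g. {"columns.foo": ["<em>bar</em>"]}, which identifies the column.
```

Is a field-per-column mapping like that reasonable, or should I be looking at nested documents instead?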

3rd question:
Do you see any caveats in this way of indexing data that I should pay attention to? Anything else I should customize at indexing or search time?

Thanks !

You mean the whole CSV file needs to be a single document in ES?


Yes, sort of.
I need to find a whole dataset (its ID) by terms that appear in it. Datasets are big, like 10M rows, and I don't care about individual rows.

So I don't actually need to feed ES the whole dataset, just a list of terms associated with this dataset ID.
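Concretely, the feed could be as small as one bulk action per dataset, something like this (a sketch; the index name and term list are made up):

```python
import json

def bulk_lines(index, dataset_id, terms):
    """Build the two NDJSON lines of one Elasticsearch bulk 'index'
    action: the action line, then the document (dataset ID + terms)."""
    action = {"index": {"_index": index, "_id": dataset_id}}
    source = {"dataset_id": dataset_id, "terms": terms}
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"

# Hypothetical dataset: a handful of terms standing in for millions of rows
payload = bulk_lines("datasets", "dataset-42", ["bar", "baz"])
# POST this payload to the _bulk endpoint
```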

The datasets are also preprocessed in Spark for other purposes, so I can easily precompute a synthesis of these files to feed ES.

Well, if I have a column "foo" of 10M rows, with the term "bar" in 5M of them, that term is preponderant in this dataset.
But my data synthesis will be something like:
[ { "column": "foo", "values": [ { "value": "bar", "occurrences": 5M }, { "value": "baz", "occurrences": 1M }, ... ] }, { "column": ... }, ... ]
So this document now contains "bar" 1 time, not 5M times. TF is killed if I do that, unless I tell ES to take the occurrence counts into account.

Am I correct?