I'm new to Elasticsearch.
I have the following need: index tabular data (CSV) as whole documents, i.e. 1 CSV dataset = 1 document in ES.
As my datasets are quite big, I am considering indexing a pre-computed synthesis of the data we already have, roughly:
the list of columns,
for each column: its values (as strings), and the frequency of each value in the column (sketched below).
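To make this concrete, here is roughly the document shape I have in mind (a sketch only; the field names `dataset_id`, `columns`, `name`, `values` and `count` are placeholders I made up):

```python
# Rough shape of one synthesis document (one ES document per CSV dataset).
# All names and numbers here are invented examples.
synthesis_doc = {
    "dataset_id": "sales-2023",
    "columns": [
        {
            "name": "country",
            "values": [
                {"value": "France", "count": 1_200_000},
                {"value": "Spain", "count": 800_000},
            ],
        },
        {
            "name": "product",
            "values": [
                {"value": "widget", "count": 2_500_000},
            ],
        },
    ],
}
```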
1st question:
I guess I need to tweak the TF/IDF computation in ES: I need to tell it that the term frequency is not what it counts in the document itself, but a weight I supply alongside each term.
What's the right way to do this?
2nd question:
In the search results, I want to know which columns the matched terms belong to (for highlighting).
How can I achieve this?
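To illustrate what I mean, here is the kind of query I imagine, assuming the columns are mapped as a `nested` field with a `name` keyword and a `values` text field (names invented, elasticsearch-py 8.x style; I don't know if this is the right approach):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# With "columns" mapped as nested, inner_hits should say which nested
# column object matched, and highlighting should mark the matched term.
query = {
    "nested": {
        "path": "columns",
        "query": {"match": {"columns.values": "bar"}},
        "inner_hits": {
            "_source": ["columns.name"],
            "highlight": {"fields": {"columns.values": {}}},
        },
    }
}
resp = es.search(index="datasets", query=query)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["inner_hits"]["columns"]["hits"]["hits"])
```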
3rd question:
Do you see any caveats in this way of indexing data that I should pay attention to? Anything else I should customize at indexing or search time?
Yes, sort of.
I need to find a whole dataset (its ID) by terms that appear in it. Datasets are big, like 10M rows, and I don't care about individual rows.
So I don't actually need to feed ES the whole dataset, only a list of terms associated with the dataset ID.
The datasets are also preprocessed in Spark for other purposes, and I can easily precompute a synthesis of these files to feed into ES (see the sketch below).
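For instance, something like this PySpark sketch would compute the per-column value frequencies (the path is a placeholder; collect() is only viable because the number of distinct values per column is much smaller than the row count):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("hdfs:///path/to/dataset.csv", header=True)  # placeholder path

# One {name, values} entry per column, with per-value occurrence counts.
synthesis = []
for col in df.columns:
    counts = df.groupBy(col).count().collect()
    synthesis.append({
        "name": col,
        "values": [{"value": row[col], "count": row["count"]} for row in counts],
    })
```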
Well, if I have a column "foo" of 10M rows with the term "bar" in 5M of them, that term is dominant in this dataset.
But my data synthesis will be something like:
[ { "column": "foo", "values": [ { "value": "bar", "occurrences": 5M }, { "value": "baz", "occurrences": 1M }, ... ] }, { "column": ... }, ... ]
So this document now contains "bar" once, not 5M times. TF is killed if I do that, unless I can tell ES to take the occurrence counts into account.
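One idea I'm considering (tell me if it's the wrong track): the `rank_features` field type stores arbitrary term-to-weight pairs, and the `rank_feature` query scores by the stored weight instead of TF. A minimal sketch with elasticsearch-py 8.x; the index and field names are invented:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# A rank_features field holding term -> occurrence-count pairs, so
# scoring can reflect the real frequencies instead of TF = 1.
es.indices.create(index="datasets", mappings={
    "properties": {
        "dataset_id": {"type": "keyword"},
        "term_weights": {"type": "rank_features"},
    }
})
es.index(index="datasets", document={
    "dataset_id": "my-dataset",
    "term_weights": {"bar": 5_000_000, "baz": 1_000_000},
})

# rank_feature scores by the stored weight (with saturation by default).
resp = es.search(index="datasets", query={
    "rank_feature": {"field": "term_weights.bar"}
})
```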