I'm new to Elasticsearch.
I have the following need: index tabular data (CSV) as whole documents, i.e. 1 CSV dataset = 1 document in ES.
As my datasets are quite big, I am considering indexing a pre-computed synthesis of the data we already have, roughly:
the list of columns,
for each column: its values (as strings), and the frequency of each value in the column (sketched below).
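To make this concrete, here is roughly the document shape I have in mind (a sketch only; the field names `dataset_id`, `columns`, `name`, `values` and `count` are placeholders I made up):

```python
# Rough shape of one synthesis document (one ES document per CSV dataset).
# All names and numbers here are invented examples.
synthesis_doc = {
    "dataset_id": "sales-2023",
    "columns": [
        {
            "name": "country",
            "values": [
                {"value": "France", "count": 1_200_000},
                {"value": "Spain", "count": 800_000},
            ],
        },
        {
            "name": "product",
            "values": [
                {"value": "widget", "count": 2_500_000},
            ],
        },
    ],
}
```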
1st question:
I guess I need to tweak the TF/IDF computation in ES: I need to tell it that the term frequency is not what it counts in the document itself, but a weight I supply alongside each term.
What's the right way to do this?
2nd question:
In the search results, I want to know which columns the matched terms belong to (for highlighting).
How can I achieve this?
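To illustrate what I mean, here is the kind of query I imagine, assuming the columns are mapped as a `nested` field with a `name` keyword and a `values` text field (names invented, elasticsearch-py 8.x style; I don't know if this is the right approach):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# With "columns" mapped as nested, inner_hits should say which nested
# column object matched, and highlighting should mark the matched term.
query = {
    "nested": {
        "path": "columns",
        "query": {"match": {"columns.values": "bar"}},
        "inner_hits": {
            "_source": ["columns.name"],
            "highlight": {"fields": {"columns.values": {}}},
        },
    }
}
resp = es.search(index="datasets", query=query)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["inner_hits"]["columns"]["hits"]["hits"])
```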
3rd question:
Do you see any caveats in this way of indexing data that I should pay attention to? Anything else I should customize at indexing or search time?
Yes, sort of.
I need to find a whole dataset (its ID) by terms that appear in it. Datasets are big, like 10M rows, and I don't care about individual rows.
So I don't actually need to feed ES the whole dataset, only a list of terms associated with the dataset ID.
The datasets are also preprocessed in Spark for other purposes, and I can easily precompute a synthesis of these files to feed into ES (see the sketch below).
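For instance, something like this PySpark sketch would compute the per-column value frequencies (the path is a placeholder; collect() is only viable because the number of distinct values per column is much smaller than the row count):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("hdfs:///path/to/dataset.csv", header=True)  # placeholder path

# One {name, values} entry per column, with per-value occurrence counts.
synthesis = []
for col in df.columns:
    counts = df.groupBy(col).count().collect()
    synthesis.append({
        "name": col,
        "values": [{"value": row[col], "count": row["count"]} for row in counts],
    })
```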
Well, if I have a column "foo" of 10M rows with the term "bar" in 5M of them, that term is dominant in this dataset.
But my data synthesis will be something like:
[ { "column": "foo", "values": [ { "value": "bar", "occurrences": 5M }, { "value": "baz", "occurrences": 1M }, ... ] }, { "column": ... }, ... ]
So this document now contains "bar" once, not 5M times. TF is killed if I do that, unless I can tell ES to take the occurrence counts into account.
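One idea I'm considering (tell me if it's the wrong track): the `rank_features` field type stores arbitrary term-to-weight pairs, and the `rank_feature` query scores by the stored weight instead of TF. A minimal sketch with elasticsearch-py 8.x; the index and field names are invented:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# A rank_features field holding term -> occurrence-count pairs, so
# scoring can reflect the real frequencies instead of TF = 1.
es.indices.create(index="datasets", mappings={
    "properties": {
        "dataset_id": {"type": "keyword"},
        "term_weights": {"type": "rank_features"},
    }
})
es.index(index="datasets", document={
    "dataset_id": "my-dataset",
    "term_weights": {"bar": 5_000_000, "baz": 1_000_000},
})

# rank_feature scores by the stored weight (with saturation by default).
resp = es.search(index="datasets", query={
    "rank_feature": {"field": "term_weights.bar"}
})
```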