I built a Wikipedia index with stream2es and wondered whether the index can be configured (with a similarity plugin?) so that a query finds the Wikipedia pages it quotes.
A sample request would look as follows:
GET /my_index/page/_search?pretty=true
{
  "query": {
    "match": {
      "text": "quote from page 1; quote from page 2; quote from page 3"
    }
  }
}
Ideally, the first three hits of the response should point to pages 1, 2 and 3. The quotes may contain several sentences. What do you think, is this feasible? Is there already a feature in Elasticsearch or Lucene that addresses such a requirement?
Thank you for your answer. OK, that's what I thought. I started building a plugin with a special tokenizer that uses deduplication to address this use case, and wanted to check that I wasn't reinventing the wheel.
The problem I see with the solution you propose is that you must know the semantics of the input document in order to build several queries. This is not always possible, so my goal is to avoid this step and limit the number of queries as much as possible.
A use case would be something similar to what Evernote does with its Context feature. I write a note that contains some text and quotes several external sources. What I'm looking for is a solution that can identify the sources of these quotes even if the note does not contain the necessary semantic information.
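To make the goal concrete, here is a rough sketch of one way to query without knowing where the quotes start or end: cut the note into overlapping word windows and send them as phrase clauses of a single bool query. This is only an illustration under assumptions, not an existing Elasticsearch feature. The helper names, the window size, the step and the `text` field are all hypothetical choices; only the `bool`/`should` and `match_phrase` query types come from the query DSL.

```python
import json


def window_phrases(text, size=5, step=3):
    """Cut text into overlapping windows of `size` words, moving `step` words each time.

    No knowledge of quote boundaries is needed. Trailing words that do not
    fill a complete window are dropped; a text shorter than one window is
    returned as a single phrase.
    """
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    return [
        " ".join(words[i:i + size])
        for i in range(0, len(words) - size + 1, step)
    ]


def build_quote_query(text, field="text", size=5, step=3, slop=0):
    """Build ONE bool query with a match_phrase clause per window.

    Assumption: pages containing many of the windows verbatim will
    accumulate more matching `should` clauses and therefore rank higher.
    """
    clauses = [
        {"match_phrase": {field: {"query": phrase, "slop": slop}}}
        for phrase in window_phrases(text, size, step)
    ]
    return {"query": {"bool": {"should": clauses}}}


if __name__ == "__main__":
    note = "my own words then a quote from page 1 and later a quote from page 2"
    # The printed body could be sent to /my_index/page/_search
    print(json.dumps(build_quote_query(note), indent=2))
```

The trade-off is that the number of clauses grows with the length of the note, so the window size and step would need tuning, but the whole note still goes out as one request.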
More specifically, I wonder whether some of the similarity models (TF/IDF, BM25, DFR, etc.) provided by Elasticsearch would help solve such a use case.
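For reference, this is roughly how a non-default similarity would be attached to the `text` field when creating the index. It is a sketch under assumptions: the similarity name `quote_bm25` is made up, the `page` mapping type and the `string` field type assume an older Elasticsearch release that still has mapping types (later versions use `text` and drop the type level), and `k1`/`b` are simply the usual BM25 defaults. Note that a similarity only changes how matching documents are scored, not which documents match, so it is unclear whether this alone would be enough.

```python
import json


def build_index_body(sim_name="quote_bm25", type_name="page",
                     field_type="string", k1=1.2, b=0.75):
    """Build an index-creation body that scores the `text` field with BM25.

    `sim_name`, `type_name` and `field_type` are illustrative assumptions;
    adapt them to the Elasticsearch version actually in use.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    # Named similarity that fields can refer to
                    sim_name: {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            type_name: {
                "properties": {
                    # The field searched in the sample request
                    "text": {"type": field_type, "similarity": sim_name}
                }
            }
        },
    }


if __name__ == "__main__":
    # The printed body could be sent with PUT /my_index when creating the index
    print(json.dumps(build_index_body(), indent=2))
```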