Searching for Wikipedia quotes


(Bertil Chapuis) #1

Hello,

I installed a Wikipedia index with stream2es and wondered whether it is possible to configure the index (similarity plugin?) in a way that allows finding the Wikipedia pages quoted in a query.

A sample request would look as follows:

GET /my_index/page/_search?pretty=true
{
  "query": {
    "match": {
      "text": "quote from page 1; quote from page 2; quote from page 3"
    }
  }
}

Ideally, the first three hits of the response should point to pages 1, 2 and 3. The quotes may contain several sentences. WDYT, is this feasible? Is there already a feature in Elasticsearch or Lucene that addresses such a requirement?

Thanks in advance for your help,

Bertil


(Mark Walkom) #2

You really need to send 3 distinct queries here. How does ES or Lucene know that you are expecting this to go across 3 pages?
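
A minimal sketch of what that could look like on the client side, assuming the quotes are separated by ";" as in the sample request. The helper name `build_msearch_body` and its parameters are hypothetical; it only builds the newline-delimited body for the `_msearch` endpoint (one header line plus one query line per quote) and does not send anything:

```python
import json


def build_msearch_body(text, index="my_index", field="text", sep=";"):
    """Split `text` on `sep` and build an NDJSON body for _msearch.

    Each non-empty quote becomes one header line and one query line.
    """
    quotes = [q.strip() for q in text.split(sep) if q.strip()]
    lines = []
    for quote in quotes:
        # Header line: which index this sub-search targets.
        lines.append(json.dumps({"index": index}))
        # Body line: a plain match query, keeping only the best hit.
        lines.append(json.dumps({"query": {"match": {field: quote}}, "size": 1}))
    # _msearch expects every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(build_msearch_body(
        "quote from page 1; quote from page 2; quote from page 3"
    ))
```

The resulting string could be sent as the body of POST /_msearch, which returns one response per quote in the same order.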


(Bertil Chapuis) #3

Thank you for your answer. OK, that's what I thought. I started building a plugin with a special tokenizer that uses deduplication to address such a use case, and wanted to check that I wasn't reinventing the wheel. :smile:


(Bertil Chapuis) #4

The problem I see with the solution you propose is that you must know the semantics of the input document in order to build several queries. This is not always possible, so my goal would be to avoid this step and limit the number of queries as much as possible.

A use case would be something similar to what Evernote does with its context feature. I write a note that contains some text and quotes several external sources. What I'm looking for is a solution that can identify the sources of these quotes even if the note does not contain the necessary semantic information.
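
To make the idea concrete, here is a rough sketch of one way this might be approached without knowing where the quotes start and end: cut the note into overlapping word windows, run a `match_phrase` query per window, and count which pages come back most often. This is an assumption on my side, not an existing Elasticsearch feature, and the names `word_windows`, `phrase_queries` and `tally` as well as the window sizes are hypothetical:

```python
from collections import Counter


def word_windows(text, size=8, step=4):
    """Return overlapping windows of up to `size` words, moving `step` words each time."""
    words = text.split()
    if not words:
        return []
    if len(words) <= size:
        return [" ".join(words)]
    # The upper bound guarantees the last window reaches the final word.
    return [
        " ".join(words[start:start + size])
        for start in range(0, len(words) - size + step, step)
    ]


def phrase_queries(text, field="text", size=8, step=4):
    """Build one match_phrase query body per window."""
    return [
        {"query": {"match_phrase": {field: window}}}
        for window in word_windows(text, size, step)
    ]


def tally(hit_ids_per_window):
    """Count in how many windows each page id appears, most frequent first."""
    votes = Counter()
    for ids in hit_ids_per_window:
        votes.update(set(ids))  # a page counts once per window
    return votes.most_common()
```

This still issues one query per window, so it reduces the need to understand the document rather than the number of queries.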

More specifically, I wondered whether some of the similarity models (TF/IDF, BM25, DFR, etc.) provided by Elasticsearch would help solve such a use case.
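
For experimenting with this, a small sketch of how a named BM25 similarity could be declared in the index settings and referenced from a field. The helper names and the similarity name `quote_bm25` are hypothetical, the `k1` and `b` values are just the usual BM25 defaults, and whether any of these models actually improves quote matching is exactly the open question:

```python
import json


def similarity_settings(name="quote_bm25", k1=1.2, b=0.75):
    """Index settings declaring a named BM25 similarity."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    name: {"type": "BM25", "k1": k1, "b": b}
                }
            }
        }
    }


def field_similarity(name="quote_bm25"):
    """Fragment to merge into the mapping of the field that should use it.

    The surrounding mapping layout depends on the Elasticsearch version,
    so only the `similarity` parameter is shown here.
    """
    return {"similarity": name}


if __name__ == "__main__":
    print(json.dumps(similarity_settings(), indent=2))
    print(json.dumps(field_similarity(), indent=2))
```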

