Searching for Wikipedia quotes

bchapuis · October 2, 2015, 8:17pm

Hello,

I installed a Wikipedia index with stream2es and wondered whether it is possible to configure the index (with a similarity plugin?) so that it finds the Wikipedia pages quoted in a query.

A sample request would look as follows:

GET /my_index/page/_search?pretty=true
{
  "query" : {
    "match" : {
      "text" : "quote from page 1; quote from page 2; quote from page 3"
    }
  }
}

Ideally, the first three hits of the response should point to pages 1, 2 and 3. The quotes may contain several sentences. WDYT, is this feasible? Is there already a feature in Elasticsearch or Lucene that addresses such a requirement?

Thanks in advance for your help,

Bertil

warkolm · October 4, 2015, 12:17am

You really need to send 3 distinct queries here. How would ES or Lucene know that you expect this to span 3 pages?
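
For illustration, here is a minimal sketch of the multi-query approach. It assumes the quotes can be split on a known separator (the semicolons in the sample request above), which will not hold for arbitrary input. The helper name `build_msearch_body` is hypothetical. It only builds the newline-delimited body that the `_msearch` endpoint expects (an empty header line followed by a query line, once per quote), so the three searches can go out in a single round trip, e.g. to /my_index/page/_msearch.

```python
import json


def build_msearch_body(text, field="text", separator=";"):
    """Split `text` into quotes and build a newline-delimited _msearch body.

    Assumption: quotes are delimited by `separator`. Real notes may need a
    smarter way of detecting quote boundaries.
    """
    quotes = [q.strip() for q in text.split(separator) if q.strip()]
    lines = []
    for quote in quotes:
        # Empty header: the index and type are taken from the request URL.
        lines.append(json.dumps({}))
        # One match query per quote; only the best hit is requested.
        lines.append(json.dumps({"query": {"match": {field: quote}}, "size": 1}))
    # The _msearch body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body(
    "quote from page 1; quote from page 2; quote from page 3"
)
print(body)
```

Each response in the returned `responses` array then corresponds to one quote, in the same order as the queries were sent.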

bchapuis · October 5, 2015, 6:30am

Thank you for your answer. OK, that's what I thought. I started building a plugin with a special tokenizer that uses deduplication to address this use case, and wanted to check that I wasn't reinventing the wheel.

bchapuis · October 5, 2015, 10:49am

The problem I see with the solution you propose is that you must know the semantics of the input document in order to build several queries. This is not always possible, so my goal is to avoid this step and limit the number of queries as much as possible.

A use case would be something similar to what Evernote does with its context feature. I write a note that contains some text and quotes several external sources. What I'm looking for is a solution that can identify the sources of these quotes even if the note does not contain the necessary semantic information.

More specifically, I wonder whether some of the similarity metrics (TF/IDF, BM25, DFR, etc.) provided by Elasticsearch would help solve such a use case.
