Proximity searches - sentences and paragraphs

(Anna Krupnik) #1

Hi there!

I have a problem with ElasticSearch - something that I don't quite understand how to approach.

We are trying to move from our old search engine (Hummingbird) to ElasticSearch and are at the exploration stage at the moment.

We found a way to reproduce almost all the queries that our old search engine supports, except for two:

  1. search within n sentences
  2. search within n paragraphs

This means that we want, for example, to find "clinton" and "trump" no more than n sentences (or paragraphs) apart.
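
To make the requirement concrete, here is a minimal plain-Python sketch of the semantics, independent of any search engine. The function names, the naive sentence splitter, and the convention that n = 0 means "same sentence" are all assumptions made for illustration, not Hummingbird's or ElasticSearch's actual behaviour.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Real text needs a proper sentence segmenter.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def within_n_sentences(text, term_a, term_b, n):
    """Return True if both terms occur at most n sentences apart.

    Assumed convention: n = 0 means both terms are in the same sentence.
    """
    # One set of lowercased words per sentence.
    sentences = [set(re.findall(r"\w+", s.lower())) for s in split_sentences(text)]
    hits_a = [i for i, words in enumerate(sentences) if term_a in words]
    hits_b = [i for i, words in enumerate(sentences) if term_b in words]
    return any(abs(i - j) <= n for i in hits_a for j in hits_b)


text = "Clinton spoke first. The crowd cheered. Trump replied later."
print(within_n_sentences(text, "clinton", "trump", 2))
print(within_n_sentences(text, "clinton", "trump", 1))
```

The same idea would apply to paragraphs by swapping the splitter for one that breaks on blank lines.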

At first, we wanted to insert additional tags into the text, e.g. "snt_o" and "snt_c" for sentences and "prg_o" and "prg_c" for paragraphs, and then use these tags in span queries. That actually worked.

Then we realized that by adding these tags we had created an even bigger problem: queries like "within n words" no longer worked correctly. For example, if a user wanted to find the words "clinton" and "trump" with no other words between them, they would not be able to find "clinton" at the end of one sentence and "trump" at the beginning of the next, because the tags now sit between the two words.
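
The effect can be illustrated with a small plain-Python sketch that mimics how tokens get their positions. The tokenizer below is a deliberately simplified stand-in (an assumption) and not the ElasticSearch analyzer; only the tag names "snt_o" and "snt_c" come from the description above.

```python
import re


def tokenize(text):
    # Plain word tokens, lowercased, no boundary tags.
    return re.findall(r"\w+", text.lower())


def tokenize_with_tags(text):
    # Same words, but each sentence is wrapped in snt_o ... snt_c tokens.
    tokens = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"\w+", sentence.lower())
        if words:
            tokens += ["snt_o"] + words + ["snt_c"]
    return tokens


def tokens_between(tokens, a, b):
    # Number of tokens strictly between the first occurrence of a and of b.
    return abs(tokens.index(b) - tokens.index(a)) - 1


text = "The winner was Clinton. Trump conceded."
plain_gap = tokens_between(tokenize(text), "clinton", "trump")
tagged_gap = tokens_between(tokenize_with_tags(text), "clinton", "trump")
print(plain_gap, tagged_gap)
```

Without tags the two words are adjacent. With tags, "snt_c" and "snt_o" occupy two positions between them, so a query that allows no intervening tokens would be expected to miss this pair.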

Then we had another idea: to put the sentences into a multivalued field, that is, to split each document into sentences and index it as a comma-separated set of sentences. That also worked, but not entirely. When we got to highlighting our results, it turned out that only the sentences that matched our query were highlighted, and the rest were left out no matter what the settings were. As a result, we are unable to reconstruct the whole document, which is essential for our solution.
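
As a rough sketch of what such a document might look like, the following Python builds a JSON body with one array entry per sentence. The field name "sentences" and the naive splitter are hypothetical, chosen only for illustration; the actual mapping and indexing calls are not shown.

```python
import json
import re


def to_sentence_document(text):
    # Split on sentence-final punctuation followed by whitespace (naive, assumed).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # "sentences" is a hypothetical field name holding one value per sentence.
    return {"sentences": sentences}


doc = to_sentence_document("Clinton spoke first. Trump replied later.")
body = json.dumps(doc)  # JSON body that would be sent to the index
rebuilt = " ".join(doc["sentences"])  # the text can be rebuilt from the full array
print(body)
print(rebuilt)
```

Rebuilding the text this way relies on having every value of the array, which is exactly what is missing when only the matching sentences come back in the highlight results.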

I would very much appreciate it if anyone could suggest how to deal with these problems, or tell me if what we want is impossible.

Thank you in advance!
