Get position / offset of word in search query response


#1

Hi,

I have started using ES and I am enjoying its features.
I'm a beginner and I have a few use cases that I hope ES can handle!

Hope you can help me with my questions.
First, some context.
I have a document indexed with a title. When I defined my mappings, I also added a sub-field for the title that uses the French analyzer, like this:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "stem": {
              "type": "text",
              "analyzer": "french"
            }
          }
        }
      }
    }
  }
}

Let's put some content:

PUT my_index/my_type/1
{
  "title": "Salut, ceci est un exemple en francais les amis."
}

Now I can easily search on my title or my title.stem:

GET my_index/my_type/_search
{
  "query": {
    "match": {
      "title.stem": "exempl"
    }
  }
}

Most of the time I will have to search for a particular word in my title. ES can tell me whether it is present in the document but (correct me if I'm wrong) cannot say where exactly in the title it occurs (begin_offset / end_offset).
Here are my needs:

  • I want ES to tell me whether the word "exempl" is present in the document AND where exactly (position, offset). I know it is possible to get this information via term vectors (search for the word in the title; if it is present, call the term vectors API on the document and retrieve the position/offset of the word by looping over the returned terms), but is it possible to get this information directly in one query?
  • I also need to return two versions of the title: the original and the "analyzed" one. I have the original title in the document _source. How can I get the "French analyzed" version? I know about the _analyze API, which can give you the analyzed version of a given string, but I would prefer to have it stored in my document (with the original positions and offsets).
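
For reference, the second step of the two-request approach from the first point (looping over the term vectors result) could look roughly like the Python sketch below. It only works on a response that has already been fetched; `find_term_occurrences` is a hypothetical helper name, and `sample_response` is a hand-written, trimmed-down example shaped like a term vectors response, not real output from my cluster.

```python
def find_term_occurrences(response, field, term):
    """Return (position, start_offset, end_offset) tuples for `term` in `field`.

    `response` is expected to be shaped like a term vectors response:
    term_vectors -> <field> -> terms -> <term> -> tokens.
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    tokens = terms.get(term, {}).get("tokens", [])
    return [(t["position"], t["start_offset"], t["end_offset"]) for t in tokens]


title = "Salut, ceci est un exemple en francais les amis."

# Hand-written sample (assumption): only the parts of the response that the
# helper reads are included, and the values are illustrative.
sample_response = {
    "_id": "1",
    "found": True,
    "term_vectors": {
        "title.stem": {
            "terms": {
                "exempl": {
                    "term_freq": 1,
                    "tokens": [
                        {"position": 4, "start_offset": 19, "end_offset": 26}
                    ],
                }
            }
        }
    },
}

for position, start, end in find_term_occurrences(sample_response, "title.stem", "exempl"):
    # The offsets point into the original title, so slicing gives the surface form.
    print(position, start, end, title[start:end])
```

The offsets refer to the original text, so slicing the _source title with them gives back the unstemmed word ("exemple" for the sample above).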

I don't know if all this is easily possible with ES or via some plugin, but I definitely need some advice on this.
Thanks for your help.


(Adrien Grand) #2

It is not possible to do everything in one request, but note that you can use _mtermvectors to get term vectors for all documents you are interested in in one go.
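
As a rough illustration of that second request, the sketch below collects the IDs from a search response and builds a body for `_mtermvectors`. The helper name `build_mtermvectors_body` and the sample search response are made up for the example; the `ids` / `parameters` layout is the multi term vectors request format as I understand it, so check it against the documentation for your version.

```python
import json


def build_mtermvectors_body(search_response, field):
    """Build an _mtermvectors request body for every hit of a search response."""
    ids = [hit["_id"] for hit in search_response["hits"]["hits"]]
    return {
        "ids": ids,
        "parameters": {
            "fields": [field],
            "positions": True,
            "offsets": True,
            # Statistics are not needed just to locate a word.
            "term_statistics": False,
        },
    }


# Hand-written sample (assumption): only the part of a search response
# that the helper reads.
sample_search_response = {"hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}

body = build_mtermvectors_body(sample_search_response, "title.stem")
print(json.dumps(body, indent=2))
```

The resulting body would be sent to something like POST my_index/my_type/_mtermvectors, so the whole lookup takes two requests regardless of the number of hits.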

Elasticsearch does not store the analyzed version of a document anywhere; it is only used to build the inverted index. If you want to be able to get the analyzed version of your documents, you would need to call the _analyze API to post-process your results, or directly store the result of the call to the _analyze API in your documents at index time.
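
A minimal sketch of the index-time variant, assuming you have already called `_analyze` with the french analyzer on the title: it copies the tokens into the document before indexing. The field names `title_analyzed` and `title_tokens` are hypothetical, and the sample response is hand-written and truncated to two tokens, so the token values are illustrative only.

```python
def analyzed_tokens(analyze_response):
    """Keep only token text, position and offsets from an _analyze response."""
    return [
        {
            "token": t["token"],
            "position": t["position"],
            "start_offset": t["start_offset"],
            "end_offset": t["end_offset"],
        }
        for t in analyze_response["tokens"]
    ]


def build_document(title, analyze_response):
    """Return the document to index: original title plus its analyzed form."""
    tokens = analyzed_tokens(analyze_response)
    return {
        "title": title,
        # Hypothetical field names, chosen for this example only.
        "title_analyzed": " ".join(t["token"] for t in tokens),
        "title_tokens": tokens,
    }


# Hand-written sample (assumption), shaped like an _analyze response.
sample_analyze_response = {
    "tokens": [
        {"token": "salut", "start_offset": 0, "end_offset": 5,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "exempl", "start_offset": 19, "end_offset": 26,
         "type": "<ALPHANUM>", "position": 4},
    ]
}

doc = build_document(
    "Salut, ceci est un exemple en francais les amis.",
    sample_analyze_response,
)
print(doc["title_analyzed"])
```

If the stored tokens are only meant to be returned and never searched, it may be worth excluding them from indexing in the mapping.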


#3

Thanks for your reply.
I was afraid that this would be the only solution.
If I'm correct, I can index the title with the "with_positions_offsets" term_vector option to speed up the term vectors operation, right?
Does this also work for the analyzed title?


(Adrien Grand) #4

Correct. However, if your title field is very short (which is usually the case for titles), storing term vectors will cost storage and not help much. For your analyzed title, it will work out of the box if you store term vectors. If you want to compute term vectors for title.stem on the fly instead, I think you will need to request the title field (not title.stem) and pass the analyzer of the title.stem field.
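
To make the on-the-fly variant concrete, here is a hedged sketch of a request body that asks for term vectors on title while overriding its analyzer through `per_field_analyzer`. The helper name `build_termvectors_body` is hypothetical, and I have not run this against the mapping above, so treat it as a starting point.

```python
import json


def build_termvectors_body(source_field, analyzer):
    """Build a term vectors request body that re-analyzes `source_field`
    on the fly with `analyzer` instead of the field's own analyzer."""
    return {
        "fields": [source_field],
        "positions": True,
        "offsets": True,
        "term_statistics": False,
        "field_statistics": False,
        "per_field_analyzer": {source_field: analyzer},
    }


# Request the plain title field, analyzed the way title.stem is analyzed.
body = build_termvectors_body("title", "french")
print(json.dumps(body, indent=2))
```

The body would go to something like GET my_index/my_type/1/_termvectors. The terms then come back under the title key of the response, but in their stemmed form.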

