Fetching position of keyword in matched document


(Minh Hoang, Nguyen) #1

Hello there,

I'm trying to highlight a searched keyword in a matched document with some custom processing, so I need to know the position (or offset) of that keyword in the document. However, I found no documentation that clearly shows how to do that. I know that when "index_options" is set to "offsets" or "term_vector" is set to "with_positions_offsets", positions and offsets are generated and stored together with each token, but I don't know how to fetch those values.

Please give me some suggestions. Any help would be appreciated!


(Mark Harwood) #2

See the term vectors API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html


(Minh Hoang, Nguyen) #3

Thank you, Mark, for your support. It's useful, but it seems we need multiple steps to get that information:
Step 1: Get the term vectors of the document
Step 2: Filter the search keyword out of the result of step 1
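
The two steps above can be sketched in Python as follows. This is only an illustration, not an official client: the function name keyword_offsets and the hand-written sample_response are assumptions, shaped like the "term_vectors" section that the term vectors API returns when offsets are requested. The keywords must be passed in their analyzed form (for example lowercased by the standard analyzer).

```python
def keyword_offsets(termvectors_response, field, keywords):
    """Step 2: pick the (start_offset, end_offset) pairs of each keyword
    out of a term vectors response (hypothetical helper, not a library call)."""
    terms = (
        termvectors_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    found = {}
    for keyword in keywords:
        tokens = terms.get(keyword, {}).get("tokens", [])
        found[keyword] = [(t["start_offset"], t["end_offset"]) for t in tokens]
    return found


# Step 1 would be the HTTP call to the term vectors API. Here a hand-written
# sample for the document "Quick brown fox" stands in for the real response.
sample_response = {
    "_id": "1",
    "found": True,
    "term_vectors": {
        "text": {
            "terms": {
                "quick": {"term_freq": 1, "tokens": [
                    {"position": 0, "start_offset": 0, "end_offset": 5}]},
                "brown": {"term_freq": 1, "tokens": [
                    {"position": 1, "start_offset": 6, "end_offset": 11}]},
                "fox": {"term_freq": 1, "tokens": [
                    {"position": 2, "start_offset": 12, "end_offset": 15}]},
            }
        }
    },
}

print(keyword_offsets(sample_response, "text", ["brown", "fox"]))
# {'brown': [(6, 11)], 'fox': [(12, 15)]}
```

A keyword that does not occur in the field simply maps to an empty list.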

So, for a query like this:

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}
'

I would like to know whether the offsets can be returned in the search result itself (for example, in the highlight section), so that no extra step is needed. Since Elasticsearch can return highlighted text wrapped in pre and post tags, I assume it knows the text positions. For reference, the response currently looks like this:

{
  "took" : 113,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5063205,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : 0.5063205,
        "_source" : {
          "text" : "Quick brown fox"
        },
        "highlight" : {
          "text" : [
            "Quick <em>brown</em> <em>fox</em>"
          ]
        }
      }
    ]
  }
}

Thank you again Mark!


(Mark Harwood) #4

You can supply custom markup tags, e.g. instead of <em> you could have <somethingMyAppUnderstands>, but this won't return the offsets.
I expect the most likely solution would be to implement a custom highlighter plugin (see example), because these are given the resources needed to get hold of the query tokens and the document contents. With some custom code you could return the required output.
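
As a client-side workaround (not the plugin approach described above), offsets can sometimes be derived from the highlight markup itself. The sketch below is a rough illustration with a hypothetical helper, offsets_from_highlight. It assumes the tags never occur in the original text, and the offsets it returns are relative to the fragment with the tags removed, so they only match document offsets when the fragment covers the whole field (for example with "number_of_fragments" set to 0).

```python
import re


def offsets_from_highlight(fragment, pre_tag="<em>", post_tag="</em>"):
    """Return (start, end) character spans of the highlighted terms,
    measured in the fragment text with the tags stripped."""
    pattern = re.compile(
        re.escape(pre_tag) + "(.*?)" + re.escape(post_tag), re.DOTALL
    )
    spans = []
    removed = 0  # number of tag characters seen before the current match
    for match in pattern.finditer(fragment):
        start = match.start() - removed
        end = start + len(match.group(1))
        spans.append((start, end))
        removed += len(pre_tag) + len(post_tag)
    return spans


print(offsets_from_highlight("Quick <em>brown</em> <em>fox</em>"))
# [(6, 11), (12, 15)]
```

Choosing tags that cannot appear in real content, as suggested above with <somethingMyAppUnderstands>, makes this kind of parsing less fragile.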


(Minh Hoang, Nguyen) #5

Thanks, Mark, for mentioning the plugin. I'm going to try it. By the way, I found this issue: https://github.com/elastic/elasticsearch/issues/5736. It has been open for more than 3 years; I hope a future release will implement this helpful feature.


(Minh Hoang, Nguyen) #6

Finally, by using the search-highlighter plugin, I can get the offsets of the highlighted text, for example with this query:

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "_source": ["content"],
    "query" : {
        "match_phrase" : { "content" : "test keyword" }
    },
    "highlight" : {
        "pre_tags" : ["<b>"],
        "post_tags" : ["</b>"],
        "fields" : {
            "content" : {
                "fragment_size" : 30,
                "number_of_fragments" : 10,
                "type": "experimental",
                "options": {"return_offsets": true}
            }
        },
        "order" : "score"
    }
}
'

Thank you @Mark_Harwood for your help!

