Highlighting within fragments inconsistent


#1

I'm trying to use Elasticsearch to locate important text, and then run a second process that extracts information from the highlighted fragment.

The problem is that the text I need to extract usually comes right after the word I'm searching for. E.g., in the document "Numbers: 1 2 3 Colours: red green blue", I search for "colours" and the fragment I get back is "3 Colours".

With Elasticsearch, it seems the fragment often ends right after "colours".

I have tried changing the fragment size, but that only increases the text BEFORE my search term and adds none after it.
Is there a way to make sure that "colours" ends up in the middle of the fragment?
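
If the full field text is also available on the client (for instance from `_source`), one possible workaround is to cut the window yourself rather than rely on the fragment boundaries. The Python sketch below illustrates the idea only; the helper name `centered_window` and its `radius` parameter are hypothetical, and it is plain string handling, not an Elasticsearch feature.

```python
def centered_window(text, term, radius=50):
    """Return up to `radius` characters on each side of the first match.

    Hypothetical helper. Matching is a simple case-insensitive substring
    search, which is an assumption and not how Elasticsearch analyzes text.
    It also assumes lowercasing does not change the string length
    (true for ASCII text).
    """
    start = text.lower().find(term.lower())
    if start == -1:
        return None  # term not present in this text
    end = start + len(term)
    # Clamp the window to the bounds of the text.
    return text[max(0, start - radius):min(len(text), end + radius)]


doc = "Numbers: 1 2 3 Colours: red green blue"
print(centered_window(doc, "colours", radius=10))
```

With the example document above, the returned window contains text on both sides of "Colours", so the values after the label are available to the second process.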


(David Pilato) #2

Could you provide a full recreation script, as described in "About the Elasticsearch category"? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

Which version are you using?


#3

We are using Elasticsearch v6.3.2 with Python 3.

Here is the query I'm running:

query = {
    "query": {
        "bool": {
            "must": {
                "query_string": {
                    "default_field": "content",
                    "query": "\"paint colours\" OR \"pc\""
                }
            },
            # doc_name holds the _id of the document being searched
            "filter": {
                "term": {"_id": doc_name}
            }
        }
    },
    "highlight": {
        "pre_tags": ["<start>"],
        "post_tags": ["</end>"],
        "fields": {
            "content": {"number_of_fragments": 1000, "fragment_size": 500}
        }
    }
}

The idea is to locate the relevant data labelled "Paint Colours" using Elasticsearch, and then process the returned fragment to extract the data that appears after the label.

For example, if there is a document like:

...some text...

Word1
Word2
Paint Colours: Red, Green

...more text...

I would like to extract [Red, Green] out of this document.

The problem I am facing is that Elasticsearch returns highlighted fragments that look like this:

Word2
Paint Colours

And if I increase the fragment size, it looks something like this:

Word1
Word2
Paint Colours

But what I'm really looking for is to have the search term somewhere in the middle of the fragment, so that there is text AFTER the search term as well, like this:

Word2
Paint Colours: Red, Green

This would help extract the required text that sits next to the search term.
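
One option that sidesteps fragment boundaries altogether is to set `number_of_fragments` to 0, in which case Elasticsearch returns the whole field with the highlight tags inserted rather than cutting fragments. The text after the label can then be picked out on the client. The sketch below is a rough illustration under assumptions: the helper names `highlight_whole_field` and `text_after_highlights` are hypothetical, the tags are the ones from the query above, and it assumes the label and its values sit on the same line. Whether the two words of a phrase are tagged together or separately may depend on the highlighter, so the sketch uses the last closing tag on each line.

```python
# Tags copied from the query in the post above.
PRE_TAG = "<start>"
POST_TAG = "</end>"


def highlight_whole_field(field="content"):
    """Build a highlight section that asks for the entire field.

    Hypothetical helper: with number_of_fragments set to 0 the whole
    field content is returned with highlight tags, not fragments.
    """
    return {
        "pre_tags": [PRE_TAG],
        "post_tags": [POST_TAG],
        "fields": {field: {"number_of_fragments": 0}},
    }


def text_after_highlights(highlighted):
    """Return the remainder of each line that follows a highlighted label.

    Assumes the values of interest are on the same line as the label.
    """
    results = []
    for line in highlighted.splitlines():
        # rpartition splits on the LAST closing tag; tag is "" if absent.
        _, tag, tail = line.rpartition(POST_TAG)
        if tag:
            # Drop the separator between label and values, e.g. ": ".
            results.append(tail.lstrip(" :\t").rstrip())
    return results


sample = "Word1\nWord2\n<start>Paint</end> <start>Colours</end>: Red, Green\nmore text"
print(text_after_highlights(sample))
```

Returning the whole field makes the response larger for big documents, so this trades response size for predictable context around the match.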