Highlighting problem when _analyzer field specifies analyzer to be used


(arta) #1

I'm using the _analyzer field to specify which analyzer should be used for each document at index time.
I need this feature because I don't know which language a document is written in until I index it.
With this setup, the search response does not contain a "highlight" field.
Please let me know how I can get highlighting to work.


Here's my unsuccessful attempt:

$ curl -XPUT 'http://localhost:9200/test/'
{"ok":true,"acknowledged":true}
$ curl -XPUT 'http://localhost:9200/test/1/_mapping' -d '{"1":{"properties":{"content":{"type":"string","store":"yes","index":"analyzed"}}}}'
{"ok":true,"acknowledged":true}
$ curl -XPUT 'http://localhost:9200/test/1/1' -d '{"content":"I worked from home on the other day.","_analyzer":"english"}'
{"ok":true,"_index":"test","_type":"1","_id":"1","_version":1}
$ curl -XGET 'http://localhost:9200/test/1/_search?pretty=true' -d '{"query":{"text":{"content":{"query":"working","analyzer":"english"}}},"highlight":{"fields":{"content":{}}},"fields":["content","highlight"]}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "test",
      "_type" : "1",
      "_id" : "1",
      "_score" : 0.11506981,
      "fields" : {
        "content" : "I worked from home on the other day."
      }
    } ]
  }
}
As you can see, there is no "highlight" field in the result.


Here's the working case, where I specify the analyzer in the mapping explicitly:

$ curl -XPUT 'http://localhost:9200/test/'
{"ok":true,"acknowledged":true}

$ curl -XPUT 'http://localhost:9200/test/1/_mapping' -d '{"1":{"properties":{"content":{"type":"string","store":"yes","index":"analyzed","analyzer":"english"}}}}'
{"ok":true,"acknowledged":true}

$ curl -XPUT 'http://localhost:9200/test/1/1' -d '{"content":"I worked from home on the other day."}'
{"ok":true,"_index":"test","_type":"1","_id":"1","_version":1}

$ curl -XGET 'http://localhost:9200/test/1/_search?pretty=true' -d '{"query":{"text":{"content":"working"}},"highlight":{"fields":{"content":{}}},"fields":["content","highlight"]}'
{
  "took" : 53,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "test",
      "_type" : "1",
      "_id" : "1",
      "_score" : 0.11506981,
      "fields" : {
        "content" : "I worked from home on the other day."
      },
      "highlight" : {
        "content" : [ "I worked from home on the other day." ]
      }
    } ]
  }
}


(arta) #2

Sorry, forgot to mention.
I'm using 0.19.1.


(Shay Banon) #3

When you use the default highlighter, it reanalyzes the text of the field
(if the field is stored, it loads the stored value; if not, it loads the
_source and gets it from there). That analysis uses the analyzer defined in
the mappings, not the per-document analyzer chosen through _analyzer. Try the
term-vector-based highlighter by enabling term vectors in the mapping and see
if it helps, since in that case there is no need to analyze the text again.



(arta) #4

Thanks for the quick reply, Kimchy.
It worked fine by specifying "term_vector":"with_positions_offsets" in the mapping of this field.
I also tried "yes", "with_offsets" and "with_positions", but they did not work (I did not get the highlight field in the result).
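
For anyone following along, here is a minimal sketch of the sequence that worked, based on the failing attempt above with only the mapping changed. The ES variable, the run_example function name and the explicit _refresh call are my own additions for scripting convenience, not part of the original commands. I'm also assuming the index is created fresh, since term_vector generally cannot be changed on a field that is already mapped.

```shell
#!/bin/sh
# Assumption: a local node like the one used in this thread.
ES='http://localhost:9200'

# Same mapping as the failing attempt, plus term vectors with positions and offsets.
MAPPING='{"1":{"properties":{"content":{"type":"string","store":"yes","index":"analyzed","term_vector":"with_positions_offsets"}}}}'

# The analyzer is still chosen per document through _analyzer.
DOC='{"content":"I worked from home on the other day.","_analyzer":"english"}'

# Query body is unchanged from the failing attempt.
QUERY='{"query":{"text":{"content":{"query":"working","analyzer":"english"}}},"highlight":{"fields":{"content":{}}},"fields":["content","highlight"]}'

# Hypothetical wrapper so the steps can be run in one go.
run_example() {
  curl -XPUT "$ES/test/"                                    # create the index
  curl -XPUT "$ES/test/1/_mapping" -d "$MAPPING"            # mapping with term vectors
  curl -XPUT "$ES/test/1/1" -d "$DOC"                       # index the document
  curl -XPOST "$ES/test/_refresh"                           # make it searchable right away
  curl -XGET "$ES/test/1/_search?pretty=true" -d "$QUERY"   # search with highlighting
}
```

Calling run_example against a throwaway node runs the five steps in order. With this mapping I got the "highlight" field back in the response.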

Term vectors are said to improve highlighting performance at the cost of index size.
Roughly how large an increase would you expect in theory: 10%, or 100%?


(arta) #5

I ran a quick experiment to measure the difference in index size.
In my small environment (4000+ documents, average content size about 100KB),
"term_vector":"with_positions_offsets" made the index 2.3 times larger
than with "term_vector":"no" (the default).
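
A small sketch of how such a ratio could be computed from two measured sizes. The size_ratio helper is hypothetical, and the byte counts in the usage line are made-up placeholders, not my actual measurements; the sizes themselves would come from whatever you use to measure an index (for example the index status output or the disk usage of the data directory).

```shell
#!/bin/sh
# Hypothetical helper: ratio of two index sizes in bytes, to one decimal place.
# $1 = size with term vectors enabled, $2 = size without.
size_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}
```

Usage, with placeholder numbers: size_ratio 2300 1000 prints 2.3.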

