I have an index where the default indexing analyzer is defined this way:
"default" : {
  "filter" : [
    "english_possessive_stemmer",
    "segment_worddelimiter",
    "lowercase",
    "english_stemmer",
    "english_common_grams_index",
    "english_phonetic_filter"
  ],
  "char_filter" : [
    "html_strip"
  ],
  "type" : "custom",
  "tokenizer" : "english_pattern_tokenizer"
}
where the tokenizer is defined like this:
"english_pattern_tokenizer" : {
  "flags" : "UNICODE_CHAR_CLASS",
  "pattern" : "\\s+",
  "type" : "pattern",
  "group" : "-1"
}
and the common grams filter is defined like this:
"english_common_grams_index" : {
  "ignore_case" : "true",
  "type" : "common_grams",
  "query_mode" : "false",
  "common_words" : [
    // here comes a long list of English stop-words
  ]
}
and the phonetic filter is defined as:
"english_phonetic_filter" : {
  "max_code_len" : "4",
  "replace" : "false",
  "type" : "phonetic",
  "encoder" : "doublemetaphone"
}
I index the following document:
{
  "message" : "training with the stars in English does not interest them or us in the U.S."
}
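To see which tokens the index analyzer emits for this text, the `_analyze` API can be called against the index. This is a minimal sketch, assuming the index is named `fts-english` (as in the hit further down) and that the Elasticsearch version in use accepts a JSON body with `field` and `text` on `_analyze`:

```json
GET /fts-english/_analyze
{
  "field" : "message",
  "text" : "training with the stars in English does not interest them or us in the U.S."
}
```

Each token in the response carries a `position`, a `start_offset` and an `end_offset`, so the offsets of any token starting with "sta" can be compared with the span that the highlighter marks.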
Now I run the following search:
{
  "query" : {
    "bool" : {
      "should" : {
        "query_string" : {
          "query" : "sta*",
          "fields" : [
            "message"
          ],
          "analyzer" : "standard"
        }
      }
    }
  },
  "highlight" : {
    "pre_tags" : [
      "<b>"
    ],
    "post_tags" : [
      "</b>"
    ],
    "fragment_size" : 0,
    "number_of_fragments" : 0,
    "require_field_match" : false,
    "fields" : {
      "message" : {}
    }
  }
}
and I get the following hit:
{
  "_index" : "fts-english",
  "_type" : "Document",
  "_id" : "AVNfwtiy7ixxxIg9v3K6",
  "_score" : 1,
  "_source" : {
    "message" : "training with the stars in English does not interest them or us in the U.S."
  },
  "highlight" : {
    "message" : [
      "training with the <b>stars in</b> English does not interest them or us in the U.S."
    ]
  }
}
The problem is that the highlight covers "stars in" instead of only "stars".