Highlighting issue with wildcard query string query

I have an index where the default indexing analyzer is defined this way:

"default" : {
	"filter" : [
		"english_possessive_stemmer",
		"segment_worddelimiter",
		"lowercase",
		"english_stemmer",
		"english_common_grams_index",
		"english_phonetic_filter"
	],
	"char_filter" : [
		"html_strip"
	],
	"type" : "custom",
	"tokenizer" : "english_pattern_tokenizer"
}

where the tokenizer is defined like this:

"english_pattern_tokenizer" : {
	"flags" : "UNICODE_CHAR_CLASS",
	"pattern" : "\\s+",
	"type" : "pattern",
	"group" : "-1"
}

and the common grams filter is defined like this:

"english_common_grams_index" : {
	"ignore_case" : "true",
	"type" : "common_grams",
	"query_mode" : "false",
	"common_words" : [
		// here comes a long list of english stop-words
	]
}

and the phonetic filter is defined as:

"english_phonetic_filter" : {
	"max_code_len" : "4",
	"replace" : "false",
	"type" : "phonetic",
	"encoder" : "doublemetaphone"
}

I index the following document:

{
	"message" : "training with the stars in English does not interest them or us in the U.S."
}

Now I run the following search:

{
	"query" : {
		"bool" : {
			"should" : {
				"query_string" : {
					"query" : "sta*",
					"fields" : [
						"message"
					],
					"analyzer" : "standard"
				}
			}
		}
	},
	"highlight" : {
		"pre_tags" : [
			"<b>"
		],
		"post_tags" : [
			"</b>"
		],
		"fragment_size" : 0,
		"number_of_fragments" : 0,
		"require_field_match" : false,
		"fields" : {
			"message" : {}
		}
	}
}

and I get the following hit:

{
	"_index" : "fts-english",
	"_type" : "Document",
	"_id" : "AVNfwtiy7ixxxIg9v3K6",
	"_score" : 1,
	"_source" : {
		"message" : "training with the stars in English does not interest them or us in the U.S."
	},
	"highlight" : {
		"message" : [
			"training with the <b>stars in</b> English does not interest them or us in the U.S."
		]
	}
}

The problem is that the highlight catches "stars in" instead of only "stars".

Weird. It would be useful to see the output of the analyze API for that message. It would also help to know which highlighter you are using; for that we need to see the mapping.

Thanks for your message, nik9000.

The analyze output for the "default" indexing analyzer, as seen in the Inquisitor site plugin, is (each token is followed by its position, one position per line):

train 0 TRN 0 train_with 0 TRN0 0 TRNT 0
with 1 A0 1 FT 1 with_the 1 A00 1 FTT 1
the 2 0 2 T 2 the_star 2 0STR 2 TSTR 2
star 3 STR 3 star_in 3 STRN 3
in 4 AN 4 in_english 4 ANNK 4 ANNL 4
english 5 ANKL 5 ANLX 5
doe 6 T 6 doe_not 6 TNT 6
not 7 NT 7 not_interest 7 NTNT 7
interest 8 ANTR 8 interest_them 8 ANTR 8
them 9 0M 9 TM 9 them_or 9 0MR 9 TMR 9
or 10 AR 10 or_us 10 ARS 10
us 11 AS 11 us_in 11 ASN 11
in 12 AN 12 in_the 12 AN0 12 ANT 12
the 13 0 13 T 13 the_u.s. 13 0S 13 TS 13
u.s. 14 AS 14

The analyze output for the standard analyzer that I use for search is:

training 0 with 1 the 2 stars 3 in 4 english 5 does 6 not 7 interest 8 them 9 or 10 us 11 in 12 the 13 u.s 14
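
One possible reading of these two outputs, sketched in Python rather than taken from Elasticsearch itself: the index contains both "star" and the common-grams token "star_in" at position 3, and a prefix such as "sta" matches both. If the bigram token carries offsets that span both source words, highlighting it would cover "stars in". Everything below is a simplified stand-in: `crude_stem`, `index_terms`, `highlight` and the short `COMMON_WORDS` set are hypothetical helpers, the phonetic tokens are left out, and the real highlighter may behave differently.

```python
import re

MESSAGE = "training with the stars in English does not interest them or us in the U.S."

# Assumed subset of the stop-word list, chosen to reproduce the bigrams seen above.
COMMON_WORDS = {"with", "the", "in", "not", "them", "or", "us"}


def crude_stem(word):
    """Very rough stand-in for the English stemmer (an assumption, not the real algorithm)."""
    if len(word) > 3:
        if word.endswith("ing"):
            return word[:-3]
        if word.endswith("s"):
            return word[:-1]
    return word


def index_terms(text):
    """Return (term, start, end) for whitespace tokens plus common-grams style bigrams."""
    unigrams = [
        (crude_stem(m.group().lower()), m.start(), m.end())
        for m in re.finditer(r"\S+", text)
    ]
    terms = list(unigrams)
    for (a, a_start, _), (b, _, b_end) in zip(unigrams, unigrams[1:]):
        if a in COMMON_WORDS or b in COMMON_WORDS:
            # The bigram spans from the start of the first word to the end of the second.
            terms.append((f"{a}_{b}", a_start, b_end))
    return terms


def highlight(text, prefix):
    """Wrap every span whose term starts with `prefix` in <b> tags, merging overlaps."""
    spans = sorted((s, e) for term, s, e in index_terms(text) if term.startswith(prefix))
    merged = []
    for s, e in spans:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    out, last = [], 0
    for s, e in merged:
        out.append(text[last:s])
        out.append("<b>" + text[s:e] + "</b>")
        last = e
    out.append(text[last:])
    return "".join(out)


print(highlight(MESSAGE, "sta"))
```

Under these assumptions the sketch reproduces the highlight from the hit above, which suggests (but does not prove) that the "star_in" bigram is what the wildcard is matching.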

I use the default highlighter, and the mapping for this index is:

"mappings" : {
	"system" : {
		"properties" : {
			"ftsIndexVersion" : {
				"type" : "long"
			},
			"message" : {
				"type" : "string"
			}
		}
	},
	"Document" : {
		"dynamic_templates" : [{
				"not_analyzed_fields" : {
					"match_pattern" : "regex",
					"mapping" : {
						"include_in_all" : false,
						"index" : "not_analyzed",
						"type" : "string"
					},
					"match" : "(language|doc_id)"
				}
			}
		],
		"properties" : {
			"TweetPost__User URL" : {
				"type" : "string"
			},
			"TweetPost__Body" : {
				"type" : "string"
			},
			"streamId" : {
				"type" : "long"
			},
			"WordB" : {
				"type" : "string"
			},
			"WordC" : {
				"type" : "string"
			},
			"TweetPost__Full Name" : {
				"type" : "string"
			},
			"WordA" : {
				"type" : "string"
			},
			"language" : {
				"include_in_all" : false,
				"index" : "not_analyzed",
				"type" : "string"
			},
			"TweetPost__User ID" : {
				"type" : "string"
			},
			"TweetPost__Username" : {
				"type" : "string"
			},
			"message" : {
				"type" : "string"
			},
			"doc_id" : {
				"include_in_all" : false,
				"index" : "not_analyzed",
				"type" : "string"
			},
			"TweetPost__Profile Image URL" : {
				"type" : "string"
			},
			"TweetPost__Tweet ID" : {
				"type" : "string"
			},
			"TweetPost__Post Time (Label)" : {
				"type" : "string"
			},
			"postDate" : {
				"format" : "strict_date_optional_time||epoch_millis",
				"type" : "date"
			},
			"user" : {
				"type" : "string"
			},
			"TweetPost__Tweet URL" : {
				"type" : "string"
			}
		}
	}
}

Can anybody from Elastic look at this, please?