Highlighting issue with wildcard query string query

drippel · March 10, 2016, 11:05am

I have an index where the default indexing analyzer is defined this way:

"default" : {
	"filter" : [
		"english_possessive_stemmer",
		"segment_worddelimiter",
		"lowercase",
		"english_stemmer",
		"english_common_grams_index",
		"english_phonetic_filter"
	],
	"char_filter" : [
		"html_strip"
	],
	"type" : "custom",
	"tokenizer" : "english_pattern_tokenizer"
}

The tokenizer is defined like this:

"english_pattern_tokenizer" : {
	"flags" : "UNICODE_CHAR_CLASS",
	"pattern" : "\\s+",
	"type" : "pattern",
	"group" : "-1"
}
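As a rough illustration of what this tokenizer does to the sample message used later in this post, here is a minimal Python sketch. It assumes that "group" : "-1" means the pattern is used to split the text (rather than to extract a match group), and the helper name pattern_tokenize is made up for the example.

```python
import re

# Assumption: with "group": "-1" the pattern acts as a separator,
# so tokens are the pieces of text between whitespace runs.
# Python 3 str patterns are Unicode-aware by default, which loosely
# corresponds to the UNICODE_CHAR_CLASS flag.
WHITESPACE = re.compile(r"\s+")


def pattern_tokenize(text):
    """Split text on whitespace runs and drop empty pieces (hypothetical helper)."""
    return [tok for tok in WHITESPACE.split(text) if tok]


message = "training with the stars in English does not interest them or us in the U.S."
tokens = pattern_tokenize(message)
print(tokens)
```

The token filters (stemming, lowercasing, common grams, phonetic codes) then run on these whitespace-separated pieces.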

The common grams filter is defined like this:

"english_common_grams_index" : {
	"ignore_case" : "true",
	"type" : "common_grams",
	"query_mode" : "false",
	"common_words" : [
		// here comes a long list of english stop-words
	]
}
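To make the effect of this filter concrete, here is a small Python sketch of index-mode common grams ("query_mode" : "false"): every token is kept, and a bigram joined with "_" is added at the position of the first word whenever either word of an adjacent pair is a common word. The stop-word list is elided above, so the COMMON_WORDS set below is only an assumption inferred from the analyze output posted later in this thread. The function name is hypothetical, and the phonetic filter is left out.

```python
# Assumption: inferred from the bigrams in the analyze output further down;
# the real "common_words" list is much longer and is not shown in the post.
COMMON_WORDS = {"with", "the", "in", "not", "them", "or", "us"}


def common_grams(tokens, common_words):
    """Return (term, position) pairs: all unigrams plus common-word bigrams."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((tok, pos))
        if pos + 1 < len(tokens):
            nxt = tokens[pos + 1]
            # "ignore_case": "true" -> compare in lowercase
            if tok.lower() in common_words or nxt.lower() in common_words:
                out.append((f"{tok}_{nxt}", pos))
    return out


# Tokens as they look after stemming and lowercasing (taken from the analyze output).
stemmed = ["train", "with", "the", "star", "in", "english", "doe", "not",
           "interest", "them", "or", "us", "in", "the", "u.s."]

grams = common_grams(stemmed, COMMON_WORDS)
for term, pos in grams:
    print(pos, term)
```

Under these assumptions, "star" and the bigram "star_in" both end up at position 3, while "english" and "doe" produce no bigram because neither is a common word.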

The phonetic filter is defined as:

"english_phonetic_filter" : {
	"max_code_len" : "4",
	"replace" : "false",
	"type" : "phonetic",
	"encoder" : "doublemetaphone"
}

I index the following document:

{
	"message" : "training with the stars in English does not interest them or us in the U.S."
}

Now I run the following search:

{
	"query" : {
		"bool" : {
			"should" : {
				"query_string" : {
					"query" : "sta*",
					"fields" : [
						"message"
					],
					"analyzer" : "standard"
				}
			}
		}
	},
	"highlight" : {
		"pre_tags" : [
			"<b>"
		],
		"post_tags" : [
			"</b>"
		],
		"fragment_size" : 0,
		"number_of_fragments" : 0,
		"require_field_match" : false,
		"fields" : {
			"message" : {}
		}
	}
}

I get the following hit:

{
	"_index" : "fts-english",
	"_type" : "Document",
	"_id" : "AVNfwtiy7ixxxIg9v3K6",
	"_score" : 1,
	"_source" : {
		"message" : "training with the stars in English does not interest them or us in the U.S."
	},
	"highlight" : {
		"message" : [
			"training with the <b>stars in</b> English does not interest them or us in the U.S."
		]
	}
}

The problem is that the highlight covers "stars in" instead of just "stars".

nik9000 · March 10, 2016, 12:08pm

Weird. It would help to see the output of the analyze API for that message. It would also help to know which highlighter you are using; for that we need to see the mapping.

drippel · March 13, 2016, 10:12am

Thanks for your message, nik9000.

The analyze output for the "default" indexing analyzer, as seen in the Inquisitor site plugin, is (each token is followed by its position, one position per line):

train 0 TRN 0 train_with 0 TRN0 0 TRNT 0
with 1 A0 1 FT 1 with_the 1 A00 1 FTT 1
the 2 0 2 T 2 the_star 2 0STR 2 TSTR 2
star 3 STR 3 star_in 3 STRN 3
in 4 AN 4 in_english 4 ANNK 4 ANNL 4
english 5 ANKL 5 ANLX 5
doe 6 T 6 doe_not 6 TNT 6
not 7 NT 7 not_interest 7 NTNT 7
interest 8 ANTR 8 interest_them 8 ANTR 8
them 9 0M 9 TM 9 them_or 9 0MR 9 TMR 9
or 10 AR 10 or_us 10 ARS 10
us 11 AS 11 us_in 11 ASN 11
in 12 AN 12 in_the 12 AN0 12 ANT 12
the 13 0 13 T 13 the_u.s. 13 0S 13 TS 13
u.s. 14 AS 14

The analyze output for the standard analyzer, which I use for search, is:

training 0 with 1 the 2 stars 3 in 4 english 5 does 6 not 7 interest 8 them 9 or 10 us 11 in 12 the 13 u.s 14
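One possible reading of these outputs, sketched in Python below: the wildcard sta* is compared against the index-time terms, and in the token stream produced by the "default" analyzer two terms start with "sta", namely "star" and the common-grams bigram "star_in". If the highlighter uses the offsets of the bigram, which presumably span both words, that would account for "stars in" being highlighted. This is only an interpretation of the posted output, not a confirmed diagnosis; fnmatchcase stands in for the wildcard match and the variable names are made up for the example.

```python
from fnmatch import fnmatchcase

# Index-time analyze output from above, as "token position" pairs.
raw = """
train 0 TRN 0 train_with 0 TRN0 0 TRNT 0
with 1 A0 1 FT 1 with_the 1 A00 1 FTT 1
the 2 0 2 T 2 the_star 2 0STR 2 TSTR 2
star 3 STR 3 star_in 3 STRN 3
in 4 AN 4 in_english 4 ANNK 4 ANNL 4
english 5 ANKL 5 ANLX 5
doe 6 T 6 doe_not 6 TNT 6
not 7 NT 7 not_interest 7 NTNT 7
interest 8 ANTR 8 interest_them 8 ANTR 8
them 9 0M 9 TM 9 them_or 9 0MR 9 TMR 9
or 10 AR 10 or_us 10 ARS 10
us 11 AS 11 us_in 11 ASN 11
in 12 AN 12 in_the 12 AN0 12 ANT 12
the 13 0 13 T 13 the_u.s. 13 0S 13 TS 13
u.s. 14 AS 14
"""

fields = raw.split()
# Pair every token with the position that follows it.
index_terms = [(tok, int(pos)) for tok, pos in zip(fields[0::2], fields[1::2])]

# Case-sensitive glob match as a stand-in for the wildcard term match.
matches = [(tok, pos) for tok, pos in index_terms if fnmatchcase(tok, "sta*")]
print(matches)
```

Both matching terms sit at position 3, so a highlighter that marks every matching token would mark the single word for "star" and the two-word span for "star_in".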

I use the default highlighter, and the mapping for this index is:

"mappings" : {
	"system" : {
		"properties" : {
			"ftsIndexVersion" : {
				"type" : "long"
			},
			"message" : {
				"type" : "string"
			}
		}
	},
	"Document" : {
		"dynamic_templates" : [{
				"not_analyzed_fields" : {
					"match_pattern" : "regex",
					"mapping" : {
						"include_in_all" : false,
						"index" : "not_analyzed",
						"type" : "string"
					},
					"match" : "(language|doc_id)"
				}
			}
		],
		"properties" : {
			"TweetPost__User URL" : {
				"type" : "string"
			},
			"TweetPost__Body" : {
				"type" : "string"
			},
			"streamId" : {
				"type" : "long"
			},
			"WordB" : {
				"type" : "string"
			},
			"WordC" : {
				"type" : "string"
			},
			"TweetPost__Full Name" : {
				"type" : "string"
			},
			"WordA" : {
				"type" : "string"
			},
			"language" : {
				"include_in_all" : false,
				"index" : "not_analyzed",
				"type" : "string"
			},
			"TweetPost__User ID" : {
				"type" : "string"
			},
			"TweetPost__Username" : {
				"type" : "string"
			},
			"message" : {
				"type" : "string"
			},
			"doc_id" : {
				"include_in_all" : false,
				"index" : "not_analyzed",
				"type" : "string"
			},
			"TweetPost__Profile Image URL" : {
				"type" : "string"
			},
			"TweetPost__Tweet ID" : {
				"type" : "string"
			},
			"TweetPost__Post Time (Label)" : {
				"type" : "string"
			},
			"postDate" : {
				"format" : "strict_date_optional_time||epoch_millis",
				"type" : "date"
			},
			"user" : {
				"type" : "string"
			},
			"TweetPost__Tweet URL" : {
				"type" : "string"
			}
		}
	}
}

drippel · March 30, 2016, 6:10am

Can anybody from Elastic look at this, please?
