WordDelimiterTokenFilter used twice in same analyzer with different configurations causes issues

Atul_Bagga · January 19, 2018, 3:00pm

ES 5.4.1

Config-1
GenerateWordParts = true, // [wi-fi] ---> [wi,fi]
GenerateNumberParts = true, // [12-03] ---> [12,03]
CatenateWords = false, // [wi-fi] -/-> [wifi]
CatenateNumbers = false, // [12-03] -/-> [1203]
CatenateAll = false, // [wi-fi-12] -/-> [wifi12]
SplitOnCaseChange = false, // [WiFi] -/-> [Wi,Fi]
PreserveOriginal = false, // [wi-fi] -/-> [wi-fi,wi,fi]
SplitOnNumerics = false, // [j2ee] -/-> [j,2,ee]
StemEnglishPossessive = true // [Jack's] ---> [Jack]

Config-2
GenerateWordParts = true, // [wi-fi] ---> [wi,fi]
GenerateNumberParts = true, // [12-03] ---> [12,03]
CatenateWords = false, // [wi-fi] -/-> [wifi]
CatenateNumbers = false, // [12-03] -/-> [1203]
CatenateAll = false, // [wi-fi-12] -/-> [wifi12]
SplitOnCaseChange = true, // [WiFi] ---> [Wi,Fi]
PreserveOriginal = true, // [wi-fi] ---> [wi-fi,wi,fi]
SplitOnNumerics = true, // [j2ee] ---> [j,2,ee]
StemEnglishPossessive = true // [Jack's] ---> [Jack]

I am using both of these WordDelimiterFilter configs in the same analyzer. Indexing text like "AtulBagga.TestConfig" then fails with the error "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards". Is this an issue with WordDelimiterTokenFilter in Lucene?

Requirement: I want tokens such that "AtulBagga24.TestConfig" is searchable with all of the following keywords:
atul, bagga, atulbagga, atulbagga24, atulbagga24.testconfig, test, config, testconfig, 24

Is there another way to achieve this if the approach above has known issues?

AlanWoodward · January 31, 2018, 9:20am

Rather than chaining WordDelimiterFilter (which will often break offsets, as you've found), can you instead use a char_filter to remove the period? Something like:
POST testindex/_analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.",
      "replacement": " "
    }
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter",
      "generate_word_parts": "true",
      "generate_number_parts": "true",
      "catenate_words": "true",
      "catenate_numbers": "false",
      "catenate_all": "false",
      "split_on_case_change": "true",
      "preserve_original": "true",
      "split_on_numerics": "true",
      "stem_english_possessive": "true"
    },
    "lowercase"
  ],
  "text": "AtulBagga24.TestProject"
}

Atul_Bagga · February 1, 2018, 7:27am

Thanks a lot for the reply.
This won't work: it will not match "atulbagga24 testproject" because of the position offsets.

I also think there is a genuine issue in WordDelimiterTokenFilter even WITHOUT chaining. I filed it on GitHub, but it was closed: https://github.com/elastic/elasticsearch/issues/28439

AlanWoodward · February 1, 2018, 10:49am

I think it will work if you use word_delimiter_graph instead of word_delimiter - the graph version also records positionLength, which is then used by query parsers to correctly construct phrase queries with gaps.
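For reference, here is a minimal Python sketch of the same _analyze request body with the filter type switched to word_delimiter_graph. The helper name build_analyze_request is hypothetical, and the sketch only builds and prints the JSON body. It does not send the request or check that the resulting tokens satisfy the requirement.

```python
import json


def build_analyze_request(text, filter_type="word_delimiter_graph"):
    """Build a body for POST <index>/_analyze that mirrors the earlier request.

    Only the token filter type differs from the word_delimiter version.
    """
    return {
        "char_filter": [
            # Replace each literal period with a space before tokenizing.
            {"type": "pattern_replace", "pattern": "\\.", "replacement": " "}
        ],
        "tokenizer": "standard",
        "filter": [
            {
                "type": filter_type,
                "generate_word_parts": True,
                "generate_number_parts": True,
                "catenate_words": True,
                "catenate_numbers": False,
                "catenate_all": False,
                "split_on_case_change": True,
                "preserve_original": True,
                "split_on_numerics": True,
                "stem_english_possessive": True,
            },
            "lowercase",
        ],
        "text": text,
    }


body = build_analyze_request("AtulBagga24.TestProject")
print(json.dumps(body, indent=2))
```

The printed body can be pasted into the console after a POST testindex/_analyze line to compare the positions reported by the two filter types.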

Atul_Bagga · February 1, 2018, 2:54pm

Thanks! It doesn't seem to work even with the graph version, but I will try something with it.

Can you help me confirm that this is a genuine issue? https://github.com/elastic/elasticsearch/issues/28439

AlanWoodward · February 15, 2018, 2:36pm

Hi @Atul_Bagga,

I think this is an issue in WordDelimiterFilter itself (so in Lucene, rather than in ES). As currently implemented, if a token is broken on case change multiple times, catenate_words will then string all of those subtokens together, but it won't produce the intermediate tokens. For example, the token 'OneTwoThree' would produce 'One', 'Two', 'Three' and 'OneTwoThree', but not 'OneTwo' or 'TwoThree'.
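The gap described above can be illustrated with a small Python model. This is a simplified sketch, not Lucene's implementation: it handles only ASCII letters, and the function names are hypothetical. It compares the tokens described above (the parts plus one full catenation) with every contiguous run of parts.

```python
import re


def split_on_case_change(token):
    """Split a letters-only token at lower-to-upper case transitions."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+", token)


def wdf_like_tokens(token):
    """Model the described behaviour: each part, plus one catenation of all parts."""
    parts = split_on_case_change(token)
    tokens = list(parts)
    if len(parts) > 1:
        tokens.append("".join(parts))  # catenate_words joins the whole run only
    return tokens


def all_contiguous_runs(token):
    """Every catenation of adjacent parts, including the intermediate ones."""
    parts = split_on_case_change(token)
    return [
        "".join(parts[i:j])
        for i in range(len(parts))
        for j in range(i + 1, len(parts) + 1)
    ]


emitted = wdf_like_tokens("OneTwoThree")
missing = [t for t in all_contiguous_runs("OneTwoThree") if t not in emitted]
print(emitted)  # ['One', 'Two', 'Three', 'OneTwoThree']
print(missing)  # ['OneTwo', 'TwoThree']
```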

In your example, is 'AtulBagga24.TestProject' a standalone field, or part of a larger run of text? There might be ways to combine WDF with shingles if it's standalone.

Stringing together WordDelimiterFilters will always be broken, it looks like, because neither WDF nor WDGF can consume token graphs, and they both produce graphs (correctly in the case of WDGF, broken ones in the case of WDF). Again, that's a Lucene issue.

Atul_Bagga · February 21, 2018, 8:55am

Thanks @AlanWoodward!

It is a standalone field. Yes, the shingle filter solves the problem partially: AtulBagga24 and TestProject are generated, but their positions are still not adjacent.

To work around this, I am indexing the same field with two different analyzers and searching on both fields.
Combining the highlighting from two fields is a bit of a pain in that case, but I can live with that for now. I see there is already an open issue about combining highlights in the unified highlighter for the multi-field approach.
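A rough Python sketch of that multi-field workaround follows. Every name in it is an assumption for illustration: the field name "name", the type name "doc", the sub-field "whole", and the analyzers "split_analyzer" and "whole_analyzer", which would need to be defined in the index settings. It only builds the request bodies and does not reflect the exact setup used here.

```python
import json

FIELD = "name"  # hypothetical field name


def build_mapping(field=FIELD):
    """Mapping with one text field and a sub-field using a second analyzer."""
    return {
        "mappings": {
            "doc": {  # hypothetical type name; ES 5.x mappings are keyed by type
                "properties": {
                    field: {
                        "type": "text",
                        "analyzer": "split_analyzer",  # assumed custom analyzer
                        "fields": {
                            "whole": {
                                "type": "text",
                                "analyzer": "whole_analyzer",  # assumed custom analyzer
                            }
                        },
                    }
                }
            }
        }
    }


def build_search(query_text, field=FIELD):
    """Search both variants of the field and request highlights for each."""
    fields = [field, field + ".whole"]
    return {
        "query": {"multi_match": {"query": query_text, "fields": fields}},
        "highlight": {"fields": {name: {} for name in fields}},
    }


print(json.dumps(build_mapping(), indent=2))
print(json.dumps(build_search("atulbagga24"), indent=2))
```

Highlights come back separately for each field, so the client still has to merge them, which is the pain point mentioned above.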

system · March 21, 2018, 8:55am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
