WordDelimiterTokenFilter used twice in same analyzer with different configurations causes issues


(Atul Bagga) #1

ES 5.4.1

Config-1
GenerateWordParts = true, // [wi-fi] ---> [wi,fi]
GenerateNumberParts = true, // [12-03] ---> [12,03]
CatenateWords = false, // [wi-fi] -/-> [wifi]
CatenateNumbers = false, // [12-03] -/-> [1203]
CatenateAll = false, // [wi-fi-12] -/-> [wifi12]
SplitOnCaseChange = false, // [WiFi] -/-> [Wi,Fi]
PreserveOriginal = false, // [wi-fi] -/-> [wi-fi,wi,fi]
SplitOnNumerics = false, // [j2ee] -/-> [j,2,ee]
StemEnglishPossessive = true // [Jack's] ---> [Jack]

Config-2
GenerateWordParts = true, // [wi-fi] ---> [wi,fi]
GenerateNumberParts = true, // [12-03] ---> [12,03]
CatenateWords = false, // [wi-fi] -/-> [wifi]
CatenateNumbers = false, // [12-03] -/-> [1203]
CatenateAll = false, // [wi-fi-12] -/-> [wifi12]
SplitOnCaseChange = true, // [WiFi] ---> [Wi,Fi]
PreserveOriginal = true, // [wi-fi] ---> [wi-fi,wi,fi]
SplitOnNumerics = true, // [j2ee] ---> [j,2,ee]
StemEnglishPossessive = true // [Jack's] ---> [Jack]

I am using both of these WordDelimiterFilter configurations in the same analyzer. Indexing text such as "AtulBagga.TestConfig" then fails with the error "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards". Is this an issue with WordDelimiterTokenFilter in Lucene?

Requirement: I want tokens such that "AtulBagga24.TestConfig" is searchable with all of the following keywords:
atul, bagga, atulbagga, atulbagga24, atulbagga24.testConfig, test, config, testconfig, 24

Is there a way to solve this if the above approach has known issues?


(Alan Woodward) #2

Rather than chaining WordDelimiterFilter (which will often break offsets, as you've found), can you instead use a char_filter to remove the period? Something like:
POST testindex/_analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.",
      "replacement": " "
    }
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter",
      "generate_word_parts": "true",
      "generate_number_parts": "true",
      "catenate_words": "true",
      "catenate_numbers": "false",
      "catenate_all": "false",
      "split_on_case_change": "true",
      "preserve_original": "true",
      "split_on_numerics": "true",
      "stem_english_possessive": "true"
    },
    "lowercase"
  ],
  "text": "AtulBagga24.TestProject"
}


(Atul Bagga) #3

Thanks a lot for the reply.
This won't work: it will not match "atulbagga24 testproject" because of position offsets.

I also think there is a genuine issue in WordDelimiterTokenFilter even WITHOUT chaining. I filed it on GitHub, but it was closed: [https://github.com/elastic/elasticsearch/issues/28439]


(Alan Woodward) #4

I think it will work if you use word_delimiter_graph instead of word_delimiter - the graph version also records positionLength, which is then used by query parsers to correctly construct phrase queries with gaps.
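
As a rough sketch of that suggestion, the _analyze request from the earlier reply could be rebuilt with only the filter type switched to word_delimiter_graph. The Python wrapper and the function name build_analyze_body are illustrative assumptions, not part of the thread; the printed JSON is what would be sent as the request body.

```python
import json


def build_analyze_body(text, filter_type="word_delimiter_graph"):
    """Build an _analyze body mirroring the earlier request (hypothetical helper)."""
    return {
        "char_filter": [
            # Replace the period with a space before tokenizing.
            {"type": "pattern_replace", "pattern": "\\.", "replacement": " "}
        ],
        "tokenizer": "standard",
        "filter": [
            {
                # Only this type differs from the earlier request.
                "type": filter_type,
                "generate_word_parts": True,
                "generate_number_parts": True,
                "catenate_words": True,
                "catenate_numbers": False,
                "catenate_all": False,
                "split_on_case_change": True,
                "preserve_original": True,
                "split_on_numerics": True,
                "stem_english_possessive": True,
            },
            "lowercase",
        ],
        "text": text,
    }


if __name__ == "__main__":
    print(json.dumps(build_analyze_body("AtulBagga24.TestProject"), indent=2))
```

Comparing the positions in the two responses (word_delimiter versus word_delimiter_graph) is one way to check whether the graph variant changes anything for a given input.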


(Atul Bagga) #5

Thanks! It doesn't seem to work even with the graph version, but I will try something with it.

Can you help me confirm that this is a genuine issue? [https://github.com/elastic/elasticsearch/issues/28439]


(Alan Woodward) #6

Hi @Atul_Bagga,

I think this is an issue in WordDelimiterFilter itself (so in lucene, rather than in ES). As currently implemented, if a token is broken on case change multiple times, catenate_words will then string all of those subtokens together, but it won't produce the intermediate tokens. For example, the token 'OneTwoThree' would produce 'One', 'Two', 'Three' and 'OneTwoThree', but not 'OneTwo' or 'TwoThree'.
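
The behaviour described above can be mimicked with a small model. This is a simplified sketch and not Lucene's implementation: it handles letters-only tokens, splits only at lower-to-upper case changes, and all function names are made up for illustration.

```python
import re


def split_on_case_change(token):
    """Split a letters-only token at lower-to-upper transitions (simplified)."""
    return re.findall(r"[A-Z][a-z]*|[a-z]+", token)


def emitted_tokens(token):
    """Tokens as described in the post: each part plus the full catenation."""
    parts = split_on_case_change(token)
    if len(parts) < 2:
        return parts
    return parts + ["".join(parts)]


def missing_intermediates(token):
    """Contiguous runs of two or more parts, excluding the full catenation."""
    parts = split_on_case_change(token)
    n = len(parts)
    return [
        "".join(parts[i:j])
        for i in range(n)
        for j in range(i + 2, n + 1)
        if j - i < n
    ]


if __name__ == "__main__":
    print(emitted_tokens("OneTwoThree"))
    print(missing_intermediates("OneTwoThree"))
```

In this model, a search for an intermediate run such as 'OneTwo' finds no token, which is the gap described above.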

In your example, is 'AtulBagga24.TestProject' a standalone field, or part of a larger run of text? There might be ways to combine WDF with shingles if it's standalone.

Stringing together WordDelimiterFilters will always be broken, it looks like, because neither WDF nor WDGF can consume token graphs, and they both produce graphs (correctly in the case of WDGF, broken ones in the case of WDF). Again, that's a lucene issue.


(Atul Bagga) #7

Thanks @AlanWoodward!

It is a standalone field. Yes, the shingle filter solves the problem partially: AtulBagga24 and TestProject are generated, but their positions are still not adjacent.

To work around this, I am indexing the same field with a second analyzer and searching on both fields.
Combining highlighting from the two fields is a bit of a pain, but I can live with that for now. I see there is already an open issue about combining highlights across fields in the unified highlighter for the multi-field approach.
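
One way such a two-analyzer setup might look is sketched below, assuming a multi-field mapping. The analyzer names name_whole and name_parts, the field name, and the helper functions are hypothetical; the thread does not show the actual mapping or query that was used.

```python
import json

# Hypothetical analyzer names; they would need to be defined in the index settings.
WHOLE_ANALYZER = "name_whole"
PARTS_ANALYZER = "name_parts"


def field_mapping():
    """Index one source value twice, once per analyzer, via a multi-field."""
    return {
        "type": "text",
        "analyzer": WHOLE_ANALYZER,
        "fields": {
            "parts": {"type": "text", "analyzer": PARTS_ANALYZER}
        },
    }


def search_body(field, keywords):
    """Query both the main field and its sub-field, highlighting each separately."""
    sub_field = field + ".parts"
    return {
        "query": {
            "multi_match": {"query": keywords, "fields": [field, sub_field]}
        },
        # Highlights come back per field, so the caller has to merge them.
        "highlight": {"fields": {field: {}, sub_field: {}}},
    }


if __name__ == "__main__":
    print(json.dumps(field_mapping(), indent=2))
    print(json.dumps(search_body("name", "atulbagga24 testproject"), indent=2))
```

The per-field highlight section is where the merging effort mentioned above shows up: each field returns its own fragments.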

