Hi!
When combining the word_delimiter_graph and shingle token filters, I do not get the results I expect.
Consider the following use case:
DELETE test
PUT test
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "delimited_shingles": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter",
            "lowercase",
            "shingle"
          ]
        }
      },
      "filter": {
        "delimiter": {
          "type": "word_delimiter_graph",
          "type_table": [
            "$ => DIGIT",
            "& => DIGIT",
            "% => DIGIT",
            "\\u002C => DIGIT",
            "\\u002B => DIGIT"
          ],
          "stem_english_possessive": "false",
          "catenate_all": "true",
          "catenate_words": "true",
          "preserve_original": "true",
          "generate_word_parts": "false",
          "split_on_numerics": "true",
          "split_on_case_change": "false"
        },
        "shingle": {
          "type": "shingle",
          "max_shingle_size": "2",
          "min_shingle_size": "2",
          "token_separator": "_",
          "output_unigrams": "false"
        }
      }
    }
  }
}
GET test/_analyze
{
  "explain": true,
  "analyzer": "delimited_shingles",
  "text": [
    "grey's anatomy"
  ]
}
# Expected shingles:
#   greys_anatomy
#   grey's_anatomy
# Actual shingles:
#   greys_grey's
#   greys_anatomy
Based on the start_offset, end_offset, position, and position_length values in the explain output, I would expect the shingle filter to build shingles from the correct output of the delimiter, instead of combining tokens that sit at the same position.
Is this because token filters that are not graph-aware cannot handle this kind of output? Is this something that would have to be fixed in Lucene, or do I have the wrong expectations?
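For reference, here is how I inspect the token graph that the delimiter produces on its own, before the shingle filter sees it — the same chain as above with the shingle filter dropped, run against the same test index:

GET test/_analyze
{
  "explain": true,
  "tokenizer": "whitespace",
  "filter": [
    "trim",
    "asciifolding",
    "delimiter",
    "lowercase"
  ],
  "text": [
    "grey's anatomy"
  ]
}

The positions and position_length values in this output are what led me to the expectation above.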
In case more info is needed, let me know!
Thanks in advance,