I have a custom pattern tokenizer that creates tokens which can include colons (:), and a custom pattern_capture token filter with preserve_original set to false.
The problem is that tokens which don't match any of my patterns are being passed through the filter.
Here's a simplified test case:
"filter": {
  "ipnwords": {
    "type": "pattern_capture",
    "preserve_original": false,
    "patterns": [
      "([a-z0-9][a-z0-9-]+[a-z0-9])"
    ]
  }
},
"analyzer": {
  "words": {
    "tokenizer": "ipwords",
    "filter": ["lowercase", "ipnwords", "unique"]
  }
},
"tokenizer": {
  "ipwords": {
    "type": "pattern",
    "pattern": "[a-zA-Z0-9:\\.-]+",
    "group": 0
  }
}
When I pass a test document containing the string ::: to /_analyze, that string comes back as a token, even though it can never match the pattern in the filter.
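To illustrate the behavior I expect: a rough Python sketch of the filter semantics, where re.finditer approximates how pattern_capture scans each token for matches. The token list is made up for the example; the point is that ::: yields no capture groups, so with preserve_original=false I'd expect it to be dropped rather than passed through.

```python
import re

# The pattern from the ipnwords pattern_capture filter above.
pattern = re.compile(r"([a-z0-9][a-z0-9-]+[a-z0-9])")

# Hypothetical tokens emitted by the "ipwords" tokenizer.
tokens = ["example.com:443", ":::"]

# Expected behavior with preserve_original=false: emit one token per
# captured group, and drop tokens that produce no matches at all.
filtered = [m.group(1) for t in tokens for m in pattern.finditer(t)]
print(filtered)  # ["example", "com", "443"] -- no ":::" token
```

In practice Elasticsearch instead emits ::: unchanged, as if preserve_original were true for non-matching tokens.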
This seems like a bug (preserve_original=false is ignored when no pattern matches in pattern_capture), but maybe someone can point out the error of my ways.
Thanks in advance for any help/advice. Let me know if I can provide more detail or help with testing/bugfix.