I have a custom `pattern` tokenizer that produces tokens which can include colons (:), plus a custom `pattern_capture` token filter with `preserve_original` set to false.
The problem is that tokens which don't match any of my capture patterns are still being passed through the filter unchanged.
Here's a simplified test case:
"filter": {
"ipnwords": {
"type": "pattern_capture",
"preserve_original": false,
"patterns": [
"([a-z0-9][a-z0-9-]+[a-z0-9])"
]
}
},
"analyzer": {
"words": {
"tokenizer": "ipwords",
"filter": ["lowercase", "ipnwords", "unique"]
}
},
"tokenizer": {
"ipwords": {
"type": "pattern",
"pattern": "[a-zA-Z0-9:\\.-]+",
"group": 0
}
}
When I pass a test document to `/_analyze` that includes the string `:::`, that string is returned as a token, even though it can never match the capture pattern in the filter.
This looks like a bug to me (`preserve_original=false` is ignored when no pattern matches in `pattern_capture`), but maybe someone can point out the error of my ways.
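As a sanity check that the capture pattern itself really can't match `:::`, here's a minimal sketch in Python's `re` module (its semantics differ slightly from the Java regex engine Elasticsearch uses, but not for a simple character-class pattern like this one):

```python
import re

# The same capture pattern used in the "ipnwords" filter above.
pat = re.compile(r"([a-z0-9][a-z0-9-]+[a-z0-9])")

# No run of [a-z0-9-] characters exists in ":::", so there is no match.
print(pat.search(":::"))        # -> None

# A normal word-like token matches as expected.
print(pat.search("foo-bar").group(1))  # -> foo-bar
```

So the token should be dropped by the filter rather than passed through, unless the filter deliberately keeps non-matching tokens regardless of `preserve_original`.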
Thanks in advance for any help/advice. Let me know if I can provide more detail or help with testing/bugfix.