I'm trying to remove http:// and/or www from a URL, as well as the ?... query string, so basically strip it to its minimal form. I'm using the pattern_replace char_filter (ES version 1.7) and having trouble matching an "optional" regex pattern.
Here's my filter:
"url_strip_filter": {
    "type": "pattern_replace",
    "pattern": "(?:http:\\/\\/)?(?:www\\.)?([^\\?]+)(?:\\?.*)?",
    "replacement": "$1"
}
When I run the ES _analyze endpoint with that filter, the optional groups are not stripped out:
GET /my_index/_analyze?char_filters=url_strip_filter&tokenizer=keyword
{
    "text": "http://www.blah.com/?foo"
}
{
    "tokens": [
        {
            "token": "{\n \"text\": \"http://www.blah.com/\n}\n",
            "start_offset": 0,
            "end_offset": 43,
            "type": "word",
            "position": 1
        }
    ]
}
But when I remove the ? quantifiers that make the first two groups optional, it works:
"url_strip_filter": {
    "type": "pattern_replace",
    "pattern": "(?:http:\\/\\/)(?:www\\.)([^\\?]+)(?:\\?.*)?",
    "replacement": "$1"
}
{
    "tokens": [
        {
            "token": "{\n \"text\": \"blah.com/\n}\n",
            "start_offset": 0,
            "end_offset": 43,
            "type": "word",
            "position": 1
        }
    ]
}
I don't think it's a regex issue, because I tested the regex against the bare URL and it works as expected: https://regex101.com/r/Hd9E2V/3.
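For reference, here is a minimal sketch of the same standalone check outside ES. It assumes Python's re module behaves closely enough to the Java regex engine for this particular pattern, and strip_url is just a hypothetical helper name, not anything from ES. It only applies the pattern to the bare URL string, so it does not reproduce what the char_filter actually receives.

```python
import re

# Same pattern as in the filter above, with the JSON-level double
# backslashes reduced to single ones.
PATTERN = r"(?:http:\/\/)?(?:www\.)?([^\?]+)(?:\?.*)?"


def strip_url(url):
    """Hypothetical helper: replace every match with capture group 1."""
    return re.sub(PATTERN, r"\1", url)


if __name__ == "__main__":
    for url in ("http://www.blah.com/?foo", "www.blah.com/?foo", "blah.com/?foo"):
        print(url, "->", strip_url(url))
```

With the bare URL as input, all three variants come out as blah.com/, which matches what I see on regex101.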
Not sure what I'm missing here, but any help would be appreciated!