Char_Filter pattern replace is not behaving correctly

ahiggins · April 5, 2023, 2:44pm

Elasticsearch version 7.17.8

I am working on a custom analyzer to fix text in our database that is unsearchable because it contains '\u200c' characters (zero-width non-joiners, also called half-space characters).

Example text here:

t‌e‌s‌t‌ t‌e‌x‌t‌

The example string above contains these half-space characters between the letters, so match queries for 'test' and 'text' do not find it when the standard analyzer is used.
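To make the problem concrete, here is a minimal Python sketch (outside Elasticsearch) of what such a string looks like to a program. The sample word is constructed for illustration by placing '\u200c' after every letter, which is an assumption about how the stored data is laid out.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, invisible when rendered

# Build "test" with a ZWNJ after every letter (assumed layout of the stored text).
word = ZWNJ.join("test") + ZWNJ

# The string renders like "test" but is twice as long.
print(len(word))

# Listing the code points makes the hidden characters visible.
print([f"U+{ord(c):04X}" for c in word])

# A plain comparison fails until the hidden characters are stripped.
print(word == "test")
print(word.replace(ZWNJ, "") == "test")
```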

I have created a custom analyzer and have been trying to test its effectiveness in removing these half-spaces.

As we can see below, with the standard tokenizer the hidden characters are retained in the tokens, and the start_offset and end_offset values show a character count that includes them.

GET localhost:9200/_analyze

{
    "tokenizer": "standard",
    "text": "‌t‌e‌s‌t‌ t‌e‌x‌t‌"
}

Response:

{
    "tokens": [
        {
            "token": "t‌e‌s‌t‌",
            "start_offset": 1,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "t‌e‌x‌t‌",
            "start_offset": 10,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

I then attempted to use a custom char_filter to replace these '\u200c' characters with an empty string, in order to allow matches on 'text' and 'test'.

GET localhost:9200/_analyze
{
    "tokenizer": "standard",
    "char_filter": {
        "type": "pattern_replace",
        "pattern": "\u200c",
        "replacement": ""
    },
    "text": "‌t‌e‌s‌t‌ t‌e‌x‌t‌"
}

Response:
{
    "tokens": [
        {
            "token": "test",
            "start_offset": 1,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "text",
            "start_offset": 10,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

The tokens are now 'test' and 'text', but the start and end offsets are exactly the same as without any replacement.

But, when I change the replacement to "." for example, it appears to be replacing correctly.

{
    "tokenizer": "standard",
    "char_filter": {
        "type": "pattern_replace",
        "pattern": "\u200c",
        "replacement": "."
    },
    "text": "‌t‌e‌s‌t‌ t‌e‌x‌t‌"
}

Response:

{
    "tokens": [
        {
            "token": "t.e.s.t",
            "start_offset": 1,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "t.e.x.t",
            "start_offset": 10,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

So this response suggests that the characters are being replaced correctly, but they do not seem to be removed correctly when the replacement is empty.
Any advice or insight into the issue would be extremely helpful.
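
One possible reading of the responses above: the token text ('test', 'text') shows that the characters were removed, while start_offset and end_offset still point into the original, unfiltered input. Lucene char filters are designed to correct offsets back to the original text so that features such as highlighting line up with the stored string. The Python sketch below imitates that idea. It is not Elasticsearch code: the helper names `strip_with_offset_map` and `analyze` are hypothetical, and whitespace splitting stands in for the standard tokenizer.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner


def strip_with_offset_map(text, char=ZWNJ):
    """Remove `char` and record, for each kept character, its original index."""
    kept = []
    offset_map = []
    for i, ch in enumerate(text):
        if ch != char:
            kept.append(ch)
            offset_map.append(i)
    # Sentinel: the end of the filtered text maps to the end of the original.
    offset_map.append(len(text))
    return "".join(kept), offset_map


def analyze(text):
    """Tokenize the filtered text on whitespace, reporting original offsets."""
    filtered, offset_map = strip_with_offset_map(text)
    tokens = []
    for m in re.finditer(r"\S+", filtered):
        tokens.append({
            "token": m.group(),
            "start_offset": offset_map[m.start()],
            "end_offset": offset_map[m.end()],
        })
    return tokens


# Same layout as the request text: a ZWNJ before the first letter and after each letter.
sample = ZWNJ + ZWNJ.join("test") + ZWNJ + " " + ZWNJ.join("text") + ZWNJ

for tok in analyze(sample):
    print(tok)
```

Under these assumptions the sketch yields 'test' at offsets 1 to 9 and 'text' at offsets 10 to 18, the same numbers as in the response with the empty replacement. In the "." case every '\u200c' is swapped for one character, so the filtered and original text have the same length and the offsets (1 to 8, 10 to 17) simply describe the shorter token 't.e.s.t'.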
