Elasticsearch version 7.17.8
I am working on a custom analyzer to fix text in our database that cannot be searched because it contains '\u200c' characters (zero-width non-joiners, also known as half-space characters).
Example text here:
test text
The example string above contains these half-space characters, so match queries for 'test' and 'text' fail with the standard analyzer.
I have created a custom analyzer and have been trying to test its effectiveness in removing these half-spaces.
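A minimal sketch of what such an analyzer definition could look like, written as a Python dict that serializes to an index settings body. This is an illustration, not necessarily the exact analyzer in use: the names `zwnj_strip` and `zwnj_analyzer` are hypothetical placeholders, and only the `pattern_replace` char filter and the `standard` tokenizer are taken from the requests in this post.

```python
import json

ZWNJ = "\u200c"  # zero-width non-joiner (the "half-space")


def build_settings() -> dict:
    """Build an index settings body with a char filter that strips ZWNJ.

    "zwnj_strip" and "zwnj_analyzer" are hypothetical names chosen for
    this sketch; they do not come from the original setup.
    """
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "zwnj_strip": {
                        "type": "pattern_replace",
                        "pattern": ZWNJ,
                        "replacement": "",
                    }
                },
                "analyzer": {
                    "zwnj_analyzer": {
                        "type": "custom",
                        "char_filter": ["zwnj_strip"],
                        "tokenizer": "standard",
                    }
                },
            }
        }
    }


# json.dumps escapes the invisible character, so it is visible in the body.
body = json.dumps(build_settings(), indent=2)
print(body)
```

Serializing with `json.dumps` keeps the pattern readable as an escape sequence instead of an invisible character, which makes it easier to confirm what is actually being sent.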
As shown below, with the standard tokenizer the hidden characters are retained: the start_offset and end_offset of each token span a character count that includes them.
GET localhost:9200/_analyze
{
"tokenizer": "standard",
"text": "test text"
}
Response:
{
"tokens": [
{
"token": "test",
"start_offset": 1,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "text",
"start_offset": 10,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 1
}
]
}
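To make the offsets above easier to read, here is a small Python sketch that rebuilds a string consistent with them. The exact placement of the hidden characters is an assumption inferred from the offsets (one '\u200c' before the first letter and one after every letter); the real data may differ.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

# Assumed layout, inferred from the offsets in the response:
# one ZWNJ at the very start, then a ZWNJ after every letter.
text = (
    ZWNJ
    + "".join(ch + ZWNJ for ch in "test")
    + " "
    + "".join(ch + ZWNJ for ch in "text")
)


def visible(s: str) -> str:
    """Return the string with the hidden ZWNJ characters stripped."""
    return s.replace(ZWNJ, "")


# Slice with the offsets reported by _analyze. Python indexes by code
# point, which equals the character offset for these BMP characters.
first = text[1:9]
second = text[10:18]

print(len(text))                     # total length including hidden chars
print(len(first), visible(first))    # 8 characters, 4 of them visible
print(len(second), visible(second))  # 8 characters, 4 of them visible
```

Under this assumption each token spans 8 characters, of which only 4 are visible, which matches the reported offsets of 1 to 9 and 10 to 18.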
I then attempted to use a custom char_filter to replace these '\u200c' characters with an empty string, in order to allow matches on 'text' and 'test'.
GET localhost:9200/_analyze
{
"tokenizer": "standard",
"char_filter": {
"type": "pattern_replace",
"pattern": "\u200c",
"replacement": ""
},
"text": "test text"
}
Response:
{
"tokens": [
{
"token": "test",
"start_offset": 1,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "text",
"start_offset": 10,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 1
}
]
}
The start and end offsets are exactly the same as without any replacement.
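Identical offsets do not by themselves show whether the token text changed, because the hidden characters are invisible in the printed token either way. A minimal Python sketch for checking the tokens directly is below. The helper names and the `sample` response are made up for illustration and are not real output from my cluster.

```python
ZWNJ = "\u200c"  # zero-width non-joiner


def escape_token(token: str) -> str:
    """Show non-ASCII characters as \\uXXXX escapes so they become visible."""
    return token.encode("unicode_escape").decode("ascii")


def tokens_with_zwnj(response: dict) -> list:
    """Return the tokens of an _analyze response that still contain ZWNJ."""
    return [t["token"] for t in response["tokens"] if ZWNJ in t["token"]]


# Hypothetical parsed response, for illustration only: the first token
# still carries hidden characters, the second one is clean.
sample = {
    "tokens": [
        {"token": "t\u200ce\u200cs\u200ct\u200c", "start_offset": 1, "end_offset": 9},
        {"token": "text", "start_offset": 10, "end_offset": 18},
    ]
}

for tok in sample["tokens"]:
    print(escape_token(tok["token"]), len(tok["token"]))

print(len(tokens_with_zwnj(sample)))
```

Running the same helpers on the parsed JSON of each real response would show whether the "test" and "text" tokens are 4 characters long or still contain '\u200c'.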
However, when I change the replacement to ".", for example, the characters do appear to be replaced:
GET localhost:9200/_analyze
{
"tokenizer": "standard",
"char_filter": {
"type": "pattern_replace",
"pattern": "\u200c",
"replacement": "."
},
"text": "test text"
}
Response:
{
"tokens": [
{
"token": "t.e.s.t",
"start_offset": 1,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "t.e.x.t",
"start_offset": 10,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 1
}
]
}
So this response suggests that the characters are being matched and replaced correctly, yet replacing them with an empty string does not seem to remove them.
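For comparison, here is what I would expect the two replacements to do, sketched with Python's `re` module. This is only an approximation: `pattern_replace` uses Java regular expressions rather than Python's, and the layout of the hidden characters in `text` is an assumption inferred from the offsets in the responses.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

# Assumed layout, inferred from the reported offsets:
# one ZWNJ at the very start, then a ZWNJ after every letter.
text = (
    ZWNJ
    + "".join(ch + ZWNJ for ch in "test")
    + " "
    + "".join(ch + ZWNJ for ch in "text")
)

# Replacement with an empty string: the hidden characters disappear.
removed = re.sub(ZWNJ, "", text)

# Replacement with ".": same length as the input, so positions line up.
dotted = re.sub(ZWNJ, ".", text)

print(repr(removed), len(removed))
print(dotted, len(dotted))
print(dotted[1:8], dotted[10:17])  # slices at the offsets from the "." response
```

With this layout, slicing the dotted string at offsets 1 to 8 and 10 to 17 gives "t.e.s.t" and "t.e.x.t", the same tokens as in the response above, and the empty replacement leaves the plain 9-character string "test text".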
Any advice or insight into the issue would be extremely helpful.