Char_filter doesn't work properly

Hello!
I have an index with a char_filter for removing special symbols like "-_.":

curl -X PUT "localhost:9200/test_index?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "char_filter": {
                "specials_char_filter": {
                    "type": "mapping",
                    "mappings": [ "- =>", ". =>", "_ =>" ]
                }
            },
            "analyzer": {
                "articul_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "html_strip", "specials_char_filter"
                    ],
                    "filter": [
                        "lowercase",
                        "trim"
                    ]
                }
            }
        }
    }
}
'

I've checked the analyzer with the following command:

curl -X POST "localhost:9200/test_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer": "articul_analyzer",
    "text": "<p>U-_.298 </p> "
}
'

It works; I'm getting "token" : "u298".

After indexing a few records, I'm trying the following search request:

curl -X GET "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "U29",
        "default_field": "articul_indexed"
    }
  }
}
'

And it doesn't work. =(( It works for U-298, 298, and others, but for U298 it doesn't.
Could you give me a clue how to get this working?

Hi @Vladimir_Talabko

Did you try to use wildcards?

It's because u29 has not been indexed as a token, but u298 has, so this cannot match.
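
For example, adding a trailing * turns the query into a wildcard search that can match the indexed token. Just a sketch, assuming articul_indexed is actually mapped to use your articul_analyzer:

GET /test_index/_search
{
  "query": {
    "query_string": {
      "query": "U29*",
      "default_field": "articul_indexed"
    }
  }
}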

If you want to do a prefix search, you could add edge n-grams to your analyzer:

DELETE /test_index
PUT /test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "specials_char_filter": {
          "type": "mapping",
          "mappings": [
            "- =>",
            ". =>",
            "_ =>"
          ]
        }
      },
      "filter": {
        "prefix": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      },
      "analyzer": {
        "articul_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip",
            "specials_char_filter"
          ],
          "filter": [
            "lowercase",
            "trim"
          ]
        },
        "articul_analyzer_prefix": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip",
            "specials_char_filter"
          ],
          "filter": [
            "lowercase",
            "trim",
            "prefix"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "text",
        "analyzer": "articul_analyzer_prefix",
        "search_analyzer": "articul_analyzer"
      }
    }
  }
}

# At index time
POST /test_index/_analyze?pretty
{
    "analyzer": "articul_analyzer_prefix",
    "text": "<p>U-_.298 </p> "
}

# At search time
POST /test_index/_analyze?pretty
{
    "analyzer": "articul_analyzer",
    "text": "U29"
}
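
With that setup, the index-time analyzer should emit the tokens u2, u29, and u298 for this text, so the single search-time token u29 can now find the document.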

Thank you, but edge_ngram won't help me, because hyphens may appear multiple times anywhere in my texts.
When I was indexing my texts, I set the mappings like so:

$workParams['mappings'] = [
    'properties' => [
        "articul_indexed" => [
            'type' => 'text',
            'analyzer' => 'articul_analyzer'
        ]
    ]
];

Doesn't this mapping tell the engine to tokenize my text?

Did you try my example with hyphens?
If so, please share what works and what does not as a full example which can be run in the Kibana Dev Console, like the one I provided.

Yes, it works for the "U-298" example. However, if I index text like "aaaaa-uuuuu", the engine can't even find the "aaaaa" (or "uuu") part of it. =(
I just want to remove certain symbols (-_.,) from a string and then be able to search with a query (also without those symbols) matching from any position in the string.

Could you illustrate that with a full example please?

curl -X DELETE "localhost:9200/test_index?pretty"

curl -X PUT "localhost:9200/test_index?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "char_filter": {
                "specials_char_filter": {
                    "type": "mapping",
                    "mappings": [ "-=>", ".=>", "_=>" ]
                }
            },
            "analyzer": {
                "articul_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "html_strip", "specials_char_filter"
                    ],
                    "filter": [
                        "lowercase",
                        "trim"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
          "articul_indexed": {
            "type": "text",
            "analyzer": "articul_analyzer"
          }
        }
      }
}
'

curl -X POST "localhost:9200/test_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer": "articul_analyzer",
    "text": "<p>U-_.298 </p> "
}
'

Here I see: "token" : "u298"
Perfect!

Two records:

curl -X PUT "localhost:9200/test_index/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
    "articul_indexed": "U-298"
}
'

curl -X PUT "localhost:9200/test_index/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
    "articul_indexed": "aaaaa-uuuuu"
}
'

and two tests:

curl -X GET "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "articul_indexed:U298"
    }
  }
}
'

This one works fine!

curl -X GET "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "articul_indexed:aaaaa"
    }
  }
}
'

This one doesn't =(

Try replacing the special characters with whitespace instead of removing them in your analyzer.


I use "tokenizer": "whitespace",

Yes, and that needs whitespace to work, which is why I suggested replacing the characters with a space rather than removing them.

When you remove the characters, aaaaa-uuuuu will be tokenised as aaaaauuuuu, which means you cannot search for either component. If you instead replace them with a space, the whitespace tokenizer will tokenise it as aaaaa and uuuuu.
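
Here is a sketch of what I mean, using a pattern_replace char filter instead of your mapping filter so the replacement with a space is explicit (index and field names reused from your example):

DELETE /test_index

PUT /test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "specials_char_filter": {
          "type": "pattern_replace",
          "pattern": "[-_.]",
          "replacement": " "
        }
      },
      "analyzer": {
        "articul_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [ "html_strip", "specials_char_filter" ],
          "filter": [ "lowercase", "trim" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "articul_indexed": {
        "type": "text",
        "analyzer": "articul_analyzer"
      }
    }
  }
}

# "aaaaa-uuuuu" should now produce the two tokens aaaaa and uuuuu
POST /test_index/_analyze
{
  "analyzer": "articul_analyzer",
  "text": "aaaaa-uuuuu"
}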

If aaaaa-uuuuu is tokenised as aaaaauuuuu, I can search for any part of it, like aaa, uuu, or aauu, can't I?
If I get aaaaa and uuuuu separately, I won't be able to find aauu, for example.
I just want to drop certain characters from vendor codes, because most people don't type them, but I want to show results regardless of whether someone searches for aaaaa, uuuuu, or aauu. Only the order matters.

P.S. As a next step, I'm going to extend this feature by allowing a fuzziness of 1-2 characters.
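
(Roughly something like this, as a sketch using the standard fuzzy query; the value is just an example:)

GET /test_index/_search
{
  "query": {
    "fuzzy": {
      "articul_indexed": {
        "value": "u299",
        "fuzziness": 1
      }
    }
  }
}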

Not necessarily. It depends on the mapping of the field. You could find substrings like in your example, but that would require a wildcard query, which is one of the most expensive and inefficient query types you can use in Elasticsearch.
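
To illustrate that cost: finding a substring inside the single token aaaaauuuuu needs both a leading and a trailing wildcard (a sketch against your existing text field):

GET /test_index/_search
{
  "query": {
    "wildcard": {
      "articul_indexed": {
        "value": "*aauu*"
      }
    }
  }
}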

If this is how you want to query your data, you might want to look into the wildcard field type in order to make these queries more efficient.
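
A sketch of that idea; the index name is just an example, the wildcard field type and the case_insensitive query option require a reasonably recent Elasticsearch version, and the field is not analyzed, so you would still have to strip the -, _ and . characters before indexing (e.g. in your application or an ingest pipeline):

PUT /test_index_wildcard
{
  "mappings": {
    "properties": {
      "articul_indexed": {
        "type": "wildcard"
      }
    }
  }
}

GET /test_index_wildcard/_search
{
  "query": {
    "wildcard": {
      "articul_indexed": {
        "value": "*aauu*",
        "case_insensitive": true
      }
    }
  }
}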

