Char_filter doesn't work properly

Hello!
I have an index with a char_filter for removing special symbols like "-_.":

curl -X PUT "localhost:9200/test_index?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "char_filter": {
                "specials_char_filter": {
                    "type": "mapping",
                    "mappings": [ "- =>", ". =>", "_ =>" ]
                }
            },
            "analyzer": {
                "articul_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "html_strip", "specials_char_filter"
                    ],
                    "filter": [
                        "lowercase",
                        "trim"
                    ]
                }
            }
        }
    }
}
'

I've checked the analyzer with the following command:

curl -X POST "localhost:9200/test_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer": "articul_analyzer",
    "text": "<p>U-_.298 </p> "
}
'

It works; I'm getting "token" : "u298".

After indexing a few records, I'm trying the following search request:

curl -X GET "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "U29",
        "default_field": "articul_indexed"
    }
  }
}
'

And it doesn't work. =(( It works for U-298, 298, and others, but for U298 it doesn't.
Could you give me a clue how to get this working?

Hi @Vladimir_Talabko

Did you try to use wildcards?

It's because u29 has not been indexed as a token, but u298 has, so this cannot match.
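
For example, adding a trailing * turns the query into a wildcard search that can match the indexed token. Just a sketch, assuming articul_indexed is actually mapped to use your articul_analyzer:

GET /test_index/_search
{
  "query": {
    "query_string": {
      "query": "U29*",
      "default_field": "articul_indexed"
    }
  }
}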

If you want to do a prefix search, you could add edge n-grams to your analyzer:

DELETE /test_index
PUT /test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "specials_char_filter": {
          "type": "mapping",
          "mappings": [
            "- =>",
            ". =>",
            "_ =>"
          ]
        }
      },
      "filter": {
        "prefix": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      },
      "analyzer": {
        "articul_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip",
            "specials_char_filter"
          ],
          "filter": [
            "lowercase",
            "trim"
          ]
        },
        "articul_analyzer_prefix": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip",
            "specials_char_filter"
          ],
          "filter": [
            "lowercase",
            "trim",
            "prefix"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "text",
        "analyzer": "articul_analyzer_prefix",
        "search_analyzer": "articul_analyzer"
      }
    }
  }
}

# At index time
POST /test_index/_analyze?pretty
{
    "analyzer": "articul_analyzer_prefix",
    "text": "<p>U-_.298 </p> "
}

# At search time
POST /test_index/_analyze?pretty
{
    "analyzer": "articul_analyzer",
    "text": "U29"
}
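
With that setup, the index-time analyzer should emit the tokens u2, u29, and u298 for this text, so the single search-time token u29 can now find the document.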

Thank you, but edge_ngram won't help me, because hyphens may appear multiple times anywhere in my texts.
When I was indexing my texts, I set the mappings like so:

$workParams['mappings'] = [
    'properties' => [
        "articul_indexed" => [
            'type' => 'text',
            'analyzer' => 'articul_analyzer'
        ]
    ]
];

Doesn't this mapping tell the engine to tokenize my text?

Did you try my example with hyphens?
If so, please share what works and what does not as a full example which can be run in the Kibana Dev Console, like the one I provided.

Yes, it works for the "U-298" example. However, if I index text like "aaaaa-uuuuu", the engine can't even find the "aaaaa" (or "uuu") part of it. =(
I just want to remove certain symbols (-_.,) from a string and then be able to search with a query (also without those symbols) matching from any position in the string.

Could you illustrate that with a full example please?

curl -X DELETE "localhost:9200/test_index?pretty"

curl -X PUT "localhost:9200/test_index?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "char_filter": {
                "specials_char_filter": {
                    "type": "mapping",
                    "mappings": [ "-=>", ".=>", "_=>" ]
                }
            },
            "analyzer": {
                "articul_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "html_strip", "specials_char_filter"
                    ],
                    "filter": [
                        "lowercase",
                        "trim"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
          "articul_indexed": {
            "type": "text",
            "analyzer": "articul_analyzer"
          }
        }
      }
}
'

curl -X POST "localhost:9200/test_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer": "articul_analyzer",
    "text": "<p>U-_.298 </p> "
}
'

Here I see: "token" : "u298"
Perfect!

Two records:

curl -X PUT "localhost:9200/test_index/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
    "articul_indexed": "U-298"
}
'

curl -X PUT "localhost:9200/test_index/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
    "articul_indexed": "aaaaa-uuuuu"
}
'

and two tests:

curl -X GET "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "articul_indexed:U298"
    }
  }
}
'

This one works fine!

curl -X GET "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "articul_indexed:aaaaa"
    }
  }
}
'

This one doesn't =(

Try replacing the special characters with whitespace instead of removing them in your analyzer.


I use "tokenizer": "whitespace",

Yes, and that needs whitespace to work, which is why I suggested replacing the characters with a space rather than removing them.

When you remove the characters, aaaaa-uuuuu will be tokenised as aaaaauuuuu, which means you cannot search for either component. If you instead replace them with a space, the whitespace tokenizer will tokenise it as aaaaa and uuuuu.
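
Here is a sketch of what I mean, using a pattern_replace char filter instead of your mapping filter so the replacement with a space is explicit (index and field names reused from your example):

DELETE /test_index

PUT /test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "specials_char_filter": {
          "type": "pattern_replace",
          "pattern": "[-_.]",
          "replacement": " "
        }
      },
      "analyzer": {
        "articul_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [ "html_strip", "specials_char_filter" ],
          "filter": [ "lowercase", "trim" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "articul_indexed": {
        "type": "text",
        "analyzer": "articul_analyzer"
      }
    }
  }
}

# "aaaaa-uuuuu" should now produce the two tokens aaaaa and uuuuu
POST /test_index/_analyze
{
  "analyzer": "articul_analyzer",
  "text": "aaaaa-uuuuu"
}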

If aaaaa-uuuuu is tokenised as aaaaauuuuu, I can search for any part of it, like aaa, uuu, or aauu, can't I?
If I get aaaaa and uuuuu separately, I won't be able to find aauu, for example.
I just want to drop certain characters from vendor codes, because most people don't type them, but I want to show results regardless of whether someone searches for aaaaa, uuuuu, or aauu. Only the order matters.

P.S. As a next step, I'm going to extend this feature by allowing a fuzziness of 1-2 characters.
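
(Roughly something like this, as a sketch using the standard fuzzy query; the value is just an example:)

GET /test_index/_search
{
  "query": {
    "fuzzy": {
      "articul_indexed": {
        "value": "u299",
        "fuzziness": 1
      }
    }
  }
}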

Not necessarily. It depends on the mapping of the field. You could find substrings like in your example, but that would require a wildcard query, which is one of the most expensive and inefficient query types you can use in Elasticsearch.
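
To illustrate that cost: finding a substring inside the single token aaaaauuuuu needs both a leading and a trailing wildcard (a sketch against your existing text field):

GET /test_index/_search
{
  "query": {
    "wildcard": {
      "articul_indexed": {
        "value": "*aauu*"
      }
    }
  }
}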

If this is how you want to query your data, you might want to look into the wildcard field type in order to make these queries more efficient.
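
A sketch of that idea; the index name is just an example, the wildcard field type and the case_insensitive query option require a reasonably recent Elasticsearch version, and the field is not analyzed, so you would still have to strip the -, _ and . characters before indexing (e.g. in your application or an ingest pipeline):

PUT /test_index_wildcard
{
  "mappings": {
    "properties": {
      "articul_indexed": {
        "type": "wildcard"
      }
    }
  }
}

GET /test_index_wildcard/_search
{
  "query": {
    "wildcard": {
      "articul_indexed": {
        "value": "*aauu*",
        "case_insensitive": true
      }
    }
  }
}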

