Choose Correct Text Analyzer/Tokenizer

I have a text field that represents a file name, and this field follows a specific format: it contains several parts separated by underscores (_), each part can contain letters, numbers, and dashes (-), and the name ends with a dot (.) and an extension. For example: X_Y-B_Z.ext.
I want to be able to search this field by each of its parts, but the "-" prevents this from working. I think I should use a custom analyzer/tokenizer to solve this, but I am not sure exactly how to set it up.

Yes, you could use a custom analyzer with a tokenizer that breaks on just the characters you want to break on (in your case _ and .). The Char Group tokenizer is perhaps the most straightforward choice.

Here's a little example:

PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "_",
            "."
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

GET my_index/_analyze
{
  "text": "X_Y-B_Z.ext",
  "analyzer": "my_analyzer"
}
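
For reference, that _analyze call should return the tokens x, y-b, z, and ext. The response sketched below is trimmed to the token values only; the full response also includes offsets, types, and positions:

{
  "tokens": [
    { "token": "x" },
    { "token": "y-b" },
    { "token": "z" },
    { "token": "ext" }
  ]
}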

PUT my_index/_doc/1
{
  "my_field": "X_Y-B_Z.ext"
}

GET my_index/_search
{
 "query": {
   "match": {
     "my_field": "y-b"
   }
 } 
}
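
As a sanity check, you can also run the query text through the same analyzer (this extra call is just for illustration). It comes out as the single token y-b, which is exactly what was indexed for that part of the file name, so the match query above finds the document:

GET my_index/_analyze
{
  "text": "y-b",
  "analyzer": "my_analyzer"
}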

If you don't need case-insensitive search you can remove the lowercase token filter from the analyzer.
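
If you go that route, the settings would look like this (just a sketch of the same setup with the filter removed, using a hypothetical index name my_index_cs):

PUT my_index_cs
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "_",
            "."
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [],
          "tokenizer": "my_tokenizer",
          "filter": []
        }
      }
    }
  }
}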


Thanks for the suggested solution. This solved the problem mentioned in the post very well. However, I forgot to mention that X, Y, B, and Z in my example could each be a set of alphanumeric characters, like ABC01_DG-102_102_203.ext and ABC01_DG-102_102_204.ext.

With the provided solution, if I search for "ABC01_DG-102_102_20*" in the Kibana search bar, I expect to get both ABC01_DG-102_102_203.ext and ABC01_DG-102_102_204.ext; however, I get no results.
How can I modify this solution to handle that case as well?

I used abdon's solution with a small modification: I solved the issue by keeping only "." in "tokenize_on_chars".

PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "."
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
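
To double-check this outside Kibana (purely illustrative; it assumes the index was deleted and recreated with the settings above, and uses two example documents), you can index both file names and run a query_string query, which is roughly what the Kibana search bar sends when the Lucene query syntax is used. With only "." as a break character, everything before the extension stays as a single token, so the prefix wildcard should match both documents:

PUT my_index/_doc/1
{
  "my_field": "ABC01_DG-102_102_203.ext"
}

PUT my_index/_doc/2
{
  "my_field": "ABC01_DG-102_102_204.ext"
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "my_field",
      "query": "ABC01_DG-102_102_20*"
    }
  }
}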
