I have a text field that represents a file name, and this field follows a specific format: it contains several parts separated by underscores (_), a part can contain letters, numbers, and dashes (-), and it ends with a dot (.) and an extension. For example: X_Y-B_Z.ext.
I want to be able to search this field by each of its parts; however, the "-" is preventing this behavior. I think I should use a custom analyzer/tokenizer to solve this, but I am not sure exactly how to fix it.
Yes, you could use a custom analyzer with a tokenizer that just breaks on those characters that you want to break on (in your case _ and .). The Char Group tokenizer is perhaps the most straightforward choice.
Here's a little example:
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "_",
            "."
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
GET my_index/_analyze
{
  "text": "X_Y-B_Z.ext",
  "analyzer": "my_analyzer"
}
PUT my_index/_doc/1
{
  "my_field": "X_Y-B_Z.ext"
}
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "y-b"
    }
  }
}
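Note that the match query runs the same my_analyzer over the query string, so y-b is looked up as the single token y-b rather than being split further on the dash.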
If you don't need case-insensitive search, you can remove the lowercase token filter from the analyzer.
Thanks for the suggested solution. This solved the problem mentioned in the post very well. However, I forgot to mention that X, Y, B, and Z in the example I gave could each be a set of alphanumeric characters, like ABC01_DG-102_102_203.ext and ABC01_DG-102_102_204.ext.
With the provided solution, if I search for "ABC01_DG-102_102_20*" in the Kibana search bar, I expected to get both ABC01_DG-102_102_203.ext and ABC01_DG-102_102_204.ext; however, I got no results.
How can I modify this solution to handle that case as well?
I used abdon's solution with a small modification.
I solved the issue by listing only "." in "tokenize_on_chars". That way the whole base name before the extension stays together as a single token, so a prefix wildcard like ABC01_DG-102_102_20* has one token it can match against.
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "."
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
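To check the modified analyzer, you can run _analyze again; the whole base name should now survive as one lowercased token:

GET my_index/_analyze
{
  "text": "ABC01_DG-102_102_203.ext",
  "analyzer": "my_analyzer"
}

This should return the tokens abc01_dg-102_102_203 and ext. The wildcard search typed into the Kibana search bar can then be reproduced with a query_string query; as a sketch, setting analyze_wildcard to true asks Elasticsearch to run the analyzer over the wildcard term as well, so the upper-case query text is lowercased to match the indexed tokens:

GET my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "my_field",
      "query": "ABC01_DG-102_102_20*",
      "analyze_wildcard": true
    }
  }
}

With the two example file names indexed, this should return both documents.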