Search for special characters

What is the best combination of tokenizer + token filters + analyzer to search for special characters present in my documents?


It would be better if you provided some sample data and explained how you want to search it.

Hi @val

I have a field in my documents which normally contains special characters, for example:
my_field = "$file123#.txt"
my_field = "$office@location&home.txt"

Now the requirement is: if a user searches for special characters like $, #, @, &, etc., the documents containing them in my_field should be returned.

I have tried this using the ngram tokenizer, but it gives some irrelevant results when I search for normal text.

So the requirement here is that whether the user searches for special characters, numbers, or text, they should get relevant search results.
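For example (assuming the two documents above are indexed in my_index; this is only a sketch of the kind of query a user would run), searching for $ should return both documents:

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "$"
    }
  }
}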

Kindly help me solve this.

Can you show what you tried with ngrams?

I have used the custom analyzer below:

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer"
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "custom_analyzer",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/doc/2 
{
  "my_field":"$title123.txt"
}

PUT my_index/doc/1
{
  "my_field":"$titan@123#.txt"
}

Here the special character search works fine, but if I search for titan I get both documents back (including the one whose my_field contains title), which is an irrelevant result.
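For reference, I'm searching with a simple match query like this (the exact query may differ, but this reproduces the problem):

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "titan"
    }
  }
}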

Great start! Try not to use your custom analyzer at search time, otherwise the search terms will get analyzed as well, i.e. titan will get tokenized into ngrams including t, ti, tit, tita and titan, and of course tokens like t, ti and tit will match title too.
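You can see the tokens for yourself with the _analyze API, e.g.:

GET my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "titan"
}

It will list every ngram the tokenizer produces, single characters included.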

Add "search_analyzer" => "standard" to your field and it should already work better

    "my_field": {
      "type": "text",
      "analyzer": "custom_analyzer",
      "search_analyzer": "standard",           <--- add this
      "fields": {
        "keyword": {
          "ignore_above": 256,
          "type": "keyword"
        }
      }
    }

Note that you need to re-create the index and re-index your data first.
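For example, reusing the sample documents from above (just a sketch of the full sequence):

DELETE my_index

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer"
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "custom_analyzer",
          "search_analyzer": "standard",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/doc/1
{
  "my_field": "$titan@123#.txt"
}

PUT my_index/doc/2
{
  "my_field": "$title123.txt"
}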

Also note that the standard analyzer I'm suggesting might not be a good fit for searching special characters, since it strips most punctuation from the query. You might want to create another custom analyzer, but without the ngram tokenizer. The key point is not to use ngrams at search time.
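For example, a whitespace-based search analyzer would keep characters like $ and # in the query tokens, which the standard analyzer strips. This is only a sketch: special_char_search is a name I made up, and whitespace + lowercase is just one possible choice.

    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        },
        "special_char_search": {           <--- hypothetical name
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      },
      ...
    }

and then point the field at it:

    "my_field": {
      "type": "text",
      "analyzer": "custom_analyzer",
      "search_analyzer": "special_char_search"
    }

Keep in mind that the query tokens must still match indexed ngrams of at most max_gram (10) characters, so searching for a whole long filename in one token would not match.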


Sure, thanks @val!
