Search for special characters


(R01K) #1

What is the best combination of token filters + tokenizer + analyzer to search for special characters present in my document?


(Val Crettaz) #2

It would help if you provided some sample data and explained how you want to search it.


(R01K) #3

Hi @val

I have a field in my document which normally contains special characters, for example:

my_field = "$file123#.txt"
my_field = "$office@location&home.txt"

The requirement is that if a user searches for special characters like $, #, @, &, etc., they should get the matching my_field values back in the results.

I tried it with the ngram tokenizer, but it gives some irrelevant results when I search for normal text.

So the requirement here is: whether the user searches for special characters, numbers or plain text, they should get relevant results.

Kindly help me solve this.


(Val Crettaz) #4

Can you show what you tried with ngrams?


(R01K) #5

I have used the custom analyzer below:

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer"
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "custom_analyzer",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/doc/2 
{
  "my_field":"$title123.txt"
}

PUT my_index/doc/1
{
  "my_field":"$titan@123#.txt"
}

Here the special character search works fine, but if I search for titan I get both documents back (including the one with title in my_field), which is an irrelevant result.
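
For example, a simple match query along these lines returns both documents (a sketch of the kind of search I mean, not necessarily the exact query):

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "titan"
    }
  }
}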


(Val Crettaz) #6

Great start! Try not to use your custom analyzer at search time, otherwise the search terms get analyzed as well, i.e. titan gets tokenized into n-grams such as t, ti, tit, tita and titan (among others), and of course tokens like t, ti and tit will match title too.
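
You can see this for yourself with the _analyze API against the index you created:

GET my_index/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "titan"
}

This returns every 1- to 10-character substring of titan (t, ti, tit, tita, titan, i, it, ita, itan, ta, tan, a, an, n), not just the prefixes, so there are plenty of tokens for title to collide with.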

Add "search_analyzer" => "standard" to your field and it should already work better

    "my_field": {
      "type": "text",
      "analyzer": "custom_analyzer",
      "search_analyzer": "standard",           <--- add this
      "fields": {
        "keyword": {
          "ignore_above": 256,
          "type": "keyword"
        }
      }
    }

Note that you need to re-create the index and re-index your data first.
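
To spell it out, a sketch of the full sequence: delete the index, re-create it with the same settings plus the search_analyzer line, then index the two documents again:

DELETE my_index

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer"
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "custom_analyzer",
          "search_analyzer": "standard",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}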


(Val Crettaz) #7

Also note that the standard analyzer I'm suggesting might not be a good fit for searching special characters (it strips them out at tokenization time). You might want to create another custom analyzer, but without the ngram tokenizer; the key point is not to use ngrams at search time.
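
For instance, a whitespace-based search analyzer would keep characters like $, #, @ and & inside the tokens. Something along these lines, defined next to custom_analyzer under analysis.analyzer (the name special_search is just an example, not a built-in):

    "special_search": {
      "type": "custom",
      "tokenizer": "whitespace"
    }

Then set "search_analyzer": "special_search" on my_field instead of "standard".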


(R01K) #8

Sure, thanks @val!!


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.