Character group tokenizer in Elasticsearch

Hello, I want to implement a character group tokenizer in Elasticsearch. How do I create an index with the char_group tokenizer?
I am putting these settings in my index:

{
  "index": {
    "analysis": {
      "number_of_shards": "1",
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-",
            ",",
            ":",
            "\n"
          ]
        }
      }
    }
  }
}

My Index mapping:

{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "Id": {
        "type": "long"
      },
      "Name": {
        "type": "search_as_you_type",
        "doc_values": false,
        "max_shingle_size": 3
      },
      "Name_chargroup": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "tags": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
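
Put together, the index can be created with both the analysis settings and the mapping in a single request. A minimal sketch, assuming the index is called my-index and showing only the char_group field:

PUT my-index
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "my_tokenizer"
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "char_group",
            "tokenize_on_chars": ["whitespace", "-", ",", ":", "\n"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Name_chargroup": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}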

My query:

{
  "size": 200,
  "query": {
    "multi_match": {
      "query": "gss info",
      "type": "most_fields",
      "fields": ["Name_chargroup"],
      "operator": "and"
    }
  }
}

The result is coming back as null...

The document present in the index has Name_chargroup: "Gss Infotech".
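
For context, the document would look something like this (index name, document id, and the Id value are placeholders; only the relevant fields are shown):

PUT my-index/_doc/1
{
  "Id": 1,
  "Name": "Gss Infotech",
  "Name_chargroup": "Gss Infotech"
}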

Hi @Rakhshunda_Noorein_J

You need to add a "lowercase" filter to lowercase the terms.

"my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
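
Once the filter is in place, you can verify the analyzer output against the index, for example (assuming the index is called my-index):

GET my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Gss Infotech"
}

This should return the lowercased tokens gss and infotech.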

Hello, now results are coming back, but not as expected given how the char_group tokenizer works.

My field mapping:

 "Name_chargroup": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                },

My document is "Gss InfoTech" and my search term is "Info Gss".

Result of the char_group tokenizer:

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "Info Gss"
}

response:
{
    "tokens": [
        {
            "token": "Info",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "Gss",
            "start_offset": 5,
            "end_offset": 8,
            "type": "word",
            "position": 1
        }
    ]
}

But my result is coming back as null. On the other hand, "Gss Info" and "Infotech GSS" return results, but "Info Gss" does not.

Why don't you use the "standard" tokenizer?

If you index "Gss InfoTech" and the search term is "gss info", and your query matches with operator "and", you will not have results, because "infotech" != "info".
If you remove the "and", the match will be on the "gss" token (see the example below).
If you want the term "info" to match, you will have to use the edge_ngram tokenizer.
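
For example (with my-index as a placeholder index name), dropping the operator so it falls back to the default "or" would look like this, and would match on the "gss" token alone:

GET my-index/_search
{
  "size": 200,
  "query": {
    "multi_match": {
      "query": "gss info",
      "type": "most_fields",
      "fields": ["Name_chargroup"]
    }
  }
}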


Previously I had used the edge_ngram tokenizer. But the problem with it is this:
when I search for, say, "Information", with max_gram: 10 and min_gram: 3, it breaks "information" into inf, info, infor, and so on. Because of that, "Information" ranks below "info"; documents like "InfoEdge" and "Infotech" come before "Information Technology", which I don't want.
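
For reference, that expansion can be reproduced with _analyze; an edge_ngram tokenizer with those settings (the token_chars value here is an assumption) turns "Information" into the prefixes Inf, Info, Infor and so on, up to ten characters:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 10,
    "token_chars": ["letter", "digit"]
  },
  "text": "Information"
}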

So for this reason I wanted a tokenizer that breaks words when whitespace is encountered.
