Character group tokenizer in Elasticsearch

Hello, I want to implement a character group tokenizer in Elasticsearch. How do I create an index with the char_group tokenizer?
I am putting these settings in my index:

{
  "index": {
    "analysis": {
      "number_of_shards": "1",
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-",
            ",",
            ":",
            "\n"
          ]
        }
      }
    }
  }
}

My Index mapping:

{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "Id": {
        "type": "long"
      },
      "Name": {
        "type": "search_as_you_type",
        "doc_values": false,
        "max_shingle_size": 3
      },
      "Name_chargroup": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "tags": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
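
Put together, the index can be created with both the analysis settings and the mapping in a single request. A minimal sketch, assuming the index is called my-index and showing only the char_group field:

PUT my-index
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "my_tokenizer"
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "char_group",
            "tokenize_on_chars": ["whitespace", "-", ",", ":", "\n"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Name_chargroup": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}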

My query:

{
  "size": 200,
  "query": {
    "multi_match": {
      "query": "gss info",
      "type": "most_fields",
      "fields": ["Name_chargroup"],
      "operator": "and"
    }
  }
}

The result is coming back as null...

The document present in the index has Name_chargroup: "Gss Infotech".
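
For context, the document would look something like this (index name, document id, and the Id value are placeholders; only the relevant fields are shown):

PUT my-index/_doc/1
{
  "Id": 1,
  "Name": "Gss Infotech",
  "Name_chargroup": "Gss Infotech"
}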

Hi @Rakhshunda_Noorein_J

You need to add a "lowercase" filter to lowercase the terms.

"my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
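
Once the filter is in place, you can verify the analyzer output against the index, for example (assuming the index is called my-index):

GET my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Gss Infotech"
}

This should return the lowercased tokens gss and infotech.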

Hello, now results are coming back, but not as expected given how the char_group tokenizer works.

My field mapping:

 "Name_chargroup": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                },

My document is "Gss InfoTech" and my search term is "Info Gss".

Result of the char_group tokenizer:

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "Info Gss"
}

response:
{
    "tokens": [
        {
            "token": "Info",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "Gss",
            "start_offset": 5,
            "end_offset": 8,
            "type": "word",
            "position": 1
        }
    ]
}

But my result is coming back as null. On the other hand, "Gss Info" and "Infotech GSS" return results, but "Info Gss" does not.

Why don't you use the "standard" tokenizer?

If you index "Gss InfoTech" and the search term is "gss info", and your query matches with operator "and", you will not have results, because "infotech" != "info".
If you remove the "and", the match will be on the "gss" token (see the example below).
If you want the term "info" to match, you will have to use the edge_ngram tokenizer.
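
For example (with my-index as a placeholder index name), dropping the operator so it falls back to the default "or" would look like this, and would match on the "gss" token alone:

GET my-index/_search
{
  "size": 200,
  "query": {
    "multi_match": {
      "query": "gss info",
      "type": "most_fields",
      "fields": ["Name_chargroup"]
    }
  }
}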


Previously I had used the edge_ngram tokenizer. But the problem with it is this:
when I search for, say, "Information", with max_gram: 10 and min_gram: 3, it breaks "information" into inf, info, infor, and so on. Because of that, "Information" ranks below "info"; documents like "InfoEdge" and "Infotech" come before "Information Technology", which I don't want.
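
For reference, that expansion can be reproduced with _analyze; an edge_ngram tokenizer with those settings (the token_chars value here is an assumption) turns "Information" into the prefixes Inf, Info, Infor and so on, up to ten characters:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 10,
    "token_chars": ["letter", "digit"]
  },
  "text": "Information"
}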

So for this reason I wanted a tokenizer that breaks words when whitespace is encountered.
