How do I extend the default analyzer?

jongpyo.lee · May 7, 2021, 8:29am

This is example data.

POST coding/_bulk
{"index":{"_id":"1"}}
{"language":"xyz_foo@abc"}

Using _termvectors, I confirmed that the default analyzer splits on @ but not on the _ (underscore) symbol.

GET coding/_termvectors/1?fields=language
{
  "_index" : "coding",
  ...
  "term_vectors" : {
    "language" : {
      ...
      "terms" : {
        "abc" : {
           ...
        },
        "xyz_foo" : {
          ...
        }
      }
    }
  }
}

Because the default analyzer doesn't split on the _ (underscore) symbol, I can't search for xyz or foo.
How do I create an analyzer that also splits on the _ (underscore) symbol, so that I can search for xyz, foo, and abc?

spinscale · May 10, 2021, 9:10am

Hey,

You need to find the proper tokenizer in order to split tokens. See this example:

GET _analyze
{
  "text": ["xyz_foo@abc"],
  "tokenizer": "letter"
}

The analyze API lets you see how text is tokenized and modified before it is stored in the inverted index. Take a look at the Tokenizer reference (Elasticsearch Guide 7.12) and check which tokenizer fits your case. The char group tokenizer (Character group tokenizer, Elasticsearch Guide 7.12) might also work for you, since it lets you define your own set of characters to tokenize on.
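
To make the char group suggestion concrete, here is a minimal sketch of a custom analyzer built on it. The index name coding_split, the analyzer name split_on_underscore, and the tokenizer name underscore_at_splitter are made up for illustration, and the lowercase filter is only an assumption to keep behaviour close to the default analyzer. An existing index keeps the analyzer its field was mapped with, so the sketch assumes a new index and reindexed data.

```
# Create a new index whose "language" field uses a custom analyzer (all names are hypothetical)
PUT coding_split
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "underscore_at_splitter": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace", "_", "@"]
        }
      },
      "analyzer": {
        "split_on_underscore": {
          "type": "custom",
          "tokenizer": "underscore_at_splitter",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "language": {
        "type": "text",
        "analyzer": "split_on_underscore"
      }
    }
  }
}

# Check the tokens the new analyzer produces for the sample value
GET coding_split/_analyze
{
  "analyzer": "split_on_underscore",
  "text": ["xyz_foo@abc"]
}
```

The second request should return xyz, foo and abc as separate tokens. Compared with the letter tokenizer, which breaks on every non-letter character (digits included), char_group only splits on the characters you list, so add further characters to tokenize_on_chars if your data needs them.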

