How do I extend the default analyzer?

Here is my example data.

POST coding/_bulk
{"index":{"_id":"1"}}
{"language":"xyz_foo@abc"}

Using _termvectors, I confirmed that the default analyzer splits on @ but not on the _ (underscore) symbol.

GET coding/_termvectors/1?fields=language
{
  "_index" : "coding",
  ...
  "term_vectors" : {
    "language" : {
      ...
      "terms" : {
        "abc" : {
           ...
        },
        "xyz_foo" : {
          ...
        }
      }
    }
  }
}

Because the default analyzer does not split on the _ (underscore) symbol, I couldn't search for xyz or foo.
How do I create an analyzer that also splits on the _ (underscore) symbol, so that I can search for xyz, foo, and abc?

Hey,

You need to pick a tokenizer that splits the text where you want it split. See this example:

GET _analyze
{
  "text": ["xyz_foo@abc"],
  "tokenizer": "letter"
}

The analyze API lets you see how text is tokenized and modified before it is stored in the inverted index. The letter tokenizer used above splits on every character that is not a letter, so xyz_foo@abc becomes xyz, foo, and abc. Take a look at the Tokenizer reference in the Elasticsearch Guide [7.12] and check which tokenizer fits your case. The character group tokenizer (see Character group tokenizer in the Elasticsearch Guide [7.12]) might also work for you, since it lets you define your own set of characters to tokenize on.
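
As a rough sketch of the character group approach, assuming you only want to split on whitespace, _ and @, you could first try an inline tokenizer definition with the analyze API and then register it as the index default analyzer. The index name coding_v2 and the tokenizer name split_on_symbols are made up for illustration, and the lowercase filter is my own addition. Analysis settings like these are set when an index is created, so this assumes you create a new index and index your documents into it again.

# Try the tokenizer without creating anything (inline definition)
GET _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": ["whitespace", "_", "@"]
  },
  "text": ["xyz_foo@abc"]
}

# Hypothetical new index; an analyzer named "default" is used for all text fields
PUT coding_v2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_symbols": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace", "_", "@"]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "split_on_symbols",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Naming the analyzer default saves you from setting an analyzer on every text field. If you would rather change only the language field, give the analyzer another name and reference it in that field's mapping instead.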
