Does the tokenizer run again after the synonym (graph) token filter?

I'm using Elasticsearch 7.7 on AWS Elasticsearch Service.

I thought token filters run after the tokenizer, and that the tokenizer does nothing once the token filters have started
(like tokenizer => token filters => END).
But it seems the tokenizer runs again after the synonym graph token filter has handled the synonyms.
I wonder if it actually works like tokenizer => token filters => tokenizer? => END.

I tested like this:

GET /index/_analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "synonym_graph", "synonyms":["brown fox => brown fox,black cat"]}], 
  "text": "brown fox"
}

and got the tokens brown, black, fox, and cat:

{
    "tokens": [
        {
            "token": "brown",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "black",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 1,
            "positionLength": 2
        },
        {
            "token": "cat",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 2
        }
    ]
}

I expected to get only brown and fox, because the phrase brown fox from the synonym rule does not exist as a single token in the output of the standard tokenizer.

Another example:

GET /index/_analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "synonym_graph", "synonyms":["fox => fox,black cat"]}], 
  "text": "fox"
}

returned fox, black, cat, but I expected fox, black cat, because I thought no tokenizer runs after the token filter to split black cat.

Could you explain the exact order in which tokenizers and token filters run?

Hi taichi,
A text analyzer operates in sequence. It has (in order):

  • 0 or more character filters
  • exactly one tokenizer
  • 0 or more token filters, applied in order

In your case, the synonym graph token filter is the last operation.
I suggest reading about token graphs to better understand how token filters work: Token graphs | Elasticsearch Guide [7.14] | Elastic
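
The order listed above can be sketched as a small toy pipeline. This is only an illustration in Python, not Elasticsearch code: `analyze`, `strip_html`, `simple_tokenizer`, and `lowercase` are hypothetical stand-ins for a character filter, a tokenizer, and a token filter.

```python
import re
from typing import Callable, List

# Hypothetical stand-ins for the three stages; these are not Elasticsearch APIs.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    # 1. Character filters rewrite the raw text, in order.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer splits the text into tokens.
    tokens = tokenizer(text)
    # 3. Token filters transform the token list, in order. Nothing runs after them.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def strip_html(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)


def simple_tokenizer(text: str) -> List[str]:
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


print(analyze("<b>Brown</b> Fox", [strip_html], simple_tokenizer, [lowercase]))
# prints ['brown', 'fox']
```

The tokenizer is called once, between the character filters and the token filters; no stage in this sketch calls it again on the filtered output.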

Furthermore, the synonyms in synonym_graph are themselves tokenized and analyzed with the chain that precedes this filter. That explains why the "brown fox" synonym is applied. See:
Synonym graph token filter | Elasticsearch Guide [7.14] | Elastic
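
A rough way to picture this, as a toy model only: the synonym rules are tokenized once when the filter is built, so at analysis time the filter just matches and emits tokens that are already split. The sketch below is not how Lucene implements it and ignores positions and positionLength; `simple_tokenizer` and `build_synonym_filter` are hypothetical names, and the regex tokenizer is a stand-in for the standard tokenizer.

```python
import re
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]


def simple_tokenizer(text: str) -> List[str]:
    # Hypothetical stand-in for the "standard" tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


def build_synonym_filter(rules: List[str], tokenizer: Tokenizer):
    # Tokenize both sides of every "lhs => alt1,alt2" rule once, at build time.
    table: Dict[Tuple[str, ...], List[List[str]]] = {}
    for rule in rules:
        lhs, rhs = rule.split("=>")
        key = tuple(tokenizer(lhs))
        if key:  # ignore rules whose left side produces no tokens
            table[key] = [tokenizer(alt) for alt in rhs.split(",")]

    def synonym_filter(tokens: List[str]) -> List[List[List[str]]]:
        # For each matched span, return the alternative token paths.
        out: List[List[List[str]]] = []
        i = 0
        while i < len(tokens):
            for key in sorted(table, key=len, reverse=True):  # longest rule first
                if tuple(tokens[i:i + len(key)]) == key:
                    out.append(table[key])
                    i += len(key)
                    break
            else:
                out.append([[tokens[i]]])  # no rule matched: pass the token through
                i += 1
        return out

    return synonym_filter


rules = ["brown fox => brown fox,black cat"]
synonym_filter = build_synonym_filter(rules, simple_tokenizer)
print(synonym_filter(simple_tokenizer("brown fox")))
# prints [[['brown', 'fox'], ['black', 'cat']]]
```

In this model the input "brown fox" reaches the filter as the two tokens brown and fox, and the rule's left side was also stored as (brown, fox), so the rule matches. The replacement "black cat" comes out as two tokens because it was split when the rule was parsed, not because anything re-tokenizes the filter's output.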


Thank you!

"Furthermore, the synonyms in synonym_graph are tokenized and analyzed with the chain which precedes this filter"

I mostly understand that, but I am still a little confused. The docs say:

"Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file"

That only says it applies the preceding token filters. Does it apply the preceding tokenizer to the synonym entries too?
In other words, I wonder whether the "standard" tokenizer is applied to the synonym entries (like brown fox and black cat).
