Does the tokenizer run again after the synonym (graph) token filter?

taichi · August 3, 2021, 9:53am

I'm using Elasticsearch 7.7 on AWS Elasticsearch Service.

I thought token filters run after the tokenizer, and that the tokenizer does nothing once the token filters have started.
(like Tokenizer => Token filters => END)
But it seems the tokenizer runs again after the synonym graph token filter handles synonyms.
I wonder if it works like Tokenizer => Token filters => Tokenizer? => END.

I tested like this:

GET /index/_analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "synonym_graph", "synonyms":["brown fox => brown fox,black cat"]}], 
  "text": "brown fox"
}

and got brown, black, fox, cat.

{
    "tokens": [
        {
            "token": "brown",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "black",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 1,
            "positionLength": 2
        },
        {
            "token": "cat",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 2
        }
    ]
}

I expected to get only brown and fox, because the phrase "brown fox" in the synonym rule does not exist as a single token in the output of the standard tokenizer.

Another example:

GET /index/_analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "synonym_graph", "synonyms":["fox => fox,black cat"]}],
  "text": "fox"
}

returned fox, black, cat, but I expected fox, black cat, because I thought the tokenizer would not run after the token filter to split "black cat".

Could you explain the exact order in which tokenizers and token filters run?

vincenbr · August 3, 2021, 10:06pm

Hi taichi,
A text analyzer operates in sequence. It has (in order):

0 or more character filters
exactly one tokenizer
0 or more token filters, applied in order

In your case, the synonym graph token filter is the last operation.
I suggest reading about token graphs to better understand how token filters work: Token graphs | Elasticsearch Guide [7.14] | Elastic
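
To make the token graph idea concrete, here is a small Python sketch that reads the _analyze response from the first post as a graph. It assumes each token is an edge from "position" to "position" + "positionLength" (with positionLength defaulting to 1 when it is omitted). The function name graph_paths is made up for this illustration; it is not part of Elasticsearch.

```python
from collections import defaultdict

# Tokens copied from the _analyze response earlier in the thread.
# "positionLength" is treated as 1 when the response omits it.
tokens = [
    {"token": "brown", "position": 0},
    {"token": "black", "position": 0, "positionLength": 2},
    {"token": "fox", "position": 1, "positionLength": 2},
    {"token": "cat", "position": 2},
]


def graph_paths(tokens):
    """Enumerate every path through the token graph, first position to last."""
    edges = defaultdict(list)  # start position -> [(token, end position)]
    for t in tokens:
        start = t["position"]
        end = start + t.get("positionLength", 1)
        edges[start].append((t["token"], end))
    first = min(edges)
    last = max(end for outgoing in edges.values() for _, end in outgoing)

    def walk(pos):
        if pos == last:
            return [[]]  # reached the end: one empty continuation
        return [[tok] + rest for tok, end in edges[pos] for rest in walk(end)]

    return walk(first)


for path in graph_paths(tokens):
    print(" ".join(path))
```

Read this way, the four tokens are not four independent words. They form two alternative paths through the same span of text, "brown fox" and "black cat", which is what the synonym rule asked for.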

Furthermore, the synonyms in synonym_graph are tokenized and analyzed with the chain which precedes this filter. That explains why the "brown fox" synonym is applied. See:
Synonym graph token filter | Elasticsearch Guide [7.14] | Elastic
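
A rough way to picture this is the toy Python model below. It is only a sketch and not how Elasticsearch is implemented: standard_like_tokenizer, lowercase_filter, analyze and parse_synonym_rule are hypothetical names, and the regex split is a crude stand-in for the "standard" tokenizer. As far as I understand, the chain used to parse the rules includes the tokenizer itself as well as the preceding token filters, so each side of a rule becomes a sequence of tokens when the filter is built. Nothing is re-tokenized after the synonym filter has run.

```python
import re


def standard_like_tokenizer(text):
    # Hypothetical stand-in for the "standard" tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    # Stand-in for a token filter that sits before the synonym filter.
    return [t.lower() for t in tokens]


def analyze(text, tokenizer, filters):
    # Exactly one tokenizer, then the token filters in order.
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


def parse_synonym_rule(rule, tokenizer, preceding_filters):
    # Explicit mapping "a b => c d,e f": both sides go through the tokenizer
    # and the filters that precede the synonym filter.
    left, right = rule.split("=>")
    lhs = tuple(analyze(left, tokenizer, preceding_filters))
    rhs = [tuple(analyze(alt, tokenizer, preceding_filters)) for alt in right.split(",")]
    return lhs, rhs


lhs, rhs = parse_synonym_rule(
    "brown fox => brown fox,black cat", standard_like_tokenizer, [lowercase_filter]
)
text_tokens = analyze("brown fox", standard_like_tokenizer, [lowercase_filter])
print(lhs, rhs)
print(tuple(text_tokens) == lhs)  # the rule matches a sequence of two tokens
```

In this model the rule's left side is the token sequence (brown, fox), so it matches the two tokens produced from the text, and the replacement "black cat" is already stored as the two tokens (black, cat).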

taichi · August 8, 2021, 4:21pm

Thank you!

Furthermore, the synonyms in synonym_graph are tokenized and analyzed with the chain which precedes this filter

I almost understand, but I'm still a little confused. The docs say:

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file

That only says it applies the preceding token filters. Does it also apply the preceding tokenizer to the synonym entries?
In other words, I wonder whether the "standard" tokenizer is applied to the synonym entries (like brown fox and black cat).

system · September 5, 2021, 4:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
