Synonym token graphs and Shingles don't play well together

Hi

I know I'm not the first one to touch on this issue, but no solution seems to have been provided.

Expected behavior:
With the synonym [world,earth] and the phrase "hello world", sent through a shingle filter of size 2, I would expect the following terms to be produced:

"hello world"
"hello earth"

since "world" and "earth" share the same position due to the graph nature of the synonyms filter.

Actual behavior:
However, the shingle filter does not appear to look at the position (graph) information of the terms produced by the synonym_graph filter. Instead it treats them as the flat list [hello, earth, world], outputting the shingles:

"hello earth"
"earth world"
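To make the mismatch concrete, here is a toy Python sketch (my own model, not Lucene code) contrasting a position-blind sliding window with graph-aware shingling over the token stream the synonym filter emits:

```python
# Toy illustration of the problem. Tokens are (term, position) pairs,
# roughly as a graph-aware synonym filter emits them for "hello world"
# with the synonym world,earth: "earth" and "world" share position 1.
from itertools import product

tokens = [("hello", 0), ("earth", 1), ("world", 1)]

def naive_shingles(tokens, size=2):
    """Slide a window over the token LIST, ignoring positions --
    this mimics the behavior I'm seeing from the shingle filter."""
    terms = [term for term, _ in tokens]
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]

def graph_shingles(tokens, size=2):
    """Group terms by POSITION, so stacked synonyms expand into one
    shingle per path through the graph -- the behavior I expected."""
    by_pos = {}
    for term, pos in tokens:
        by_pos.setdefault(pos, []).append(term)
    positions = sorted(by_pos)
    out = []
    for i in range(len(positions) - size + 1):
        for combo in product(*(by_pos[p] for p in positions[i:i + size])):
            out.append(" ".join(combo))
    return out

print(naive_shingles(tokens))  # ['hello earth', 'earth world']
print(graph_shingles(tokens))  # ['hello earth', 'hello world']
```

The naive version reproduces the actual output above; the position-aware version produces the bigrams I expected.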

Why I think this should work:
I've read that the index_phrases feature uses a Lucene filter called FixedShingleFilter to get around this limitation.

The index_phrases feature does not cover my needs for bigram matching, so I'd like to be able to roll my own analyzer.
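For reference, index_phrases is a per-field mapping parameter rather than something configured in the analysis settings, which is part of why it can't be combined with a custom shingle filter chain (the index and field names here are just examples):

PUT phrases_test
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "index_phrases": true
      }
    }
  }
}

As I understand it, this only speeds up match_phrase-style queries against a hidden shingle subfield; it doesn't expose the bigrams as regular terms.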

I'm currently testing with ES 7.6.2, running Lucene 8.4.0.
The FixedShingleFilter issue shows that the functionality was added back in Lucene 7.4.

So my question is:
Can I access the FixedShingleFilter from the analysis settings as well? Or am I missing another obvious solution?

Thanks in advance!


Code for reference:

PUT synonyms_test
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "bigram_filter": {
            "max_shingle_size": "2",
            "min_shingle_size": "2",
            "token_separator": " ",
            "output_unigrams": "false",
            "type": "shingle"
          },
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "world,earth"
            ]
          }
        },
        "analyzer": {
          "bigrams": {
            "filter": [
              "synonym_graph",
              "lowercase",
              "bigram_filter"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}


GET synonyms_test/_analyze
{
  "analyzer": "bigrams", 
  "text": "hello world"
}

Output:

{
  "tokens" : [
    {
      "token" : "hello earth",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "earth world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 1
    }
  ]
}

Is there really no one who can answer or provide an update on this?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.