Synonym token graphs and Shingles don't play well together

Hi

I know I'm not the first one to touch on this issue, but no solution seems to have been provided.

Expected behavior:
With the synonym [world,earth] and the phrase "hello world", sent through a shingle filter of size 2, I would expect the following terms to be produced:

"hello world"
"hello earth"

since "world" and "earth" share the same position due to the graph nature of the synonyms filter.

Actual behavior:
However, the shingle filter does not appear to look at the position (graph) information of the terms produced by the synonym_graph filter. Instead it treats them as the flat list [hello, earth, world], outputting the shingles:

"hello earth"
"earth world"
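To make the mismatch concrete, here is a toy Python sketch (my own model, not Lucene code) contrasting a position-blind sliding window with graph-aware shingling over the token stream the synonym filter emits:

```python
# Toy illustration of the problem. Tokens are (term, position) pairs,
# roughly as a graph-aware synonym filter emits them for "hello world"
# with the synonym world,earth: "earth" and "world" share position 1.
from itertools import product

tokens = [("hello", 0), ("earth", 1), ("world", 1)]

def naive_shingles(tokens, size=2):
    """Slide a window over the token LIST, ignoring positions --
    this mimics the behavior I'm seeing from the shingle filter."""
    terms = [term for term, _ in tokens]
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]

def graph_shingles(tokens, size=2):
    """Group terms by POSITION, so stacked synonyms expand into one
    shingle per path through the graph -- the behavior I expected."""
    by_pos = {}
    for term, pos in tokens:
        by_pos.setdefault(pos, []).append(term)
    positions = sorted(by_pos)
    out = []
    for i in range(len(positions) - size + 1):
        for combo in product(*(by_pos[p] for p in positions[i:i + size])):
            out.append(" ".join(combo))
    return out

print(naive_shingles(tokens))  # ['hello earth', 'earth world']
print(graph_shingles(tokens))  # ['hello earth', 'hello world']
```

The naive version reproduces the actual output above; the position-aware version produces the bigrams I expected.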

Why I think this should work:
I've read that the index_phrases feature uses a Lucene filter called FixedShingleFilter to get around this limitation.

The index_phrases feature does not cover my needs for bigram matching, so I'd like to be able to roll my own analyzer.
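For reference, index_phrases is a per-field mapping parameter rather than something configured in the analysis settings, which is part of why it can't be combined with a custom shingle filter chain (the index and field names here are just examples):

PUT phrases_test
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "index_phrases": true
      }
    }
  }
}

As I understand it, this only speeds up match_phrase-style queries against a hidden shingle subfield; it doesn't expose the bigrams as regular terms.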

I'm currently testing with ES 7.6.2, running Lucene 8.4.0.
The FixedShingleFilter issue shows that the functionality was added back in Lucene 7.4.

So my question is:
Can I access the FixedShingleFilter from the analysis settings as well? Or am I missing another obvious solution?

Thanks in advance!


Code for reference:

PUT synonyms_test
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "bigram_filter": {
            "max_shingle_size": "2",
            "min_shingle_size": "2",
            "token_separator": " ",
            "output_unigrams": "false",
            "type": "shingle"
          },
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "world,earth"
            ]
          }
        },
        "analyzer": {
          "bigrams": {
            "filter": [
              "synonym_graph",
              "lowercase",
              "bigram_filter"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}


GET synonyms_test/_analyze
{
  "analyzer": "bigrams", 
  "text": "hello world"
}

Output:

{
  "tokens" : [
    {
      "token" : "hello earth",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "earth world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 1
    }
  ]
}

Is there really no one who can answer or provide an update on this?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.