Using shingles after word_delimiter_graph returns unexpected results

Hi! :smiley:

When I combine the word_delimiter_graph and shingle token filters, I do not get the results I expect.
Consider the following example:

DELETE test
PUT test
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "delimited_shingles": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter",
            "lowercase",
            "shingle"
          ]
        }
      },
      "filter": {
        "delimiter": {
          "type": "word_delimiter_graph",
          "type_table": [
            "$ => DIGIT",
            "& => DIGIT",
            "% => DIGIT",
            "\\u002C => DIGIT",
            "\\u002B => DIGIT"
          ],
          "stem_english_possessive": "false",
          "catenate_all": "true",
          "catenate_words": "true",
          "preserve_original": "true",
          "generate_word_parts": "false",
          "split_on_numerics": "true",
          "split_on_case_change": "false"
        },
        "shingle": {
          "max_shingle_size": "2",
          "min_shingle_size": "2",
          "token_separator": "_",
          "output_unigrams": "false",
          "type": "shingle"
        }
      }
    }
  }
}
GET test/_analyze
{
  "explain": true,
  "analyzer": "delimited_shingles",
  "text": [
      "grey's anatomy"
    ]
}

#Expected:
#greys_anatomy
#grey's_anatomy

#Actual:
#greys_grey's
#greys_anatomy

Based on the start_offset, end_offset, position and position_length values in the explain output, I would expect the shingle filter to build shingles along the paths of the token graph that the delimiter filter emits.
Is this happening because non-graph token filters such as shingle cannot handle this kind of output? Would this have to be fixed in Lucene, or are my expectations wrong?
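
To make the expectation concrete, here is a rough Python sketch of the pairing I have in mind. It is not how Lucene's shingle filter is implemented. The `Token` type, the `graph_bigrams` helper and the position values are all hypothetical: I am assuming that the two variants of the first word share position 0 and each span one position, which may not match the real explain output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position: int         # absolute position in the token stream
    position_length: int  # number of positions the token spans


def graph_bigrams(tokens, separator="_"):
    """Join each token with every token that starts where it ends."""
    # Index tokens by the position at which they start.
    by_start = {}
    for tok in tokens:
        by_start.setdefault(tok.position, []).append(tok)

    shingles = []
    for tok in tokens:
        # A valid successor starts exactly where this token ends.
        for nxt in by_start.get(tok.position + tok.position_length, []):
            shingles.append(f"{tok.term}{separator}{nxt.term}")
    return shingles


# Hypothetical token graph for "grey's anatomy" (assumed values, not real output).
tokens = [
    Token("greys", 0, 1),
    Token("grey's", 0, 1),
    Token("anatomy", 1, 1),
]

print(graph_bigrams(tokens))
```

With these assumed positions, tokens that share a position are never joined with each other, so a shingle like greys_grey's cannot occur.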

If more info is needed, let me know! :slight_smile:
Thanks in advance!
