Issue with Shingles and Stopwords

Hi there,

I'm having trouble getting shingles and stopwords to play nicely together: I end up with duplicated tokens all over the place.

Here's my analyzer:

      "analysis": {
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 3,
            "min_shingle_size": 2,
            "filler_token": "",
            "output_unigrams": True
          },
          "filter_stop": {
            "type": "stop",
            "stopwords": stopwords
          }
        },
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "filter_stop",
              "filter_shingle",
              "trim"
            ]
          }
        }
      }

Now, when I analyze text with this analyzer I get lots of repeated tokens. For example, the following request

    GET article_search_production/_analyze
    {
      "analyzer" : "analyzer_shingle",
      "text" : "buy a car"
    }

outputs this long list of tokens:

    {
      "tokens": [
        {
          "token": "buy",
          "start_offset": 0,
          "end_offset": 3,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "buy",
          "start_offset": 0,
          "end_offset": 6,
          "type": "shingle",
          "position": 0,
          "positionLength": 2
        },
        {
          "token": "buy  car",
          "start_offset": 0,
          "end_offset": 9,
          "type": "shingle",
          "position": 0,
          "positionLength": 3
        },
        {
          "token": "car",
          "start_offset": 6,
          "end_offset": 9,
          "type": "shingle",
          "position": 1,
          "positionLength": 2
        },
        {
          "token": "car",
          "start_offset": 6,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }

I get why this is happening -- my analyzer removes the stopwords and replaces them with empty filler strings, and the shingle filter then happily builds shingles across those empty strings as if they were bona fide tokens. So the bigram "buy " (i.e. "buy" plus the empty filler) gets trimmed down to a second "buy" token at the same position. That's not the desired behaviour, though.

I'm wondering: is there a way to ensure there are no duplicated tokens in the query? I assume this also affects how ES scores each stored document (for example, if the term "buy" appears in a document it will be matched twice).

I've already applied the filler_token parameter and the trim filter, as suggested by kind folk on this forum, but I'm still stuck with this problem. Any further help would be greatly appreciated!
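
Would bolting the remove_duplicates token filter onto the end of the chain be a sensible fix? As far as I understand it, it only drops tokens that share both term and position, so it would remove the duplicate "buy" at position 0 but not the two "car" tokens (they sit at positions 1 and 2). Just a sketch of what I mean:

    "filter": [
      "standard",
      "lowercase",
      "filter_stop",
      "filter_shingle",
      "trim",
      "remove_duplicates"
    ]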


If you're using shingles to help speed up phrase searching, then you're almost certainly better off using the 'index_phrases' option on your text fields - this handles all the analysis for you, and will correctly deal with stopwords by falling back to a normal phrase query if any of the query terms are removed.
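
For example, something like this in your mappings (just a sketch -- the field name here is a placeholder, and index_phrases needs ES 6.4 or later; on 6.x you'd also wrap "properties" in your doc type):

    PUT article_search_production
    {
      "mappings": {
        "properties": {
          "body": {
            "type": "text",
            "index_phrases": true
          }
        }
      }
    }

Behind the scenes this indexes two-term shingles into a hidden sub-field and transparently uses it for exact match_phrase queries, so you can drop the shingle filter from your own analysis chain entirely.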

