Issue with Shingles and Stopwords

Hi there,

I'm having trouble getting shingles and stopwords to play nicely together: I end up with duplicated tokens all over the place.

Here's my analyzer:

      "analysis": {
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 3,
            "min_shingle_size": 2,
            "filler_token": "",
            "output_unigrams": True
          },
          "filter_stop": {
            "type": "stop",
            "stopwords": stopwords
          }
        },
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "filter_stop",
              "filter_shingle",
              "trim"
            ]
          }
        }
      }

Now, when I analyze text with this analyzer I get lots of repeated tokens. For example, the following request

    GET article_search_production/_analyze
    {
      "analyzer" : "analyzer_shingle",
      "text" : "buy a car"
    }

outputs this long list of tokens:

    {
      "tokens": [
        {
          "token": "buy",
          "start_offset": 0,
          "end_offset": 3,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "buy",
          "start_offset": 0,
          "end_offset": 6,
          "type": "shingle",
          "position": 0,
          "positionLength": 2
        },
        {
          "token": "buy  car",
          "start_offset": 0,
          "end_offset": 9,
          "type": "shingle",
          "position": 0,
          "positionLength": 3
        },
        {
          "token": "car",
          "start_offset": 6,
          "end_offset": 9,
          "type": "shingle",
          "position": 1,
          "positionLength": 2
        },
        {
          "token": "car",
          "start_offset": 6,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }

I get why this is happening -- my analyzer removes the stopwords and replaces them with empty filler strings, and the shingle filter then happily builds shingles across those empty strings as if they were bona fide tokens. So the bigram "buy " (i.e. "buy" plus the empty filler) gets trimmed down to a second "buy" token at the same position. That's not the desired behaviour, though.

I'm wondering: is there a way to ensure there are no duplicated tokens in the query? I assume this also affects how ES scores each stored document (for example, if the term "buy" appears in a document it will be matched twice).

I've already applied the filler_token parameter and the trim filter, as suggested by kind folk on this forum, but I'm still stuck with this problem. Any further help would be greatly appreciated!
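
Would bolting the remove_duplicates token filter onto the end of the chain be a sensible fix? As far as I understand it, it only drops tokens that share both term and position, so it would remove the duplicate "buy" at position 0 but not the two "car" tokens (they sit at positions 1 and 2). Just a sketch of what I mean:

    "filter": [
      "standard",
      "lowercase",
      "filter_stop",
      "filter_shingle",
      "trim",
      "remove_duplicates"
    ]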


If you're using shingles to help speed up phrase searching, then you're almost certainly better off using the 'index_phrases' option on your text fields - this handles all the analysis for you, and will correctly deal with stopwords by falling back to a normal phrase query if any of the query terms are removed.
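
For example, something like this in your mappings (just a sketch -- the field name here is a placeholder, and index_phrases needs ES 6.4 or later; on 6.x you'd also wrap "properties" in your doc type):

    PUT article_search_production
    {
      "mappings": {
        "properties": {
          "body": {
            "type": "text",
            "index_phrases": true
          }
        }
      }
    }

Behind the scenes this indexes two-term shingles into a hidden sub-field and transparently uses it for exact match_phrase queries, so you can drop the shingle filter from your own analysis chain entirely.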

