Issue when combining shingle filter and stopwords


(Mdebo) #1

Hi,
I've run into an issue with the Elasticsearch shingle filter. When I combine the shingle filter with a stopword filter, the output_unigrams option, which I set to false, no longer seems to be taken into account.

This is the configuration of the analyzer I'm using:

        "shingle_french": {
            "tokenizer" : "standard",
            "filter":  ["standard", "lowercase", "french_stop", "filter_shingle"]
        },

And the filters:

        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 5,
          "min_shingle_size": 2,
          "output_unigrams": false,
          "filler_token": "",
          "output_unigrams_if_no_shingles": true
        },
        "french_stop": {
            "type": "stop",
            "stopwords": "_french_"
        }

When I analyse the tokens generated for the query "porte de garage", I get this result:

porte
start: 0 end: 9 pos: 1
porte garage
start: 0 end: 15 pos: 1
garage
start: 9 end: 15 pos: 2

However, I would like to obtain only one token: "porte garage".
What am I doing wrong?

Thank you in advance.


(Eduard Dudar) #2

"filler_token": "" does not work the way you probably expect here. Your n-grams are "porte de", "porte de garage" and "de garage"; replacing "de" with "" turns them into "porte ", "porte garage" and " garage", or something close to that.

I think the ES team actually expects filler_token to be some meaningful but 'unusual' symbol, so that "something <filler>" never fires. But that's just a suspicion rather than a fact...
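To make the effect easier to see, here is a rough Python sketch of the idea. It is a simplified simulation, not the actual Lucene/Elasticsearch implementation: the function names `remove_stopwords` and `shingles` are made up for illustration, the stopword set contains only "de", and the final whitespace cleanup is an assumption added only to line the result up with the tokens reported in the question.

```python
from typing import List, Optional


def remove_stopwords(tokens: List[str], stopwords: set) -> List[Optional[str]]:
    """Drop stopwords but keep their slot (None), mimicking a position gap."""
    return [None if t in stopwords else t for t in tokens]


def shingles(slots: List[Optional[str]], min_size: int = 2, max_size: int = 5,
             filler: str = "_", separator: str = " ") -> List[str]:
    """Build word n-grams, putting `filler` in every empty slot (simplified)."""
    words = [filler if s is None else s for s in slots]
    out = []
    for start in range(len(words)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(words):
                break  # not enough tokens left for this shingle size
            out.append(separator.join(words[start:end]))
    return out


slots = remove_stopwords(["porte", "de", "garage"], {"de"})

# A visible filler keeps the gap recognisable in every shingle.
visible = shingles(slots, filler="_")
# -> ['porte _', 'porte _ garage', '_ garage']

# An empty filler leaves shingles that differ from plain words only by spaces.
empty = shingles(slots, filler="")
# -> ['porte ', 'porte  garage', ' garage']

# Assumption: once surrounding/duplicate whitespace is cleaned up, these look
# like the unigrams and the single bigram shown in the question.
cleaned = [" ".join(s.split()) for s in empty]
# -> ['porte', 'porte garage', 'garage']
```

With a visible filler such as "_", the shingles that span the removed stopword stay distinguishable from ordinary terms, which seems to be the point of picking an 'unusual' symbol.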

