Hi there,
I'm having trouble getting shingles and stopwords to play nicely together: I end up with duplicated tokens all over the place.
Here's my analyzer:
"analysis": {
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 3,
"min_shingle_size": 2,
"filler_token": "",
"output_unigrams": True
},
"filter_stop": {
"type": "stop",
"stopwords": stopwords
}
},
"analyzer": {
"analyzer_shingle": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"filter_stop",
"filter_shingle",
"trim"
]
}
}
}
Now, when I analyze this I get lots of repeating tokens. For example, the following
GET article_search_production/_analyze
{
  "analyzer": "analyzer_shingle",
  "text": "buy a car"
}
outputs this long list of tokens:
{
  "tokens": [
    {
      "token": "buy",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "buy",
      "start_offset": 0,
      "end_offset": 6,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "buy car",
      "start_offset": 0,
      "end_offset": 9,
      "type": "shingle",
      "position": 0,
      "positionLength": 3
    },
    {
      "token": "car",
      "start_offset": 6,
      "end_offset": 9,
      "type": "shingle",
      "position": 1,
      "positionLength": 2
    },
    {
      "token": "car",
      "start_offset": 6,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
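For comparison, the stream I'm actually after only has the three distinct tokens (positions aside):

buy
buy car
car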
I get why this is happening -- the stop filter removes the stopword and leaves a position gap, the shingle filter fills that gap with the empty filler_token, and after trim those filler shingles collapse into duplicates of the plain unigrams. That's not the desired behaviour, though.
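Tracing it by hand, I believe the stream evolves roughly like this at each stage (my own sketch, not actual _analyze output):

standard tokenizer : buy / a / car
lowercase          : buy / a / car
filter_stop        : buy / (position gap) / car
filter_shingle     : buy, "buy ", "buy car", " car", car   (the gap is filled with the empty filler_token)
trim               : buy, buy, "buy car", car, car

So the extra "buy" and "car" are really the filler shingles with their whitespace trimmed away.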
I'm wondering: is there a way to ensure there are no duplicated tokens in the token stream? I assume this affects how ES scores each stored document -- for example, if the term "buy" appears in a document it will be matched twice.
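One candidate I came across is the built-in remove_duplicates token filter, appended to the end of the chain, something like:

"filter": [
  "standard",
  "lowercase",
  "filter_stop",
  "filter_shingle",
  "trim",
  "remove_duplicates"
]

As far as I can tell, though, remove_duplicates only removes tokens that are identical and at the same position, so it would catch the second "buy" (both at position 0) but not the second "car" (positions 1 and 2).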
I've already applied the filler_token parameter and the trim filter as suggested by kind folk on this forum, but I'm still stuck with this problem. Any further help would be greatly appreciated!