Drop stopword's token position completely

Is there an existing way/setting of dropping stopword's token position completely?

Based on my understanding of the ES stopword token filter, it removes the stopword but leaves an empty token position. The implication for phrase matching is that something else has to fill in the spot.
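To make the position gap concrete, here is a toy model in Python. It is only a sketch: it splits on whitespace and is not Lucene's actual analysis chain, and the `analyze` helper and its `keep_gaps` flag are names invented for illustration. It uses a French phrase with "de" and "la" as stopwords.

```python
def analyze(text, stopwords, keep_gaps=True):
    """Toy analyzer: return (token, position) pairs for whitespace-separated text.

    keep_gaps=True mimics a stop filter that drops the stopword but still
    advances the position counter; keep_gaps=False drops it entirely.
    """
    result = []
    position = -1
    for token in text.lower().split():
        if token in stopwords:
            if keep_gaps:
                position += 1  # stopword is removed but still uses up a position
            continue
        position += 1
        result.append((token, position))
    return result


STOPWORDS = {"de", "la"}  # assumed stopword list
phrase = "jardins près de la plage"

with_gaps = analyze(phrase, STOPWORDS, keep_gaps=True)
without_gaps = analyze(phrase, STOPWORDS, keep_gaps=False)

print(with_gaps)     # [('jardins', 0), ('près', 1), ('plage', 4)]
print(without_gaps)  # [('jardins', 0), ('près', 1), ('plage', 2)]
```

With gaps kept, "plage" stays at position 4, so the two removed stopwords still count as distance between "près" and "plage".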

For example:

If "de" and "la" are in my stopword list, and I index the following phrase:
"jardins près de la plage"

then I can only get a phrase match if the query is in the form of
"jardins près plage"

Considering that French has many variations of these terms, such as "des", "le", "les", etc., it would be really nice to just drop them completely at indexing time and apply the same filter at search time.

Any advice on this will be greatly appreciated. If all else fails, I will attempt to write a custom stopword token filter for it.

Here's a brief update on my investigation of the stop word filter:

  1. The ES built-in stop token filter works well when a stop word is indexed, provided the phrase match slop is set to 1. If I index "jardins près de plage", then phrase match works with both "jardins près plage" and "jardins près de la plage". It now comes down to whether I can live with a slop of 1 for all phrase match queries.

  2. Custom Stop token filter: https://github.com/maoning/elasticsearch-remove-token-filter/blob/master/src/main/java/elasticsearch.remove/RemoveTokenFilter.java

It works: it simply drops the stop words as if they were never in the token stream. The trade-off, however, is that a custom plugin means more maintenance work for each major ES upgrade.
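The two options can be compared with a small Python sketch. It is a simplified model, not Lucene's sloppy phrase scorer: it assumes every query term occurs exactly once in the document and in the same order, and the `analyze` and `required_slop` helpers are hypothetical names. The slop needed is taken as the spread between how far each term is shifted relative to its position in the query.

```python
def analyze(text, stopwords, keep_gaps=True):
    """Toy analyzer: (token, position) pairs, optionally keeping stopword gaps."""
    result = []
    position = -1
    for token in text.lower().split():
        if token in stopwords:
            if keep_gaps:
                position += 1  # removed stopword still consumes a position
            continue
        position += 1
        result.append((token, position))
    return result


def required_slop(doc_text, query_text, stopwords, keep_gaps=True):
    """Smallest slop for which the query phrase matches the document (toy model)."""
    doc_positions = dict(analyze(doc_text, stopwords, keep_gaps))
    query = analyze(query_text, stopwords, keep_gaps)
    # Shift of each term: where it sits in the document minus where the query expects it.
    shifts = [doc_positions[token] - qpos for token, qpos in query]
    return max(shifts) - min(shifts)


STOPWORDS = {"de", "la"}  # assumed stopword list
doc = "jardins près de plage"

# Option 1: built-in behaviour (gaps kept) needs a slop of 1 for both queries.
print(required_slop(doc, "jardins près plage", STOPWORDS, keep_gaps=True))        # 1
print(required_slop(doc, "jardins près de la plage", STOPWORDS, keep_gaps=True))  # 1

# Option 2: a filter that drops positions entirely matches exactly (slop 0).
print(required_slop(doc, "jardins près plage", STOPWORDS, keep_gaps=False))        # 0
print(required_slop(doc, "jardins près de la plage", STOPWORDS, keep_gaps=False))  # 0
```

In this model the built-in filter needs a slop equal to the difference in the number of removed stopwords between the indexed text and the query, while the gap-free filter needs none.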

Stop word filtering has always been an unanswered question for me. I still don't quite know why Lucene decided to keep stop word token positions. However, I hope my investigation is somewhat helpful for people with similar questions.
