Right now, I'm able to create a working stopword filter in the following way:

import elasticsearch_dsl as dsl
from elasticsearch import Elasticsearch

# connection to a local cluster (was the undefined `localhost` before)
client = Elasticsearch("http://localhost:9200")

company_name_stopword = ["inc", "corp"]

# custom stop filter that drops any token matching the list above
_company_name_stopword_filter = dsl.token_filter(
    "_company_name_stopword_filter",
    type="stop",
    ignore_case=True,
    stopwords=company_name_stopword,
)

# whitespace tokenizer, then lowercase, then the stop filter
_tag_analyzer = dsl.analyzer(
    "tag_analyzer",
    tokenizer="whitespace",
    filter=["lowercase", _company_name_stopword_filter],
)

response = _tag_analyzer.simulate(text="Apple Inc", using=client)
print([t.token for t in response.tokens])
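This prints ['apple']: the whitespace tokenizer splits "Apple Inc" into two tokens, lowercase turns "Inc" into "inc", and the stop filter drops it. That is exactly the behaviour I want.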
The objective is to remove, from the input text, any string that appears in the list company_name_stopword.
However, the index I'm using uses tokenizer="keyword" instead of tokenizer="whitespace".
Is it possible to create this filter with a keyword tokenizer? I can't change the tokenizer of the index without risking large behavioural/performance changes...