Stopwords file format


(Eugene Strokin) #1

I want to specify my own stop words. This is what I have found so far:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/stop-tokenfilter.html
In elasticsearch.yml I would define an analyzer like this:

index :
    analysis:
        analyzer:
            string_lowercase:
                tokenizer : keyword
                filter : lowercase
                stopwords_path : stopwords.txt
                ignore_case : true

How should I list the stop words in the stopwords.txt file? One word
per line, or in some other way?

Also, I don't care which language users index their data in, so
putting stop words from different languages into the same file should
be no problem. But should I save the file as plain UTF-8, or should I
use escapes like the ones in .properties files, e.g.
"de art\u00edculos"?

Thank you,
Eugene S.


(Shay Banon) #2

Each stop word should be on its own line (lines are separated by \n).
The file is read as UTF-8.
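
To make the format concrete, here is a minimal Python sketch that writes
a mixed-language stopwords.txt with one word per line in UTF-8 and reads
it back. The helper names write_stopwords and read_stopwords are
hypothetical and only illustrate the file layout; this is not how
Elasticsearch itself loads the file. It assumes non-ASCII characters are
stored literally rather than as .properties-style \uXXXX escapes, since
the thread does not say whether such escapes would be decoded.

```python
import os
import tempfile


def write_stopwords(path, words):
    """Write one stop word per line, UTF-8 encoded, with \\n line endings."""
    # newline="\n" keeps the separator as a bare \n on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for word in words:
            f.write(word + "\n")


def read_stopwords(path):
    """Read the file back as UTF-8, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().split("\n")
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Stop words from several languages can share one file.
    words = ["the", "and", "der", "artículos", "и"]
    path = os.path.join(tempfile.mkdtemp(), "stopwords.txt")
    write_stopwords(path, words)
    print(read_stopwords(path))
```

Opening the resulting file in a UTF-8 aware editor should show each word
on its own line, with "artículos" written out directly.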

(In reply to Eugene Strokin's message of Fri, Dec 23, 2011, 4:42 AM.)

