Where to find default stopword lists?


(Patrick Lam) #1

Hi,

Is there a single place where we can find the default stopwords used by ES/Lucene for every single language available? I've found bits and pieces here and there but I can't locate a single place where the implemented default lists for every language are available and up to date.

Thanks


(tri-man) #2

You can set up ES to use external stopword list files, so you can add or remove words as your data requires. I suggest starting with the default list that ships with ES, and switching to a custom list in an external file only once you see something that does not work well with your data.

Here is the link where you can get a decent set of stopword lists for different languages to start with: http://www.ranks.nl/stopwords
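
For anyone who wants to try the external-file route described above, here is a minimal sketch of the index settings, built as a Python dict and printed as JSON. It assumes the `stop` token filter with its `stopwords_path` option; the filter name `my_stop`, the analyzer name `my_analyzer`, and the file path `stopwords/custom_english.txt` are made up for illustration, so adjust them to your setup.

```python
import json


def build_settings(stopwords_path):
    """Build index settings that load stopwords from an external file.

    The path is expected to point at a file under the ES config directory.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; "stop" is the filter type.
                    "my_stop": {
                        "type": "stop",
                        "stopwords_path": stopwords_path,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name.
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical file location, one stopword per line.
    print(json.dumps(build_settings("stopwords/custom_english.txt"), indent=2))
```

The printed JSON would be sent as the body when creating the index.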


(Patrick Lam) #3

Thanks, but my question was not about custom stopword implementations. It was simply about where I can find the actual default lists that ES/Lucene ships for each language that has one.


(tri-man) #4

You can check lucene-analyzers-common-<version>.jar
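
A jar is an ordinary zip archive, so one way to check it is to list the entries that look like stopword files. The sketch below is only a rough filter: it assumes the lists are stored as `.txt` resources whose file names contain "stop", which may not catch everything in your version. The helper names `list_stopword_files` and `read_entry` are made up for this example.

```python
import sys
import zipfile


def list_stopword_files(jar_path):
    """Return sorted names of .txt entries whose file name contains 'stop'."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.endswith(".txt")
            and "stop" in name.rsplit("/", 1)[-1].lower()
        )


def read_entry(jar_path, entry_name):
    """Return the raw text of one archive entry, decoded as UTF-8."""
    with zipfile.ZipFile(jar_path) as jar:
        return jar.read(entry_name).decode("utf-8")


if __name__ == "__main__":
    # Pass the path to your local lucene-analyzers-common-<version>.jar.
    for entry in list_stopword_files(sys.argv[1]):
        print(entry)
```

Run it with the jar path as the only argument, then use `read_entry` on any listed name to see the words themselves.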


(Isabel Drost-Fromm) #5

I didn't find a pretty version in the docs. But you can always dig through the original code starting here:

(Note: stop words for some of the languages that don't have an explicit folder are stored under snowball.)

Hope this helps,
Isabel

