Can language analyzers be configured to use char_filters and token_filters?


(Robin Hughes) #1

Hi

I have documents in many languages containing basic HTML that need to be
searched in a case-insensitive, ASCII-folded manner.

Is it possible to use the standard language analyzers from
http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html
(in addition to plugins such as the smart chinese and stempel analyzers) in
conjunction with the html_strip char_filter, lowercase and asciifolding
token_filters?

As far as I can tell this isn't possible by config alone, but would love to
be proved wrong.

Thanks,
Robin


(Ivan Brusic) #2

Hi Robin,

You can always re-create the analyzer from scratch as a custom
analyzer. A language analyzer is essentially an analyzer with a
language-specific stemmer filter, so this is not hard to do in ElasticSearch.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/stemmer-tokenfilter.html

I have never used a language analyzer, but I would assume it does
lowercase and asciifolding already. At least the former.
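
To make the custom-analyzer suggestion concrete, here is a rough sketch in
Python that builds the "analysis" settings for a few languages. It is only an
illustration under assumptions: the helper name build_analysis_settings and
the "<language>_html" / "<language>_stemmer" names are made up for this
example, and the stemmer filter's parameter is written as "language" (older
releases document it as "name"), so check the docs for your version. It also
leaves out the stop-word filtering that the built-in language analyzers
normally apply.

```python
import json


def build_analysis_settings(languages):
    """Return an index 'analysis' section with one custom analyzer per language.

    Each analyzer strips HTML, tokenizes with the standard tokenizer, then
    lowercases, ASCII-folds and stems the tokens.
    """
    filters = {}
    analyzers = {}
    for lang in languages:
        # Hypothetical naming scheme, chosen for this example only.
        stemmer_name = f"{lang}_stemmer"
        filters[stemmer_name] = {
            "type": "stemmer",
            # Assumption: parameter key is "language"; older docs use "name".
            "language": lang,
        }
        analyzers[f"{lang}_html"] = {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            # Folding before stemming is a judgement call; some stemmers
            # may behave better when they still see the accented forms.
            "filter": ["lowercase", "asciifolding", stemmer_name],
        }
    return {"analysis": {"filter": filters, "analyzer": analyzers}}


if __name__ == "__main__":
    print(json.dumps(build_analysis_settings(["french", "german"]), indent=2))
```

The resulting dictionary would go under "settings" when creating the index,
and each field would then be mapped to the analyzer for its language.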

Ivan

(In reply to Robin Hughes, robinhughes@fastmail.fm, Thu, May 31, 2012 at 7:13 AM.)


(Robin Hughes) #3

Thanks for your help.

That certainly covers a lot of languages. It looks like some (Polish, Smart
Chinese, Thai) will need a bit of extra work.

Thanks again,

Robin.

