Language and HTML analyzer

Karthik_Ramachandran · April 12, 2016, 1:44am

I need more clarity on language analyzers and HTML filtering. My content sometimes comes wrapped in HTML tags, which I need to strip out during indexing. The content also varies by language, so I create a mapping for each language and have to use the appropriate analyzer. How do I combine the two?

For example, I get English content with or without HTML tags, and Spanish content with or without HTML tags. I need to index only the actual content. I am also assuming that a language-specific analyzer handles English tokens by default, because my content does contain English sentences even when it is classified as some other language.

Thanks for the help.

warkolm · April 12, 2016, 7:00am

It sounds like you need to do a bit of filtering beforehand and send different languages into different indices, with different analysers.

Once they are in (e.g.) English and Spanish language indices, you can then just run your analysers.
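
A minimal sketch of that pre-filtering step, assuming each document already carries a language code. The index names (`content-english`, `content-spanish`, `content-other`) and the `lang`/`body` field names are hypothetical and not taken from this thread:

```python
# Hypothetical index names; nothing in this thread fixes them.
INDEX_BY_LANGUAGE = {
    "en": "content-english",
    "es": "content-spanish",
}
DEFAULT_INDEX = "content-other"  # assumed fallback for unrecognised languages


def index_for(language_code):
    """Return the name of the index a document in this language should go to."""
    return INDEX_BY_LANGUAGE.get(language_code.lower(), DEFAULT_INDEX)


# Hypothetical documents, each tagged with a language code upstream.
docs = [
    {"lang": "en", "body": "<p>Hello world</p>"},
    {"lang": "es", "body": "Hola mundo"},
]

for doc in docs:
    # Each document is routed to its language index before it is sent off.
    print(index_for(doc["lang"]), doc["body"])
```

Each index can then carry its own analysis settings, independent of the others.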

Karthik_Ramachandran · April 12, 2016, 7:42pm

Thanks Mark.

Should I send different languages to different indices? I had thought of using types, with a mapping for each language, within the same index. Can't I set the analyzers per type?

Also, for the HTML tags, I thought of using the html_strip char filter. Won't that help?
Reference: http://stackoverflow.com/questions/18780346/html-strip-in-elastic-search
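
One way this could look, as a sketch only: the built-in language analyzers take no char filter option, so the idea is to define a custom analyzer that runs `html_strip` first and then applies language-specific stop and stemmer filters. The helper `html_aware_analysis` and the analyzer names (`spanish_html`, `english_html`) are made up for illustration, and the filter chain is a simplified stand-in for the built-in `spanish` and `english` analyzers, not an exact copy of them:

```python
import json


def html_aware_analysis(language, stemmer_language, stopwords):
    """Build index analysis settings: html_strip first, then language filters.

    Hypothetical helper; the names it generates are illustrative only.
    """
    stop_name = f"{language}_stop"
    stem_name = f"{language}_stemmer"
    return {
        "analysis": {
            "filter": {
                # Language-specific stop words, e.g. "_spanish_".
                stop_name: {"type": "stop", "stopwords": stopwords},
                # Language-specific stemmer, e.g. "light_spanish".
                stem_name: {"type": "stemmer", "language": stemmer_language},
            },
            "analyzer": {
                f"{language}_html": {
                    "type": "custom",
                    # Char filters run before tokenization, so the tags
                    # never reach the tokenizer or the token filters.
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", stop_name, stem_name],
                }
            },
        }
    }


spanish_settings = html_aware_analysis("spanish", "light_spanish", "_spanish_")
english_settings = html_aware_analysis("english", "english", "_english_")

# These dicts would be the "settings" part of an index creation request.
print(json.dumps(spanish_settings, indent=2))
```

The text field in the mapping would then name `spanish_html` (or `english_html`) as its analyzer. Content without tags passes through `html_strip` unchanged, so the same analyzer covers both cases.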

warkolm · April 12, 2016, 10:09pm

I would. It just keeps the logical domains cleaner and lets you play with analysis on a per-language basis.
