I need more clarity on language analyzers and HTML filtering. My content sometimes comes wrapped in HTML tags, which I need to strip out during indexing. The content also varies by language: I create a mapping for each language and have to use the appropriate analyzer. How do I combine the two?
For example, I get English content with or without HTML tags, and Spanish content with or without HTML tags. I need to index only the actual text. I am also assuming that a language-specific analyzer still handles English tokens by default, because my content does contain English sentences even when it is classified as some other language.
Should I send different languages to different indices? I was thinking of using types, with one mapping per language, within the same index. Can't I assign analyzers per type?
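To make the question concrete, here is a rough sketch of the settings I imagine, written as a small Python script that only builds and prints the JSON. It assumes the built-in language analyzers cannot take a char filter directly, so each language gets a custom analyzer that applies the `html_strip` char filter first and then a stop filter and stemmer for that language. The analyzer names (`english_html`, `spanish_html`) and the helper `build_analysis` are my own placeholders, the filter chain is only an approximation of the built-in analyzers, and I have not verified this against a cluster.

```python
import json

# Stemmer names per language for the "stemmer" token filter.
# Only the two languages from my example are listed (assumption).
STEMMERS = {"english": "english", "spanish": "light_spanish"}


def build_analysis(languages):
    """Build an index "analysis" section with one HTML-stripping
    custom analyzer per language (hypothetical helper)."""
    filters = {}
    analyzers = {}
    for lang in languages:
        # Language-specific stop words, e.g. "_english_" or "_spanish_".
        filters[f"{lang}_stop"] = {"type": "stop", "stopwords": f"_{lang}_"}
        # Language-specific stemmer.
        filters[f"{lang}_stemmer"] = {"type": "stemmer", "language": STEMMERS[lang]}
        analyzers[f"{lang}_html"] = {
            "type": "custom",
            # Strip HTML tags before the text is tokenized.
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            "filter": ["lowercase", f"{lang}_stop", f"{lang}_stemmer"],
        }
    return {"analysis": {"filter": filters, "analyzer": analyzers}}


settings = build_analysis(["english", "spanish"])
print(json.dumps(settings, indent=2))
```

Each language's mapping would then point its content field at the matching analyzer, for example `"analyzer": "spanish_html"`. Is that the right direction?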