Doing some preliminary research on whether or not we want to use
ElasticSearch to replace some really old searching/profiling (i.e.
percolating) software we use -- Verity Search. This might seem like
somewhat of a naive question, as I'm suspicious of everything coming out of
the deep hole in the ground that is the unsupported and thoroughly
deprecated piece of software that is Verity, but... *does
ElasticSearch(/Lucene) support Unicode out of the box?* I've read the
section of the docs on the ICU plugin, and between that and what's floating
around this list, it seems pretty obvious that it does support it; I'm
curious if anyone can confirm this with a flat yes or no, and also if it's
a plugin feature (via the ICU plugin) or native to ElasticSearch itself.
I'm guessing it's native, while the ICU plugin gives you some of the
i18n/L10n features usually needed to fully leverage foreign languages, but
again, just looking for some confirmation.
On Tuesday, October 16, 2012 7:30:59 PM UTC+2, Adam Georgiou wrote:
Hey Elasticsearch,
[...]
Elasticsearch and Lucene have excellent Unicode support. All Lucene analyzers,
tokenizers and token filters fully support Unicode 4.0, including
supplementary characters as well as everything in the BMP. With ICU you get
Unicode 6 (in Lucene we recently upgraded to 6.1, even), which should give you
all the tools you need. I personally use Lucene and ES for search engines
that search across 150+ different languages and scripts. The file formats
use UTF-8 anyway, so you really should not be worried about this!
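To make the two points above concrete, here is a rough Python sketch, not anything taken from the Elasticsearch or Lucene sources. The first part just checks that UTF-8 round-trips text from several scripts, including a supplementary (non-BMP) character. The second part builds an index settings body that wires up the ICU plugin; the analyzer name `my_icu_analyzer` is made up for illustration, and `icu_tokenizer` / `icu_folding` are the component names as I understand them from the ICU analysis plugin docs, so check them against the version you install.

```python
import json

# Sample strings: Latin with diacritics, CJK, and one supplementary
# character (U+1D11E, which lies outside the BMP).
samples = ["Ünïcödé", "日本語", "\U0001D11E"]

for text in samples:
    encoded = text.encode("utf-8")
    # UTF-8 can represent every Unicode code point, so decoding must
    # give back exactly the string we started with.
    assert encoded.decode("utf-8") == text
    print(repr(text), "->", len(encoded), "UTF-8 byte(s)")

# Assumed index settings for the ICU analysis plugin. The analyzer name
# "my_icu_analyzer" is hypothetical; the tokenizer and filter names are
# assumptions to verify against the plugin documentation.
icu_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_icu_analyzer": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding"],
                }
            }
        }
    }
}

# Serialize without escaping non-ASCII, since the body is sent as UTF-8.
request_body = json.dumps(icu_settings, ensure_ascii=False)
print(request_body)
```

The resulting `request_body` would be sent when creating an index. Without the ICU plugin installed, the built-in analyzers still handle Unicode text, as described above.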