Unicode Support; Newbie Looking for Clarification

Adam_Georgiou · October 16, 2012, 5:30pm

Hey Elasticsearch,

Doing some preliminary research on whether or not we want to use
Elasticsearch to replace some really old searching/profiling (i.e.
percolating) software we use -- Verity Search. This might seem like
somewhat of a naive question, as I'm suspicious of everything coming out of
the deep hole in the ground that is the unsupported and thoroughly
deprecated piece of software that is Verity, but... *does
Elasticsearch(/Lucene) support Unicode out of the box?* I've read the
section of the docs on the ICU plugin, and between that and what's floating
around this list, it seems pretty obvious that it does support it; I'm
curious if anyone can confirm this with a flat yes or no, and also whether
it's a plugin feature (via the ICU plugin) or native to Elasticsearch itself.
I'm guessing it's native, while the ICU plugin gives you some of the
i18n/L10n features usually needed to fully leverage foreign languages, but
again, just looking for some confirmation.

Also wondering how responsive this list is...

Thanks,
Adam

--

simonw_2 · October 16, 2012, 9:07pm

hey adam,

Elasticsearch and Lucene have excellent Unicode support. All Lucene analyzers,
tokenizers and token filters fully support Unicode 4.0, including
supplementary characters, everything in the BMP, etc. With ICU you get
Unicode 6 (in Lucene we recently upgraded to 6.1, even), which should give you
all the tools you need. I personally use Lucene & ES for search engines
that search across 150+ different languages, scripts, etc. The file formats
use UTF-8 anyway, so you really should not be worried about this!
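
For anyone unsure what "BMP" vs. "supplementary characters" means in practice,
here is a minimal sketch in plain Python. It is not Lucene, Elasticsearch or
ICU code; `describe` and `fold` are hypothetical helper names made up for this
illustration. The first helper shows how many UTF-8 bytes and UTF-16 code units
a character needs; the second uses the standard `unicodedata` module as a rough
stand-in for the kind of normalization and case folding an ICU-based filter
would apply before indexing.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return basic encoding facts about a single Unicode character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        # The Basic Multilingual Plane covers U+0000..U+FFFF.
        "in_bmp": cp <= 0xFFFF,
        # UTF-8 uses 1 to 4 bytes per code point.
        "utf8_bytes": len(ch.encode("utf-8")),
        # UTF-16 needs a surrogate pair (2 units) outside the BMP.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


def fold(text: str) -> str:
    """Illustrative stand-in only: NFKC-normalize, then case-fold.

    This is an assumption about what "normalization" looks like in general,
    not a reimplementation of any ICU filter.
    """
    return unicodedata.normalize("NFKC", text).casefold()


# The last sample, MUSICAL SYMBOL G CLEF, lies outside the BMP.
samples = ["a", "\u00e9", "\u4e2d", "\U0001D11E"]
for ch in samples:
    print(unicodedata.name(ch), describe(ch))

# A fullwidth "E" and a decomposed "e + combining acute" both fold to
# the same form a plain-text query would produce.
print(fold("\uFF25lastic"), fold("Cafe\u0301"))
```

The point of the sketch is only that characters outside the BMP are ordinary
code points that need four UTF-8 bytes (or a surrogate pair in UTF-16), so a
search stack that handles supplementary characters has to treat such a pair as
one character rather than two.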

> Also wondering how responsive this list is...

pretty responsive I guess

simon

--