0.90 String character encoding now UTF-8 - can that be explicitly set?

Hi,

In watching the "What's new in Elasticsearch 0.90?" webinar, one change is
that String data is now encoded in UTF-8, which for many languages results
in a space saving, since ASCII characters encode to a single byte rather
than the two bytes of a UCS-2 encoding.

However, for some scripts, notably CJK, UTF-8 encodes each character to 3
or sometimes 4 bytes. So, for a site indexing primarily CJK (and other
Asian scripts such as Hangul), the String storage will INCREASE by a good
amount (50% or so).
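The byte counts Bob is describing are easy to verify; here is a quick sketch in Python (the sample strings are illustrative, but the encoding sizes are standard Unicode behavior):

```python
# Byte counts for the same text under UTF-8 vs the old UCS-2/UTF-16 layout.
samples = {
    "ASCII": "elasticsearch",
    "Accented Latin": "café naïve",
    "CJK": "全文検索",        # Japanese: "full-text search"
    "Hangul": "검색 엔진",     # Korean: "search engine"
}

for label, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    # UTF-16-LE (no BOM) costs 2 bytes per BMP character, matching UCS-2 here.
    utf16 = len(text.encode("utf-16-le"))
    print(f"{label:15s} chars={len(text):2d}  utf8={utf8:2d}  utf16={utf16:2d}")
```

ASCII halves in size under UTF-8, while each CJK or Hangul character grows from 2 bytes to 3.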

Is there a way (or will there be a way) to specify the character encoding
to use for String type fields? I looked through the Elasticsearch guide,
and couldn't see anything on install / setup / mapping / index create...

Thanks!

Bob.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Robert,

On Tue, May 7, 2013 at 2:47 PM, Robert Sandiford bobsandiford@gmail.com wrote:


It is possible to change the underlying encoding by making your
analysis chain use a custom implementation of Lucene's
TermToBytesRefAttribute [1]. Be aware, however, that this is a
super-expert API that requires a good understanding of Lucene's
analysis internals.

[1] http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/analysis/tokenattributes/TermToBytesRefAttribute.html

--
Adrien


OK, Thanks Adrien. We'll work initially with the UTF-8 as delivered, and
keep an eye on things...

Bob.


UTF-8 is the input encoding and output encoding, so I believe that 0.90.0
just kept it as UTF-8 internally to avoid many unnecessary conversions to
and from the fixed-width String form.

The only other realistic (IMHO) internal / on-disk encoding that is smaller
than UTF-8 for the cases you mention is a compressed Unicode encoding (e.g.
SCSU), but that would also require additional conversions, and performance
would suffer. As it is, ES 0.90.0 compresses stored data on disk by
default, so the extra space UTF-8 needs for CJK is largely offset by other
space-saving functions.
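That offsetting effect is easy to see with any general-purpose compressor. A sketch in Python, using zlib purely as a stand-in (it is not the compression ES actually uses for stored fields):

```python
import zlib

# CJK text: 3 bytes per character in UTF-8, 2 in UCS-2/UTF-16.
# Repetition mimics the redundancy real document corpora exhibit.
text = "全文検索エンジンは大量の文書を高速に検索します。" * 200
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print("raw:       utf8=%d  utf16=%d" % (len(utf8), len(utf16)))
print("compressed: utf8=%d  utf16=%d"
      % (len(zlib.compress(utf8)), len(zlib.compress(utf16))))
```

Raw UTF-8 is 50% larger than UTF-16 for this text, but after compression the gap largely disappears, which is the point being made about the on-disk defaults.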

Also note that UTF-8 uses 2-byte sequences for code points from U+0080
through U+07FF, which includes the accented ISO-8859-1 characters between
0x0080 and 0x00FF used by most European languages. But again, the extra
space is offset by space savings in other areas and by performance gains.
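The boundaries at which UTF-8 sequence width changes are fixed by the Unicode standard; a one-liner sketch:

```python
# UTF-8 width grows at fixed code-point boundaries:
# up to U+007F -> 1 byte, U+0080..U+07FF -> 2, U+0800..U+FFFF -> 3, above -> 4.
for cp in (0x0041, 0x00E9, 0x07FF, 0x0800, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} byte(s)")
```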

On Tuesday, May 7, 2013 8:47:49 AM UTC-4, Robert Sandiford wrote:

