0.90 String character encoding now UTF-8 - can that be explicitly set?

Hi,

In watching the "What's new in Elasticsearch 0.90?" webinar, one change is
that String data is now encoded in UTF-8, which for many languages results
in a space saving, since ASCII characters encode to a single byte rather
than the two bytes of a UCS-2 encoding.

However, for some scripts, notably CJK, UTF-8 encodes each character to 3
or sometimes 4 bytes. So, for a site indexing primarily CJK (and other
Asian scripts such as Hangul), the String storage will INCREASE by a good
amount (50% or so).
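The byte counts Bob is describing are easy to verify; here is a quick sketch in Python (the sample strings are illustrative, but the encoding sizes are standard Unicode behavior):

```python
# Byte counts for the same text under UTF-8 vs the old UCS-2/UTF-16 layout.
samples = {
    "ASCII": "elasticsearch",
    "Accented Latin": "café naïve",
    "CJK": "全文検索",        # Japanese: "full-text search"
    "Hangul": "검색 엔진",     # Korean: "search engine"
}

for label, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    # UTF-16-LE (no BOM) costs 2 bytes per BMP character, matching UCS-2 here.
    utf16 = len(text.encode("utf-16-le"))
    print(f"{label:15s} chars={len(text):2d}  utf8={utf8:2d}  utf16={utf16:2d}")
```

ASCII halves in size under UTF-8, while each CJK or Hangul character grows from 2 bytes to 3.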

Is there a way (or will there be a way) to specify the character encoding
to use for String type fields? I looked through the Elasticsearch guide,
and couldn't see anything on install / setup / mapping / index create...

Thanks!

Bob.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Robert,

On Tue, May 7, 2013 at 2:47 PM, Robert Sandiford bobsandiford@gmail.com wrote:


It is possible to change the underlying encoding by making your
analysis chain use a custom implementation of Lucene's
TermToBytesRefAttribute [1]. Be aware, however, that this is a
super-expert API that requires a good understanding of Lucene's
analysis internals.

[1] http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/analysis/tokenattributes/TermToBytesRefAttribute.html

--
Adrien


OK, Thanks Adrien. We'll work initially with the UTF-8 as delivered, and
keep an eye on things...

Bob.


UTF-8 is the input encoding and output encoding, so I believe that 0.90.0
just kept it as UTF-8 internally to avoid many unnecessary conversions to
and from the fixed-width String form.

The only other realistic (IMHO) internal / on-disk encoding that is smaller
than UTF-8 for the cases you mention is a compressed Unicode encoding (e.g.
SCSU), but that would also require additional conversions, and performance
would suffer. As it is, ES 0.90.0 compresses stored data on disk by
default, so the extra space UTF-8 needs for CJK is largely offset by other
space-saving functions.
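That offsetting effect is easy to see with any general-purpose compressor. A sketch in Python, using zlib purely as a stand-in (it is not the compression ES actually uses for stored fields):

```python
import zlib

# CJK text: 3 bytes per character in UTF-8, 2 in UCS-2/UTF-16.
# Repetition mimics the redundancy real document corpora exhibit.
text = "全文検索エンジンは大量の文書を高速に検索します。" * 200
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print("raw:       utf8=%d  utf16=%d" % (len(utf8), len(utf16)))
print("compressed: utf8=%d  utf16=%d"
      % (len(zlib.compress(utf8)), len(zlib.compress(utf16))))
```

Raw UTF-8 is 50% larger than UTF-16 for this text, but after compression the gap largely disappears, which is the point being made about the on-disk defaults.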

Also note that UTF-8 uses 2-byte sequences for code points from U+0080
through U+07FF, which includes the accented ISO-8859-1 characters between
0x0080 and 0x00FF used by most European languages. But again, the extra
space is offset by space savings in other areas and by performance gains.
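The boundaries at which UTF-8 sequence width changes are fixed by the Unicode standard; a one-liner sketch:

```python
# UTF-8 width grows at fixed code-point boundaries:
# up to U+007F -> 1 byte, U+0080..U+07FF -> 2, U+0800..U+FFFF -> 3, above -> 4.
for cp in (0x0041, 0x00E9, 0x07FF, 0x0800, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} byte(s)")
```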

On Tuesday, May 7, 2013 8:47:49 AM UTC-4, Robert Sandiford wrote:

