What kind of encoding does the ES support?


(Ivan Ji) #1

Hi all,

I am wondering about encoding in ES. What encoding does ES use for
storage? And what encoding does it use during operations?

Through the REST API, any JSON document can be sent for insertion into ES.
So suppose I send a document as follows:

curl -XPOST 'http://192.168.50.7:9200/data/main/' -d '{"name":"蒼天",
"type":"file", "extension":"tmp", "mime_type": "application/text"}'

As we can see, the "name" field is not ASCII; assume my native encoding
is UTF-8. What happens during the insertion?
Does ES store the original UTF-8 string as-is? And if the native encoding
is something else, e.g. Latin, what would happen?

Ideas?

Regards

Ivan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/07c0ba7b-a5c5-494b-9894-28803c519d34%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Ivan Ji) #3

Hey all,

I just performed several tests.

Take the character "愛": its UTF-8 encoding is "\xe6\x84\x9b", and its
Unicode escape is '\u611b'.

I executed the following commands:

curl -XPOST 'http://192.168.50.7:9200/data/main' -d '{"name":"愛"}'

--> {"_index":"data","_type":"main","_id":"bFPGRGFaTcqsS1hOfWVjDQ","_version":1,"created":true}

curl -XPOST 'http://192.168.50.7:9200/data/main' -d '{"name":"\u611b"}'

--> {"_index":"data","_type":"main","_id":"NntYqlwRQl6QZ3ROAclU5w","_version":1,"created":true}

curl -XPOST 'http://192.168.50.7:9200/data/main' -d '{"name":"\xe6\x84\x9b"}'

--> {"error":"MapperParsingException[failed to parse [name]]; nested:
JsonParseException[Unrecognized character escape 'x' (code 120)\n at
[Source: [B@b41c580; line: 1, column: 12]]; ","status":400}

The JSON parse exception itself seems clear enough. But what is the
difference between '{"name":"\xe6\x84\x9b"}' and '{"name":"愛"}'?
PS. My native encoding is UTF-8.

Next I tried to find them using:

curl 'http://192.168.50.7:9200/data/main/_search?pretty' -d '{"query":{"term":{"name": "愛"}}}'

I expected only one hit, but I got two. Why? It seems some transformation
happens internally.

Ideas?

Thanks a lot.

(In reply to Ivan Ji's original post of Tuesday, March 4, 2014, 11:57:33 AM UTC+8.)


(Brian Yoder) #4

Ivan,

Yes, ES stores all strings in UTF-8 encoding.
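
On the question of a non-UTF-8 native encoding such as Latin-1: this thread
does not show how ES reacts to such input, so the following is only a
client-side sketch in Python. It illustrates that the same character becomes
different bytes in Latin-1 and UTF-8, and that Latin-1 bytes for a non-ASCII
character are generally not valid UTF-8, which is why re-encoding to UTF-8
before sending looks like the safe assumption. The sample name is made up
for illustration.

```python
import json

name = "caf\u00e9"  # hypothetical sample value ("cafe" with an acute e), not from the thread

latin1_bytes = name.encode("latin-1")  # the accented e becomes the single byte 0xE9
utf8_bytes = name.encode("utf-8")      # the accented e becomes the two bytes 0xC3 0xA9

# The lone 0xE9 byte is not a valid UTF-8 sequence, so a UTF-8 decoder rejects it.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# Re-encode: decode with the real source encoding, then serialize the body as UTF-8.
text = latin1_bytes.decode("latin-1")
body = json.dumps({"name": text}, ensure_ascii=False).encode("utf-8")

print(latin1_bytes != utf8_bytes)
print(latin1_is_valid_utf8)
print(body)
```

The `body` bytes are what a client would pass as the request payload in place
of the raw Latin-1 bytes.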

Referring to your 3 POST commands: the first two succeeded. In the first
one, you presented the data in UTF-8 encoding and it was accepted. In the
second one, you presented the same name using the \u notation, which is
valid, supported, and equivalent in value.

And since you created the documents and let ES automatically assign the id,
you have 2 separate documents, and both have the same name. Hence your
final query matches them both. No mystery there.
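
The "two hits" result can be pictured with a toy in-memory model in Python.
This is not how ES is implemented; `ToyIndex`, `index_doc`, and
`term_search` are hypothetical names invented for this sketch. It only
illustrates that indexing without an explicit id always creates a new
document, so two posts with the same value yield two matches.

```python
import uuid


class ToyIndex:
    """Toy stand-in for an index: maps document id -> document body."""

    def __init__(self):
        self.docs = {}

    def index_doc(self, body, doc_id=None):
        # No id supplied: generate a fresh one, so a new document is always created.
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        self.docs[doc_id] = body
        return doc_id

    def term_search(self, field, value):
        # Exact match on a single field, returning the ids of all matching documents.
        return [i for i, d in self.docs.items() if d.get(field) == value]


auto = ToyIndex()
auto.index_doc({"name": "愛"})        # raw character
auto.index_doc({"name": "\u611b"})    # same character written as an escape
hits = auto.term_search("name", "愛")
print(len(hits))  # 2

fixed = ToyIndex()
fixed.index_doc({"name": "愛"}, doc_id="1")
fixed.index_doc({"name": "\u611b"}, doc_id="1")  # same id: replaces the first document
fixed_hits = fixed.term_search("name", "愛")
print(len(fixed_hits))  # 1
```

In ES itself, the analogous choice is to put an id in the URL when indexing,
so that a second request with the same id replaces the document instead of
adding another one.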

Your third command failed because the \x notation is not valid JSON.
Looking at http://www.json.org/ it can be seen that the JSON specification
accepts \u notation but not the \x notation. And from the error message,
the (most excellent) Jackson library follows the specification closely on
this.
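
To make the difference between the three request bodies concrete, here is a
minimal Python sketch. It uses Python's standard `json` module as a stand-in
for a strict JSON parser; it is not Jackson or ES, so the exact error text
differs. The point it illustrates: inside single quotes the shell passes
`\xe6\x84\x9b` to curl as literal backslash sequences, so the parser sees an
unsupported `\x` escape rather than the three UTF-8 bytes of the character.

```python
import json

# 1) Raw UTF-8 bytes, as sent when the character is typed in a UTF-8 terminal.
#    Here \x.. is Python bytes syntax, i.e. three real bytes.
raw_utf8 = b'{"name":"\xe6\x84\x9b"}'
doc1 = json.loads(raw_utf8.decode("utf-8"))

# 2) The JSON \u escape: plain ASCII on the wire, same character after parsing.
#    Raw bytes literal: a backslash followed by the characters u611b.
escaped = rb'{"name":"\u611b"}'
doc2 = json.loads(escaped.decode("utf-8"))

# 3) A literal backslash-x sequence, which is what single quotes in the shell produce.
invalid = rb'{"name":"\xe6\x84\x9b"}'
try:
    json.loads(invalid.decode("utf-8"))
    rejected = False
except json.JSONDecodeError:
    rejected = True

print(doc1 == doc2)  # True: both parse to the same one-character string
print(rejected)      # True: \x is not a JSON escape
```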

Hope this helps!

Brian

