How does ES handle strings with different encodings?


(Ivan Ji) #1

Hi all,

I am using Python to communicate with ES. I know that ES uses UTF-8 encoding.

But when I insert a unicode string into ES, does it encode it as UTF-8?

I ask because I ran into the following situation: I inserted two documents
into ES. The first document contains a unicode string, e.g. u'\u611b'.
The second one has the same content as the first, but the
string is UTF-8 encoded, e.g. '\xe6\x84\x9b'.

No matter which form of the string I use in a search, I get
both hits. That seems excellent, but I am wondering why.
Does ES normalize the encoding when indexing and querying? If not, how does it
find the results for both encodings?

Any ideas?

Ivan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/dae9f8c9-da23-4752-8a0b-18bda2cbefbd%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Honza Král) #2

Hi Ivan,

The Python client should accept both byte strings and unicode, though I
highly recommend you always use unicode to avoid encoding issues. When
the data is sent to the server we always encode it as UTF-8, so both
these strings will end up encoded the same.
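
As a rough illustration of why the two documents end up identical, here is a
minimal sketch that does not use the client at all. The helper to_wire and the
field name 'title' are made up for this example, and this is not the client's
actual serializer; it only assumes that byte strings are treated as UTF-8
before the JSON body is built:

```python
import json

text = u'\u611b'        # the character as a unicode string
raw = b'\xe6\x84\x9b'   # the same character as UTF-8 encoded bytes


def to_wire(value):
    """Hypothetical helper: build a UTF-8 encoded JSON body for one field.

    Byte strings are assumed to be UTF-8 and are decoded first, so both
    input forms are serialized from the same unicode string.
    """
    if isinstance(value, bytes):
        value = value.decode('utf-8')
    body = json.dumps({'title': value}, ensure_ascii=False)
    return body.encode('utf-8')


# Encoding the unicode string yields exactly the bytes of the second form.
assert text.encode('utf-8') == raw

# Both inputs therefore produce the same bytes on the wire.
assert to_wire(text) == to_wire(raw)
```

Once the request body is the same sequence of bytes, the server has no way to
tell which form the Python side started from.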

To see exactly what's going on you can simply enable logging of
individual requests to see the JSON being sent to the server. To do
that, just do:

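import logging
# 'elasticsearch.trace' is the logger the client uses for request transcripts
tracer = logging.getLogger('elasticsearch.trace')
tracer.setLevel(logging.INFO)
# write the transcript to a file
tracer.addHandler(logging.FileHandler('/tmp/es_trace.log'))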

then in /tmp/es_trace.log you will have a transcript of your session
as curl commands.

Hope this helps,
Honza

On Mon, Mar 3, 2014 at 10:11 AM, Ivan Ji hxuanji@gmail.com wrote:

Hi all,

I am using Python to communicate with ES. I know that ES uses UTF-8 encoding.

But when I insert a unicode string into ES, does it encode it as UTF-8?

I ask because I ran into the following situation: I inserted two documents
into ES. The first document contains a unicode string, e.g. u'\u611b'.
The second one has the same content as the first, but the
string is UTF-8 encoded, e.g. '\xe6\x84\x9b'.

No matter which form of the string I use in a search, I get
both hits. That seems excellent, but I am wondering why.
Does ES normalize the encoding when indexing and querying? If not, how does it
find the results for both encodings?

Any ideas?

Ivan


