How Elasticsearch encodes strings with special characters before storing

Hi

We have input documents whose values contain special characters such as % and _.
When a value is stored in Elasticsearch, these special characters are replaced
with their hex code equivalents.
For example,
X3dPVA9%252bZZjFLd864e7U1udCbHZhJ77amNcaGtV7Zp6dJwl3LM%252fd1cD8j8fh8spX_14978fa269e
is stored as
x3dpva9%2bzzjfld864e7u1udcbhzhj77amncagtv7zp6djwl3lm%2fd1cd8j8fh8spx_14978fa269e

But if the input string has "_" as its only special character, the stored
string is the same as the string passed in.

We would like to know which Elasticsearch API or code encodes strings with
special characters, so that we can apply the same encoding when searching
for values that contain these characters.

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1b82967c-5ddb-46a4-814f-8043b14de624%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

It is the client that does the character handling. If you use the HTTP
protocol, watch out for the URI escaping the client applies. '%nn' denotes a
byte in URI escaping; see the Wikipedia article on percent-encoding.

Jörg
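
A quick way to check this is to percent-decode the submitted value once and compare it with what was stored. The sketch below uses Python's standard urllib.parse.unquote; the variable names are illustrative, and the final lowercasing step is only an assumption (for example, a lowercasing analyzer or a step in the loading program), since nothing in this thread confirms where it happens.

```python
from urllib.parse import unquote

# Values copied from the original post.
submitted = "X3dPVA9%252bZZjFLd864e7U1udCbHZhJ77amNcaGtV7Zp6dJwl3LM%252fd1cD8j8fh8spX_14978fa269e"
stored = "x3dpva9%2bzzjfld864e7u1udcbhzhj77amncagtv7zp6djwl3lm%2fd1cd8j8fh8spx_14978fa269e"

# One round of percent-decoding turns "%25" into "%", so "%252b" becomes "%2b".
decoded_once = unquote(submitted)

# Assumption: some step also lowercases the text; percent-decoding alone does not.
candidate = decoded_once.lower()

# Report whether "decode once, then lowercase" reproduces the stored value.
print(candidate == stored)
```

If the comparison holds, the stored value is the submitted value with one layer of percent-encoding removed, which points at a URL-decoding step on the way in rather than an encoding step inside Elasticsearch.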

(Jörg's reply above was to Shobana Neelakantan's original post of Thu, Nov 6, 2014 at 7:46 AM.)

Can you please elaborate on what you mean by "It is the client that
does the character handling"?

I am part of the original poster's team and am working with her to crack
this problem.

The % is NOT introduced by any URI escaping; it is part of the actual
value being stored.

When a string contains both % (percent sign) and _ (underscore), the
record is stored in Elasticsearch after both % and _ are encoded in
hexadecimal eight format.

But if the string contains only "_", the record is stored as is.

So what we would like to know is:

What logic does Elasticsearch use to decide when to encode a particular
text using hexadecimal 8 values and when to store a string as is?

(The message above was a reply to Jörg Prante's message of Thursday, November 6, 2014 2:12:17 PM UTC+5:30.)

Elasticsearch does not decode or encode anything; it accepts UTF-8 data.

If you can describe

  • the client you use, or the program with which you load the data into ES

  • the protocol (I assume HTTP, but there is also the transport protocol for
    the native Java client)

then I might be able to find out more about the ES client behavior.

Jörg
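
To make the distinction concrete: a value sent inside a JSON request body travels unchanged, while the same value placed in a URL (path or query string) is subject to percent-encoding and decoding. A minimal sketch with Python's standard library follows; the field name "token" and the sample value are made up for illustration, and no Elasticsearch client is involved.

```python
import json
from urllib.parse import quote, unquote

# Hypothetical value containing both "%" and "_".
value = "abc%2bdef_123"

# Inside a JSON body the value round-trips byte for byte.
body = json.dumps({"token": value})
round_tripped = json.loads(body)["token"]

# In a URL, "%" must be escaped as "%25"; "_" is an unreserved
# character and is never escaped.
escaped = quote(value, safe="")

# Decoding once restores the value; decoding twice corrupts it.
decoded_once = unquote(escaped)
decoded_twice = unquote(decoded_once)

print(round_tripped)
print(escaped)
print(decoded_once)
print(decoded_twice)
```

Because "_" is never escaped, a string whose only special character is "_" looks identical before and after any URL encoding or decoding step, which matches the behavior described earlier in the thread.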

(Jörg's reply above was to Krishnan Mahadevan's message of Fri, Nov 7, 2014 at 4:19 PM.)

@jprante I faced the same problem; it seems Sense is also doing byte encoding.
If I run this request using curl:

    curl -XPUT "http://localhost:9200/twitter/tweet/1?pretty" -d'
    {"subscrname": "SìspatchS"}'

it throws the error: Invalid UTF-8 middle byte 0x73\n at [Source: org.elasticsearch.common.io.stream.InputStreamStreamInput@554b3b1; line: 2, column: 20].

Can you please let me know how I can index the string "SìspatchS"? (It contains a lowercase i with a grave accent, which is a valid Unicode character, so it should not break.)

Thanks,
Sumit
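
The error message is consistent with the request body not being UTF-8 encoded: if the terminal sends "ì" as the single Latin-1 byte 0xEC, a UTF-8 parser treats 0xEC as the start of a three-byte sequence and then rejects the following "s" (0x73). That the shell used Latin-1 is an assumption, not something confirmed in this thread. The sketch below reproduces the byte-level problem in Python and shows two ways to build a body that a UTF-8 parser accepts.

```python
import json

text = "S\u00ecspatchS"  # "SìspatchS", written with an escape to stay encoding-neutral

# The same text as bytes in two encodings.
latin1_bytes = text.encode("latin-1")  # "ì" becomes the single byte 0xEC
utf8_bytes = text.encode("utf-8")      # "ì" becomes the two bytes 0xC3 0xAC

# Latin-1 bytes are not valid UTF-8: 0xEC announces a three-byte
# sequence, but the next byte is 0x73 ("s").
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Option 1: send the JSON body explicitly encoded as UTF-8.
body_utf8 = json.dumps({"subscrname": text}, ensure_ascii=False).encode("utf-8")

# Option 2: keep the body pure ASCII by using a JSON \u escape.
body_ascii = json.dumps({"subscrname": text})

print(decodes_as_utf8)
print(body_ascii)
```

With curl, one way to avoid the shell re-encoding the text is to save the body in a UTF-8 file and send it with --data-binary @file, or to write the character as the JSON escape \u00ec directly in the command.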