Treatment of special characters in elasticsearch

I use the following analyzer:

curl -XPUT 'http://localhost:9200/sample/' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          }
        }
      }
    }
  }
}'

When I then index documents that contain special characters such as %,
those characters are converted to hex.

1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8
-> actual value

1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8

-> stored value.

Sample:

curl -XPUT 'http://localhost:9200/sample/strom/1' -d '{
  "user": "user1",
  "message": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'

The problem started occurring only once the data had grown past some million
documents. Earlier, the value used to be stored as is.

Now if I try to search using,

1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8

it is not able to retrieve the document. How do I deal with this? The
conversion of special characters to hex seems to be non-deterministic.

I am unable to reproduce the issue on my local machine.

Can someone explain the mistake I am making?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6c8dff11-8ab4-4acf-8e85-4b4c93b270f7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Check your client. If you use curl from a shell, it is the shell or curl
that handles the characters, for example by URI percent-encoding them.

Once Elasticsearch has received the data, it does not do any extra
conversion. It expects UTF-8.

Jörg
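
To illustrate the point about client-side percent-encoding, here is a minimal Python sketch. It is not taken from the poster's setup: it only assumes that some layer in front of Elasticsearch applies URI percent-encoding to the value, and uses the standard urllib.parse module to show what that does to the sample string from the question.

```python
from urllib.parse import quote, unquote

# Sample value from the question; it already contains literal "%" characters.
original = "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"

# Assumption: a client layer percent-encodes the value before sending it.
# Every literal "%" then becomes "%25", so the string that arrives differs.
encoded_once = quote(original, safe="")

# A search for the original string cannot match the encoded form.
matches = (encoded_once == original)

# Decoding exactly once restores the value the application meant to send.
decoded = unquote(encoded_once)

print(encoded_once[:8])      # 1%252fPJ
print(matches)               # False
print(decoded == original)   # True
```

If the stored values turn out to contain "%25" where the application sent "%", that would point at an encoding step in the client path rather than at the analyzer.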

I am using the Java Transport client. Where do we specify how characters
are handled there?

Then you have an error in your program.

I tried it and it works. There is no character handling, except for the
lowercase filter.

See:

Jörg
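
As a rough illustration of why the analyzer itself should leave "%" alone, the following Python sketch imitates the three steps of the custom analyzer from the question (keyword tokenizer, trim filter, lowercase filter). The function name simulate_default_analyzer is hypothetical, and the code is only an approximation written for this thread, not Elasticsearch or Lucene code; Python's str.lower() may differ from Lucene's lowercasing for some Unicode input.

```python
from typing import List


def simulate_default_analyzer(text: str) -> List[str]:
    """Approximate keyword tokenizer + trim + lowercase (illustration only)."""
    token = text            # keyword tokenizer: the whole input is one token
    token = token.strip()   # trim filter: remove leading/trailing whitespace
    token = token.lower()   # lowercase filter: the only character change
    return [token]


value = "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
tokens = simulate_default_analyzer(value)

# The "%" characters pass through untouched; only letter case changes.
print(tokens[0].count("%"))        # 2
print(tokens[0] == value.lower())  # True
```

Under these assumptions, a search string passed through the same steps yields the same single token, so a mismatch would have to come from the value having been altered before it reached the index.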
