How can I specify the UTF-8 codec when searching ES from python?

This issue is probably due to my noobishness with ELK, Python, and Unicode.

I have an index containing logstash-digested logs, including a field
'host_req', which contains a host name.
Using elasticsearch-py, I'm pulling that host name out of the record and
using it to search another index.
However, if the hostname contains multibyte characters, the search fails
with a UnicodeDecodeError.
Exactly the same query works fine when I enter it from the command line
with 'curl -XGET'.
The Unicode character is a lowercase 'a' with a diaeresis (two dots). Its
UTF-8 encoding is C3 A4, and its Unicode code point is U+00E4 (the
language is Swedish).
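
As a sanity check, the relationship between the code point and the UTF-8 bytes can be confirmed in a Python 2 shell:

>>> print repr(u'\u00e4')
u'\xe4'
>>> print repr(u'\u00e4'.encode('utf-8'))
'\xc3\xa4'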

These curl commands work just fine from the command line:

curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d '
{ "query" : { "match" : { "req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'

curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d '
{ "query" : { "match" : { "req_host" : "www.utklädningskläderna.se" }}}'

They find and return the record.

(The second curl command shows how the hostname appears in the log I pull
it from, with the lowercase 'a' with a diaeresis appearing in two places.)

I've written a very short Python script to show the problem: it uses
hardwired queries, printing each one and its type, then trying to use it
in a search.
---- start code ----

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import elasticsearch

es = elasticsearch.Elasticsearch()

if __name__ == "__main__":
    #uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}'           # raw utf-8 characters. does not work
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}'  # quoted unicode characters. does not work
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}'  # quoted utf-8 bytes. does not work
    uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}'                       # ASCII only. works fine
    print "uq", type(uq), uq
    result = es.search(index="logstash-2015.01.30", doc_type="logs", timeout=1000, body=uq)
    if result["hits"]["total"] == 0:
        print "nothing found"
    else:
        print "found some"

---- end code ----

If I run it as shown, with the 'facebook' query, it's fine - the output is:

$ python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}}
found some
$

Note that the query string 'uq' is unicode.

But if I use any of the other three strings, which include the Unicode
characters, I get:

$ python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}
Traceback (most recent call last):
  File "testutf8b.py", line 15, in <module>
    result = es.search(index="logstash-2015.01.30", doc_type="logs", timeout=1000, body=uq)
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search
  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128))

This is under CentOS 7, using ES 1.5.0. The logs were digested into ES
under a slightly older version, using Logstash 1.4.2.

Any ideas? The ES documentation has sections about codecs, but those are
about analysis. This looks to me like an elasticsearch-py library issue
(or I'm doing something stupid).

thanks!

PT

Hi Peter,

the easiest way is to just use Python dictionaries rather than strings
containing JSON. Then everything will be correctly encoded to and decoded
from UTF-8, and you can just work with unicode strings in your app.
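
For example, something along these lines should do it (a minimal sketch reusing the index, doc type, and field name from your post):

---- start code ----

# -*- coding: utf-8 -*-

import elasticsearch

es = elasticsearch.Elasticsearch()

# Build the query as a plain dict. The client serializes it to JSON and
# sends it as UTF-8, so the unicode hostname needs no special handling.
query = {
    "query": {
        "match": {
            "req_host": u"www.utklädningskläderna.se"
        }
    }
}

result = es.search(index="logstash-2015.01.30", doc_type="logs", body=query)
if result["hits"]["total"] == 0:
    print "nothing found"
else:
    print "found some"

---- end code ----

The response comes back already parsed into dicts and unicode strings as well, so there is no JSON handling to do on your side either.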

Hope this helps

--
Honza Král
Python Engineer
honza.kral@elastic.co
