What to do with non utf8 characters

vineeth_mohan · December 15, 2011, 6:49am

Hi,

I am taking some text from the web and giving it to Elasticsearch.
My problem is that there are many non-UTF-8 characters in the text, like
�, “, ã etc.
I used Java's UTF-8 conversion, but it replaces all such
characters with question marks.

How can I convert the text so that it can be inserted
into Elasticsearch without losing any of the
special characters?

Thanks
Vineeth

vineeth_mohan · December 15, 2011, 7:15am

By the way, this is the code I used to convert to UTF-8:

    // content -> s
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(content));
    // Convert ISO-LATIN-1 bytes in a ByteBuffer to a CharBuffer
    // and then to a string. The new ByteBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    s = cbuf.toString();
    byte[] bytes = s.getBytes("ISO-8859-1");
    s = new String(bytes, "UTF-8");

Karussell1 · December 15, 2011, 8:28am

its replacing all such characters with question mark.

That is a sign that your text is not UTF8. Make sure your source
documents are what you guessed (latin1?) and that they get properly
converted to UTF8. (I do not understand your converting code. Try to
explain what you want :))

Peter.

vineeth_mohan · December 15, 2011, 9:50am

What is the usual encoding of documents on the internet? Is it ISO-8859-1 or
something else?

Thanks
Vineeth

vineeth_mohan · December 15, 2011, 10:36am

When I try to give Elasticsearch the following text:
"Small businesses potentially in line for a lighter reporting load include
those with an annual turnover of less than £440,000, net assets of less
than £220,000 and fewer than ten employees"

through my Java application (which takes this text from a web page and
gives it to Elasticsearch), ES complains that it
can't understand £, and the request fails.
After filtering the text through the code below:

    byte[] bytes = s.getBytes("ISO-8859-1");
    s = new String(bytes, "UTF-8");

£ is converted to �.
If I copy the text to a file in my home directory using bash, it goes in fine.

I am really getting stressed out trying to solve this issue from the Java
application side.
Any pointers will help.

Thanks
Vineeth
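
A small sketch of why that conversion produces �, assuming `s` already holds correctly decoded text. The class and method names are illustrative and not taken from the original application. `getBytes("ISO-8859-1")` encodes the pound sign as the single byte 0xA3, which is not a valid UTF-8 sequence on its own, so decoding it as UTF-8 yields the replacement character U+FFFD:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original application.
class RoundTripDemo {
    // Reproduces the conversion from the post: encode as Latin-1, decode as UTF-8.
    static String misconvert(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Pound sign written as an escape so the source file encoding does not matter.
        String text = "\u00A3440,000";
        String out = misconvert(text);
        // The lone byte 0xA3 is malformed UTF-8, so Java substitutes U+FFFD.
        System.out.println(out.charAt(0) == '\uFFFD'); // prints "true"
        System.out.println(out.substring(1));          // prints "440,000"
    }
}
```

If this is what happens, the original character cannot be recovered from the resulting string, which would suggest leaving correctly decoded strings alone and fixing the encoding only at the points where bytes enter or leave the program.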

vineeth_mohan · December 16, 2011, 3:46am

Hi All,

My team is stuck on this. It would be great if someone could help.

Thanks
Vineeth

vineeth_mohan · December 16, 2011, 7:04am

Finally came up with a fix:

    CharsetDetector cs = new CharsetDetector();
    cs.setText(content.getBytes());
    String sourceEncoding = cs.detect().getName();
    logger.debug("ENCODING is" + sourceEncoding);
    if (sourceEncoding.equals("Big5")) {
        logger.debug("ENCO skipping as encoding is Big5 , content is " + content);
        return content;
    }
    String s = null;
    try {
        byte[] bytes = content.getBytes(sourceEncoding);
        s = new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

Thanks
Vineeth

vineeth_mohan · December 16, 2011, 7:40am

Even that is not working ... am giving up

~Vineeth

dadoonet · December 16, 2011, 7:59am

Encoding is sometimes very complicated...

How do your web pages encode form submissions? I suggest you tell the user's browser
that everything is UTF-8.
In a JSP, you can add this directive:

    <%@ page contentType="text/html; charset=UTF-8" %>

Save your JSP file in UTF-8 (in case you have special chars in it).

I don't know if you can tell ES that you are using another encoding. It's
perhaps only a JVM parameter.

HTH
David.

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet

vineeth_mohan · December 16, 2011, 8:15am

I am not using JSP or anything like it.

My application, when given a web page link, fetches the text from that page
and gives it to Elasticsearch.

Now let's say the link given is:
http://finance.fortune.cnn.com/2011/12/14/investor-roundtable-experts/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+fortunetermsheet+(Fortune+Finance%3A+Term+Sheet)

It fails when I send it through my application.
Elasticsearch gives the following error: https://gist.github.com/1485063

Now if I copy the text from that site manually, write it to a file from
a bash shell or, say, gedit, and then run a normal curl command
to feed the data to Elasticsearch, things work fine.

What should I do to achieve the latter from my application?
I have been stuck on this issue for a couple of days.
Any help would be appreciated.

Thanks
Vineeth

Clinton_Gormley · December 16, 2011, 10:03am

Hi Vineeth

> My application, when given a web page link, fetches the text from that page
> and gives it to Elasticsearch.
>
> Now let's say the link given is:
> http://finance.fortune.cnn.com/2011/12/14/investor-roundtable-experts/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+fortunetermsheet+(Fortune+Finance%3A+Term+Sheet)
>
> It fails when I send it through my application.
> Elasticsearch gives the following error:
> https://gist.github.com/1485063

This problem has nothing to do with Elasticsearch. This is an issue
with your application not understanding encodings properly.

The page that you link to above claims to be UTF-8 (this may or may not
be true) but Firefox certainly seems to agree. That, of course, doesn't
mean that it doesn't also include some illegal non-UTF-8 characters.

When you request the page, you get back a stream of bytes. The first
thing you need to do is to convert those bytes from whatever encoding
they are in to strings that your programming language can understand.

In Perl, for instance, we'd use the Encode module:

    use Encode qw(decode);
    my $bytes = get_page($url);
    my $content = decode('utf8', $bytes);

Now $content is a unicode string that your application understands,
rather than just a stream of bytes in one encoding or another.

Note: web pages quite often claim to be one encoding, but the content is
actually in a different encoding. In Perl, we would use something like
Encode::Guess to try to determine the real encoding:
https://metacpan.org/module/Encode::Guess

Your mileage may vary

clint
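
For readers working in Java, the same decoding step might look roughly like the sketch below. It uses only the JDK; the class name, method names and the choice of fallback charset are assumptions made for illustration. It decodes the raw bytes strictly with the declared charset, so malformed input raises an error instead of silently turning into �, and only then falls back to a second charset:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative.
class StrictDecode {
    // Decode strictly: fail on bad input instead of inserting U+FFFD.
    static String decode(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Try the charset the page declares first, then a fallback chosen by the caller.
    static String decodeWithFallback(byte[] bytes, Charset declared, Charset fallback) {
        try {
            return decode(bytes, declared);
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Pound = {(byte) 0xC2, (byte) 0xA3}; // pound sign encoded as UTF-8
        byte[] latin1Pound = {(byte) 0xA3};            // pound sign encoded as ISO-8859-1
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(decodeWithFallback(utf8Pound, utf8, latin1).equals("\u00A3"));   // prints "true"
        System.out.println(decodeWithFallback(latin1Pound, utf8, latin1).equals("\u00A3")); // prints "true"
    }
}
```

The important point is that the input is the byte array received from the server, not a `String` that has already been decoded somewhere else.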

vineeth_mohan · December 16, 2011, 10:28am

Thanks for your reply, Clint.
I tried the following methods:

1. Guess the encoding using ICU and set that as the base encoding
2. Use ICU Normalize

Both of the above handled some of the non-UTF-8 characters, but not all.
It seems I need to invest a lot more time in this.

I felt this was a common use case and that someone would have found a silver
bullet for it.

Thanks
Vineeth

Clinton_Gormley · December 16, 2011, 11:13am

Hi Vineeth

On Fri, 2011-12-16 at 15:58 +0530, Vineeth Mohan wrote:

> Thanks for your reply, Clint.
> I tried the following methods:
> 1. Guess the encoding using ICU and set that as the base encoding
> 2. Use ICU Normalize
> Both of the above handled some of the non-UTF-8 characters, but not all.
> It seems I need to invest a lot more time in this.
>
> I felt this was a common use case and that someone would have found a
> silver bullet for it.

This is really something that you need to do in your application, before
you hit Elasticsearch. ES expects to receive valid JSON (which must be
in UTF-8), so your application needs to do the decoding and cleaning up
of the data. This is not something that can be handled in ES.

clint

vineeth_mohan · December 16, 2011, 2:31pm

I tried everything I could google or think of with this little multithreaded
brain of mine.
All I could do was change the error from

    Caused by: org.elasticsearch.common.jackson.JsonParseException: Invalid UTF-8 start byte 0xa0

to

    Caused by: org.elasticsearch.common.jackson.JsonParseException: Invalid UTF-8 middle byte 0x6e

That is, I moved the invalid byte from the start to the middle (cool, no?).

If anyone has the slightest idea where I am going wrong, please point it out.

The following code works better than all the previous ones, but not for all
invalid UTF-8 characters.

    byte[] bytes = content.getBytes("UTF-8");
    parsed = new String(bytes, "ISO-8859-1");

Thanks
Vineeth

vineeth_mohan · December 17, 2011, 9:24am

Here is what I finally did:

I observed that all the special characters that were reaching me appeared
in this format in the HTML source: Ř
Before putting my data into ES, I converted the special characters to
the above format.
And all is well.

Thanks
Vineeth
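
If "this format" means HTML numeric character references (an assumption, since the example in the post appears here as an already rendered character), the conversion could be sketched as below. The class and method names are hypothetical. Every code point outside ASCII is replaced by `&#NNN;`, so the request body contains only ASCII bytes and no longer depends on which charset the HTTP client uses:

```java
// Hypothetical helper; names are illustrative.
class NumericEntities {
    // Replace every non-ASCII code point with a decimal character reference.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00A3 is the pound sign, written as an escape to avoid source-encoding issues.
        System.out.println(escapeNonAscii("\u00A3440,000")); // prints "&#163;440,000"
    }
}
```

The trade-off is that the index then stores the reference text rather than the character itself, so this works around the transport problem without fixing it.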

vineeth_mohan · December 17, 2011, 2:45pm

OK, so here is what the real issue was.

When I changed

    StringEntity se = new StringEntity(params);

to

    StringEntity se = new StringEntity(params, "UTF-8");

everything went well.
It seems I had to specify the charset when sending JSON to Elasticsearch using
HttpClient.

Thanks
Vineeth
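
A JDK-only sketch of what probably changed on the wire. It assumes that the one-argument `StringEntity` constructor in the HttpClient version used here encoded the body as ISO-8859-1, which is my understanding of the 4.x default but is worth checking against the documentation for your version. The class and method names are illustrative. A non-breaking space (U+00A0) becomes the single byte `a0` in ISO-8859-1, which matches the "Invalid UTF-8 start byte 0xa0" error reported earlier, while UTF-8 produces the valid two-byte sequence `c2 a0`:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use HttpClient itself.
class BodyEncodingDemo {
    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // non-breaking space, common in scraped HTML
        System.out.println(hex(nbsp.getBytes(StandardCharsets.ISO_8859_1))); // prints "a0"
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8)));      // prints "c2 a0"
    }
}
```

This would also explain why the curl test worked: the file written from the shell was presumably already UTF-8, so the bytes sent matched what the JSON parser expected.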

anand · March 3, 2014, 1:54am

Thanks Vineeth, you saved my day.

riva · May 10, 2016, 1:00am

I'm having similar issues, but the ? marks only show up in one instance of Elasticsearch and not in another. On my local Elasticsearch there is no problem with international characters, but on the Elasticsearch in a different environment they show up as ? marks. The code is the same; what could be causing the difference?
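
One possible cause, not confirmed anywhere in this thread: if the client code converts strings to bytes without naming a charset, the result depends on the JVM's default charset, which can differ between machines even when the code is identical. A small sketch for checking this, with illustrative class and method names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic class; names are illustrative.
class DefaultCharsetCheck {
    // Encoding with an explicit charset gives the same bytes in every environment.
    static byte[] encodeExplicitly(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // pound sign, written as an escape
        // These two lines depend on the environment, so no expected output is given.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default encoding length: " + pound.getBytes().length);
        // This one does not depend on the environment.
        System.out.println("UTF-8 encoding length: " + encodeExplicitly(pound).length); // prints "UTF-8 encoding length: 2"
    }
}
```

Running this in both environments and comparing the first two lines would show whether the default charset is the difference.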
