I am taking some text from the web and giving it to Elasticsearch.
My problem is that there are many non-UTF-8 characters in the text, like
�, “, ã, etc.
I used Java's UTF-8 conversion, but it replaces all such
characters with question marks.
How can I convert the text so that it can be inserted
into Elasticsearch without losing any of the
special characters?
By the way, this is the code I used to convert to UTF-8:
// content -> s
bbuf = encoder.encode(CharBuffer.wrap(content));
// Convert ISO-LATIN-1 bytes in a ByteBuffer to a CharBuffer and then to a string.
// The new ByteBuffer is ready to be read.
CharBuffer cbuf = decoder.decode(bbuf);
s = cbuf.toString();
byte[] bytes = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
> it's replacing all such characters with question mark.
That is a sign that your text is not UTF-8. Make sure your source
documents are what you guessed (Latin-1?) and that they get properly
converted to UTF-8. (I do not understand your conversion code. Try to
explain what you want :))
When I try to give Elasticsearch the following text:
"Small businesses potentially in line for a lighter reporting load include
those with an annual turnover of less than £440,000, net assets of less
than £220,000 and fewer than ten employees"
through my Java application, ES complains that it can't understand £ and
it fails. Basically, my Java application takes this text from a web page
and gives it to Elasticsearch.
After filtering through the code below:
byte[] bytes = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
£ is converted to �.
If I copy the text to a file in my home directory using bash, it goes in fine.
I am really getting stressed out trying to solve this issue from my Java
application side.
Any pointers will help.
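A minimal sketch of why that two-line conversion produces � and ?, assuming the string already holds the correct characters when it is called (the class and method names are made up for illustration). Encoding to ISO-8859-1 and then decoding the same bytes as UTF-8 is not a conversion: it misreads the bytes, so any character outside ASCII is damaged.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Re-encodes text as ISO-8859-1 and then misreads those bytes as UTF-8,
    // which is what the two-line conversion in the post does.
    static String latin1ThenUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The pound sign (U+00A3) becomes the single byte 0xA3, which is not
        // valid UTF-8 on its own, so decoding substitutes U+FFFD.
        String pound = latin1ThenUtf8("\u00A3440,000");
        System.out.println(pound.charAt(0) == '\uFFFD');

        // The left curly quote (U+201C) has no ISO-8859-1 encoding, so
        // getBytes substitutes a literal question mark.
        String quote = latin1ThenUtf8("\u201C");
        System.out.println(quote.equals("?"));
    }
}
```

A Java String is already Unicode internally, so if the text was decoded correctly when the page was read, no further conversion step should be needed before building the JSON.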
How do your web pages encode form submissions? I suggest you tell the user's browser
that everything is UTF-8.
In a JSP, you can add this directive: <%@ page contentType="text/html; charset=UTF-8" %>
Save your JSP file in UTF-8 (in case you have special chars in it).
I don't know if you can tell ES that you are using another encoding. It's
perhaps only a JVM parameter.
It fails when I send it through my application.
Elasticsearch gives the following error: https://gist.github.com/1485063
If I copy the text from that site manually, write it to a file from
a bash shell or, say, gedit, and then run a normal curl command
against Elasticsearch to feed it the data, things work fine.
What should I do to achieve the same from my application?
I have been stuck on this issue for a couple of days.
Any help would be appreciated.
> It fails when I send it through my application.
> Elasticsearch gives the following error: https://gist.github.com/1485063
This problem has nothing to do with Elasticsearch. This is an issue
with your application not understanding encodings properly.
The page that you link to above claims to be UTF-8 (this may or may not
be true), and Firefox certainly seems to agree. That, of course, doesn't
mean that it doesn't also include some illegal non-UTF-8 characters.
When you request the page, you get back a stream of bytes. The first
thing you need to do is to convert those bytes from whatever encoding
they are in to strings that your programming language can understand.
In Perl, for instance, we'd use the Encode module:
use Encode qw(decode);
my $bytes = get_page($url);
my $content = decode('utf8', $bytes);
Now $content is a Unicode string that your application understands,
rather than just a stream of bytes in one encoding or another.
Note: web pages quite often claim to be one encoding, but the content is
actually in a different encoding. In Perl, we would use something like
Encode::Guess to try to determine the real encoding: https://metacpan.org/module/Encode::Guess
Your mileage may vary ;)
clint
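For the Java side, a rough equivalent of the Perl decode step could look like the sketch below. It assumes you have the raw response bytes in hand; the class and method names are hypothetical, and the ISO-8859-1 fallback is only an illustrative guess at the real encoding, not a rule. The point is to decode strictly, so that malformed input raises an error instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class PageDecoder {
    // Decodes bytes with the given charset and throws on malformed input
    // instead of substituting replacement characters.
    static String decodeStrict(byte[] raw, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    // Tries UTF-8 first; if the bytes are not valid UTF-8, falls back to
    // ISO-8859-1 (an assumed fallback chosen for this sketch).
    static String decodePage(byte[] raw) {
        try {
            return decodeStrict(raw, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Pound = {(byte) 0xC2, (byte) 0xA3};   // pound sign in UTF-8
        byte[] latin1Pound = {(byte) 0xA3};              // pound sign in ISO-8859-1
        System.out.println(decodePage(utf8Pound).equals(decodePage(latin1Pound)));
    }
}
```

A page that mixes encodings would still need per-fragment handling or a detection library; this only covers the case where the whole page is in one of the two encodings.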
On Fri, 2011-12-16 at 15:58 +0530, Vineeth Mohan wrote:
> Thanks for your reply, Clint.
> I tried the following methods:
> 1. Guess the encoding using ICU and set that as the base encoding
> 2. Use ICU Normalize
> Both of the above accommodated some of the non-UTF-8 characters, but not all.
> It seems I need to invest a lot more of my time in this.
> I felt this was a common use case and that someone would have found a
> silver bullet for it.
> Thanks
> Vineeth
This is really something that you need to do in your application, before
you hit Elasticsearch. ES expects to receive valid JSON (which must be
in UTF-8), so your application needs to do the decoding and cleaning up
of the data. This is not something that can be handled in ES.
clint
On Fri, Dec 16, 2011 at 3:33 PM, Clinton Gormley clint@traveljury.com wrote:
> Hi Vineeth
> > My application, when given a web page link, goes there, gets the text
> > and gives it to Elasticsearch.
> >
> > Now let's say the link given is:
> > http://finance.fortune.cnn.com/2011/12/14/investor-roundtable-experts/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+fortunetermsheet+%28Fortune+Finance%3A+Term+Sheet%29
> >
> > It fails when I send it through my application.
> > Elasticsearch gives the following error:
> > https://gist.github.com/1485063
I tried everything I could google or think of with this little multithreaded
brain of mine.
All I could do was change the error from
Caused by: org.elasticsearch.common.jackson.JsonParseException: Invalid UTF-8 start byte 0xa0
to
Caused by: org.elasticsearch.common.jackson.JsonParseException: Invalid UTF-8 middle byte 0x6e
That is, I moved the invalid byte from the start to the middle (cool, no?).
If anyone has the slightest idea where I am going wrong, please point it out.
The following code works better than all the previous attempts, but not for all
invalid UTF-8 characters.
byte[] bytes = content.getBytes("UTF-8");
parsed = new String(bytes, "ISO-8859-1");
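Both messages are consistent with single-byte (Latin-1 style) data being parsed as UTF-8. 0xa0 is how a non-breaking space (U+00A0) is encoded in ISO-8859-1, and 0x6e is a plain ASCII "n" that happened to follow a byte the parser took for the start of a multi-byte sequence. One way to check what is actually on the wire is to run the outgoing bytes through a strict UTF-8 decoder before sending them. The sketch below does that with the JDK only; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns the offset of the first byte sequence that is not valid UTF-8,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never yields more chars than input bytes, so this is large enough.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        String text = "a\u00A0b"; // 'a', non-breaking space, 'b'
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1); // 61 a0 62
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);        // 61 c2 a0 62
        System.out.println(firstInvalidOffset(asLatin1));
        System.out.println(firstInvalidOffset(asUtf8));
    }
}
```

If the check fails on the request body, the problem is in how the string is turned into bytes for sending, not in the string itself.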
When I changed
StringEntity se = new StringEntity(params);
to
StringEntity se = new StringEntity(params, "UTF-8");
everything went well.
It seems I had to specify the charset when sending JSON to Elasticsearch using
HttpClient.
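This fits the earlier error messages. As far as I know, the one-argument StringEntity constructor in Apache HttpClient 4.x falls back to ISO-8859-1 when no charset is given, so the JSON body went out as Latin-1 bytes. The JDK-only sketch below (the class and helper names are made up) shows the byte-level difference for a single pound sign, which is what the parser on the Elasticsearch side sees.

```java
import java.nio.charset.StandardCharsets;

public class BodyEncoding {
    // Formats bytes as space-separated lowercase hex, e.g. "c2 a3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3";
        // One byte, not valid UTF-8 on its own: a UTF-8 JSON parser rejects it.
        System.out.println(hex(pound.getBytes(StandardCharsets.ISO_8859_1)));
        // Two bytes forming a valid UTF-8 sequence.
        System.out.println(hex(pound.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Whatever HTTP client is used, the safe habit is to name the charset explicitly at the point where the string becomes bytes, rather than relying on a library or platform default.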
I'm having similar issues, but the question marks only show up in one instance of Elasticsearch and not in another. On my local Elasticsearch there is no problem with international characters, but on the Elasticsearch in a different environment they show up as question marks. The code is the same, so what could be causing the difference?