I am not using JSP or anything like it.
My application, when given a web page link, fetches the page, extracts the
text, and sends it to ElasticSearch.
Now let's say the link given is:
http://finance.fortune.cnn.com/2011/12/14/investor-roundtable-experts/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+fortunetermsheet+(Fortune+Finance%3A+Term+Sheet)
It fails when I send it through my application.
ElasticSearch gives the following error: https://gist.github.com/1485063
If I copy the text from that site manually, write it to a file from a
bash shell or, say, gedit, and then run a plain curl command
to feed that data to ElasticSearch, things work fine.
What should I do to achieve the latter from my application?
I have been stuck on this issue for a couple of days.
Any help would be appreciated.
Thanks
Vineeth
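One possible cause, offered only as a guess from the symptoms described above, is that the application turns the text into bytes with the platform default charset, while the file written from the shell is UTF-8. Below is a minimal Java sketch of sending the request body as explicit UTF-8. The class name Utf8Post, its method names, and the use of HttpURLConnection are illustrative assumptions, not the application's actual code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class Utf8Post {
    /** Encodes the JSON text as UTF-8 explicitly instead of relying on the platform default charset. */
    static byte[] utf8Body(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    /** PUTs the JSON document to the given URL and returns the HTTP status code. */
    static int put(String url, String json) throws IOException {
        byte[] body = utf8Body(json);
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // Declare the charset so the receiver does not have to guess it.
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        conn.setFixedLengthStreamingMode(body.length);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }
}
```

A call might look like Utf8Post.put("http://localhost:9200/myindex/mytype/1", json), where the host, index, type, and id are placeholders.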
On Fri, Dec 16, 2011 at 1:29 PM, david@pilato.fr <david@pilato.fr> wrote:
Encoding is sometimes very complicated...
How do your web pages encode form submissions? I suggest you tell the user's
browser that everything is UTF-8.
In a JSP, you can add this directive: <%@ page contentType="text/html; charset=UTF-8" %>
Save your JSP file in UTF-8 (in case you have special chars in it).
I don't know if you can tell ES that you are using another encoding. It's
perhaps only a JVM parameter.
HTH
David.
On December 16, 2011 at 08:40, Vineeth Mohan <vineethmohan@algotree.com> wrote:
Even that is not working... I am giving up.
~Vineeth
On Fri, Dec 16, 2011 at 12:34 PM, Vineeth Mohan <vineethmohan@algotree.com> wrote:
I finally came up with a fix:
// Detect the encoding of the text with CharsetDetector.
CharsetDetector cs = new CharsetDetector();
cs.setText(content.getBytes());
String sourceEncoding = cs.detect().getName();
logger.debug("ENCODING is " + sourceEncoding);
if (sourceEncoding.equals("Big5")) {
    // Leave the content untouched when the detector reports Big5.
    logger.debug("ENCO skipping as encoding is Big5, content is " + content);
    return content;
}
String s = null;
try {
    // Re-encode with the detected charset, then decode those bytes as UTF-8.
    byte[] bytes = content.getBytes(sourceEncoding);
    s = new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Thanks
Vineeth
On Fri, Dec 16, 2011 at 9:16 AM, Vineeth Mohan <vineethmohan@algotree.com> wrote:
Hi All,
My team is stuck on this. It would be great if someone could help.
Thanks
Vineeth
On Thu, Dec 15, 2011 at 4:06 PM, Vineeth Mohan <vineethmohan@algotree.com> wrote:
When I try to give ElasticSearch the following text:
"Small businesses potentially in line for a lighter reporting load
include those with an annual turnover of less than £440,000, net assets of
less than £220,000 and fewer than ten employees"
1. Through my Java application: basically, my Java application takes
this text from a web page and gives it to ElasticSearch. ES complains that it
can't understand £, and it fails.
2. After filtering through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
3. I copy it to a file in my home directory using bash, and it goes in fine.
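The result in step 2 can be reproduced in isolation. In ISO-8859-1 the pound sign is the single byte 0xA3, which on its own is not a valid UTF-8 sequence, so the UTF-8 decoder substitutes the replacement character U+FFFD. The same round trip only helps when the string was already mis-decoded, for example when it contains "Â£" instead of "£". A small Java sketch, with PoundSignDemo and latin1BytesAsUtf8 as hypothetical names:

```java
import java.nio.charset.StandardCharsets;

final class PoundSignDemo {
    /** Re-interprets the ISO-8859-1 bytes of s as UTF-8, as in step 2 above. */
    static String latin1BytesAsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A correctly decoded string: the pound sign becomes byte 0xA3,
        // which is malformed as UTF-8 and is replaced by U+FFFD.
        String broken = latin1BytesAsUtf8("less than \u00A3440,000");
        System.out.println(broken.contains("\uFFFD")); // true

        // A string that was wrongly decoded as ISO-8859-1 from UTF-8 bytes
        // (0xC2 0xA3 shown as two chars): here the round trip repairs it.
        String repaired = latin1BytesAsUtf8("\u00C2\u00A3");
        System.out.println(repaired.equals("\u00A3")); // true
    }
}
```

So whether the conversion helps or hurts depends on how the string was decoded when the page was first read.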
I am really getting stressed out trying to solve this issue from my
Java application side.
Any pointers will help.
Thanks
Vineeth
On Thu, Dec 15, 2011 at 3:20 PM, Vineeth Mohan <vineethmohan@algotree.com> wrote:
What is the usual encoding of documents on the internet? Is it
ISO-8859-1 or something else?
Thanks
Vineeth
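There is no single encoding for web documents: each HTTP response can name its own in the Content-Type header (an HTML page may also declare one in a meta tag, which this sketch ignores). A minimal Java sketch of reading that charset and decoding the raw bytes with it follows. ContentTypeCharset and its methods are hypothetical names, and falling back to ISO-8859-1 (the historical HTTP/1.1 default for text types) is an assumption, not a rule that fits every site.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    /** Returns the charset named in a Content-Type header value, or the fallback if absent or unknown. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the raw response body once, using the declared charset (assumed fallback: ISO-8859-1). */
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType, StandardCharsets.ISO_8859_1));
    }
}
```

Decoding the raw bytes once with the right charset avoids having to repair the string afterwards.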
On Thu, Dec 15, 2011 at 1:58 PM, Karussell <tableyourtime@googlemail.com> wrote:
"It's replacing all such characters with a question mark."
That is a sign that your text is not UTF-8. Make sure your source
documents are in the encoding you guessed (latin1?) and that they get properly
converted to UTF-8. (I do not understand your conversion code. Try to
explain what you want :))
Peter.
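Proper conversion, in the sense described above, means decoding the original bytes with their real charset and only then encoding the result as UTF-8. A small Java sketch, assuming the raw bytes of the page are still available and the source charset is known; Transcoder and toUtf8 are hypothetical names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    /** Decodes raw bytes with their actual source charset, then encodes the text as UTF-8. */
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For latin1 input, the single byte 0xA3 (the pound sign) comes out as the two UTF-8 bytes 0xC2 0xA3.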
--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet