I have not tried indexing an array of attachments myself, but a single
attachment per document works fine for me, and your mappings look good to me.
If I were you, I would first check whether other words from that document can
be found by search. Then I would check what content Tika was able to get out
of this document (there is a Swing console for Tika that can be used for quick
tests, see: http://tika.apache.org/0.7/gettingstarted.html#Using_Tika_as_a_command_line_utility
, just run it, drag and drop the document onto it, and see the content Tika
is able to extract). If the word is there, I would then check the
elasticsearch index with Luke.

I did check whether Tika could make something of the Word document,
and it could. It read the document successfully.
I got this error message (as part of a MapperParsingException and a
backtrace):
Caused by: org.elasticsearch.common.jackson.JsonParseException:
Failed to decode VALUE_STRING as base64 (MIME-NO-LINEFEEDS): Illegal
character '\' (code 0x5c) in base64 content
I use PHP, and PHP's json_encode escapes / characters as \/. So
throughout the whole document there were a lot of \/ sequences. As a test
I replaced all \/ occurrences back to / and then it did index. So
there seems to be a bug either in PHP or in Jackson, the JSON parser
Elasticsearch uses. I'm going to make a proof of concept so I can post
this as a bug report.
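
For illustration, here is a minimal Python sketch of the situation described above. It is not the PHP code from this thread: the three sample bytes, the "file" field name, and the manual replace() that mimics PHP's default slash escaping are all assumptions made for the example. It shows that base64 text can contain /, that the escaped form contains a backslash (0x5c) which is not in the base64 alphabet, and that decoding the JSON escape first restores the original bytes.

```python
import base64
import binascii
import json

# Stand-in for real file bytes (assumption); these encode to "////".
raw = b"\xff\xff\xff"
encoded = base64.b64encode(raw).decode("ascii")

# Python's json.dumps leaves "/" alone, so mimic PHP's default
# json_encode escaping by hand. "file" is a hypothetical field name.
php_style_json = json.dumps({"file": encoded}).replace("/", "\\/")

# A JSON parser that honours the \/ escape yields plain "/" again,
# so the value is valid base64 once the JSON layer has been decoded.
decoded_value = json.loads(php_style_json)["file"]
round_trip_ok = base64.b64decode(decoded_value, validate=True) == raw

# Handing the still-escaped text to a strict base64 decoder fails,
# because the backslash is not part of the base64 alphabet.
escaped_text = encoded.replace("/", "\\/")
try:
    base64.b64decode(escaped_text, validate=True)
    strict_decode_failed = False
except binascii.Error:
    strict_decode_failed = True
```

The replace-back test mentioned above corresponds to turning escaped_text into encoded again before it reaches the base64 decoder.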

However, looking at this chart on JSON.org (http://www.json.org/string.gif),
it does say that escaping / is standard. But apart from this chart, nothing
seems to mention escaping the / char.
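
As a quick check of what that chart implies, here is a small Python sketch using the standard library json module (an assumption for illustration; the thread itself uses PHP and Jackson). A parser that follows the string grammar should treat \/ and / as the same character.

```python
import json

# The same two-letter string with a slash, once escaped and once plain.
escaped = json.loads(r'"a\/b"')
plain = json.loads('"a/b"')

# Both forms should decode to an identical Python string.
same_result = escaped == plain
```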

Never mind, it seems that / is an escapable character, although some people
think it doesn't have to be escaped. I haven't found anything yet that says
it doesn't have to be escaped, so I guess that's not true.
It looks like a bug in PHP (see: http://bugs.php.net/bug.php?id=49366),
although they don't seem to think it's a bug. Either way, Jackson won't
handle \/ as an escaped character.