arkadyzalko
(Arkadyzalko)
September 29, 2015, 10:35pm
1
Hi,
I'm having problems indexing some PDF files.
Searching works fine for PDFs with embedded fonts, but if the font is not embedded, Elasticsearch is not able to index the full text.
I'm using this file:
https://drive.google.com/file/d/0B9fkTrpP5kPVcHZPOU1iMDktRDA/view?usp=sharing
My test shell script:
host=localhost:9200
curl -X PUT "${host}/test/document/_mapping" -d '{
  "document": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "title": { "store": "yes" },
          "file": { "term_vector": "with_positions_offsets", "store": "yes" }
        }
      }
    }
  }
}'
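# Read the whole file at once (-0777) so the Base64 output is a single valid string with no line breaks
coded=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' not_found.pdf)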
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
curl -X POST "${host}/test/document/" -d @json.file
echo
curl -XPOST "${host}/_refresh"
curl "${host}/_search?pretty=true" -d '{
  "fields" : ["title"],
  "query" : {
    "query_string" : {
      "query" : "butter"
    }
  },
  "highlight" : {
    "fields" : {
      "file" : {}
    }
  }
}'
What did I do wrong?
Thanks.
dadoonet
(David Pilato)
September 29, 2015, 10:51pm
2
That may be an issue with Apache Tika, which Elasticsearch uses behind the scenes.
If you are a Java developer, there is a small "main" application you can use to check what Tika is actually doing.
See https://github.com/elastic/elasticsearch-mapper-attachments/blob/master/src/test/java/org/elasticsearch/index/mapper/attachment/test/standalone/StandaloneRunner.java
I use it often.
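For readers who are not Java developers, a similar check can probably be done from the shell with the Tika command-line app. A rough sketch, assuming a `tika-app` jar has been downloaded separately; the jar path is a placeholder and the helper name `tika_text` is hypothetical. The Tika version bundled with the plugin may differ from the downloaded one, so the output may not match exactly what gets indexed:

```shell
# Hypothetical helper: print the plain text Tika extracts from a document.
# $1 = path to a tika-app jar (placeholder), $2 = document to inspect.
tika_text() {
  java -jar "$1" --encoding=UTF-8 --text "$2"
}
```

Possible usage: `tika_text /path/to/tika-app.jar not_found.pdf | less`. If the text is already garbled here, the problem lies in extraction rather than in the mapping or the query.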
arkadyzalko
(Arkadyzalko)
September 29, 2015, 11:05pm
3
I've changed my mapping to this:
"document": {
  "properties": {
    "file": {
      "type": "attachment",
      "fields": {
        "title": { "store": "yes" },
        "file": {
          "term_vector": "with_positions_offsets",
          "store": "yes",
          "type": "string"
        }
      }
    }
  }
}
I found "\ufffd" characters in many parts of my PDF. I think it is an encoding problem. Do you know how I can fix that?
Font information below:
arkady@aevo:~/$ pdffonts not_found.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Arial                                CID TrueType      Identity-H       yes no  yes     12  0
[none]                               Type 3            Custom           yes no  yes     17  0
[none]                               Type 3            Custom           yes no  yes    274  0
Source Sans Pro                      CID TrueType      Identity-H       yes no  yes    531  0
[none]                               Type 3            Custom           yes no  yes    536  0
Source Sans Pro                      CID TrueType      Identity-H       yes no  yes    793  0
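The "\ufffd" sequences mentioned above are U+FFFD, the Unicode replacement character. To get a rough measure of how much of an extracted text is affected, the occurrences can be counted. A small sketch, assuming the text is UTF-8 and that `grep` supports `-o` (GNU and BSD grep do); the helper name `count_replacement_chars` is hypothetical:

```shell
# Hypothetical helper: count U+FFFD replacement characters read from stdin.
# U+FFFD is the three bytes EF BF BD in UTF-8; LC_ALL=C makes grep match raw bytes.
count_replacement_chars() {
  LC_ALL=C grep -o "$(printf '\357\277\275')" | wc -l | tr -d ' '
}
```

Possible usage: `count_replacement_chars < extracted.txt`, where `extracted.txt` is a placeholder for a file holding the extracted text. The count is only a lower bound on the damage, since it does not catch characters that were mapped to wrong but valid code points.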
arkadyzalko
(Arkadyzalko)
September 29, 2015, 11:24pm
4
When I run the file through Tika directly, I get the plain text below. Can I fix the encoding used by Tika?
9/25/2015 Best Chocolate Chip Cookies Printer Friendly Allrecipes.com
http://allrecipes.com/recipe/10813/bestchocolatechipcookies/print/ 1/1
�
�- +
ÖÔ�(
�**&
ÕÔ�(
� ��4�
)
Õ�#
� ./��#*�*'�/ ��#$+��**&$ .
Recipe By: �*-�
��-$.+� �" .ý��# 24�($��' .ü�
)"- �$ )/.
Õ��0+��0// -ý�.*!/ ) �
��0+�2#$/ �.0"�-
��0+�+��& ���-*2)�.0"�-
Ö� "".
�/ �.+**).�1�)$''�� 3/-��/
��0+.��''�+0-+*. �!'*0-
��/ �.+**)���&$)"�.*��
Ö�/ �.+**).�#*/�2�/ -
Õ!Ö�/ �.+**)�.�'/
��0+.�. ($.2 /��#*�*'�/ ��#$+.
��0+��#*++ ��2�')0/.
�$- �/$*).
�- # �/�*1 )�/*�×ÙÔ�� "- .����ÕÛÙ�� "- .���ü
�- �(�/*" /# -�/# ��0// -ý�2#$/ �.0"�-ý��)���-*2)�.0"�-�0)/$'�.(**/#ü�� �/�$)�/# � "".�*) ��/���/$( ý
/# )�./$-�$)�/# �1�)$''�ü��$..*'1 ���&$)"�.*���$)�#*/�2�/ -ü�����/*���// -�'*)"�2$/#�.�'/ü��/$-�$)�!'*0-ý
�#*�*'�/ ��#$+.ý��)��)0/.ü��-*+��4�'�-" �.+**)!0'.�*)/*�0)"- �. ��+�).ü
��& �!*-���*0/�ÕÔ�($)0/ .�$)�/# �+- # �/ ��*1 )ý�*-�0)/$'� �" .��- �)$� '4��-*2) �ü
�
��
� ������������*�ÖÔÕÙ��''- �$+ .ü�*(�
�-$)/ ���-*(��''- �$+ .ü�*(�Ý!ÖÙ!ÖÔÕÙ
Õ
Ö
×
��-" /
ÝÔØÔ��*' -�$)��1
�
��
����
ý�� �ØÙÖÙÕ
�+*).*- �
Market Pantry All-
Purpose Flour - 5 lbs�
ÖüÔÝ�
��������
�
�#//+þ!!222ü/�-" /ü�*(!+!(�-& /�
+�)/-4��''�+0-+*. �
!'*0-��'�.!�!��
Õ×ØÛØÛÜ×��
������
������