Attachment plugin doesn't index the full text of PDFs with non-embedded fonts


(Arkadyzalko) #1

Hi,

I'm having problems indexing some PDF files.

Searching works fine for PDFs with embedded fonts, but if the fonts are not embedded, Elasticsearch is not able to index the full text.

I'm using this file:
https://drive.google.com/file/d/0B9fkTrpP5kPVcHZPOU1iMDktRDA/view?usp=sharing

My test shell script:

host=localhost:9200
curl -X PUT "${host}/test/document/_mapping" -d '{
    "document": {
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {
                    "title": { "store": "yes" },
                    "file": { "term_vector": "with_positions_offsets", "store": "yes" }
                }
            }
        }
    }
}'
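# Slurp the whole file (-0777) and encode it as one unbroken base64 string,
# instead of encoding each newline-delimited chunk of the binary separately.
coded=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' not_found.pdf)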
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
curl -X POST "${host}/test/document/" -d @json.file
echo
curl -XPOST "${host}/_refresh"
curl "${host}/_search?pretty=true" -d '{
    "fields" : ["title"],
    "query" : {
        "query_string" : {
            "query" : "butter"
        }
    },
    "highlight" : {
        "fields" : {
            "file" : {}
        }
    }
}'

What did I do wrong?

Thanks.


(David Pilato) #2

That may be an issue with Apache Tika, which Elasticsearch uses behind the scenes.

If you are a Java developer, there is a small "main" application you can use to check what Tika is actually doing behind the scenes.

See https://github.com/elastic/elasticsearch-mapper-attachments/blob/master/src/test/java/org/elasticsearch/index/mapper/attachment/test/standalone/StandaloneRunner.java

I use this one often...


(Arkadyzalko) #3

I've changed my mapping to this:

"document": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "title": { "store": "yes" },
          "file": { 
            "term_vector": "with_positions_offsets", 
            "store": "yes", 
            "type": "string" 
          }
        }
      }
    }
  }

I found the "\ufffd" character in many parts of my PDF. I think it is an encoding problem. Do you know if I can fix that?

Font information below:

arkady@aevo:~/$ pdffonts not_found.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Arial                                CID TrueType      Identity-H       yes no  yes     12  0
[none]                               Type 3            Custom           yes no  yes     17  0
[none]                               Type 3            Custom           yes no  yes    274  0
Source Sans Pro                      CID TrueType      Identity-H       yes no  yes    531  0
[none]                               Type 3            Custom           yes no  yes    536  0
Source Sans Pro                      CID TrueType      Identity-H       yes no  yes    793  0
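
As a quick check of the embedding status, the pdffonts table can be filtered for fonts whose "emb" column is "no". This is only a sketch: it assumes the column layout shown above (the last five fields are emb, sub, uni, object number, generation), other pdffonts versions may print different columns, and the function name non_embedded_fonts is made up for illustration.

```shell
#!/bin/sh
# List fonts that pdffonts reports as not embedded (emb column = "no").
# Reads raw pdffonts output on stdin. Hypothetical helper, not part of any tool.
non_embedded_fonts() {
    # Font names and types can contain spaces ("Source Sans Pro", "Type 3"),
    # so the emb column is located by counting fields from the end of the line:
    #   ... emb sub uni object-number generation
    # NR > 2 skips the header line and the dashed separator line.
    awk 'NR > 2 && NF >= 5 && $(NF-4) == "no"'
}
```

It would be used as: pdffonts not_found.pdf | non_embedded_fonts. For the listing above every font has emb = yes, so for this particular file the filter would print nothing.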

(Arkadyzalko) #4

When I run the file directly through Tika, I get the plain text below. Can I fix the encoding used by Tika?

9/25/2015 Best Chocolate Chip Cookies ­ Printer Friendly ­ Allrecipes.com

http://allrecipes.com/recipe/10813/best­chocolate­chip­cookies/print/ 1/1
�
�- +
ÖÔ�(
�**&
ÕÔ�(
� ��4�
)
Õ�#
� ./��#*�*'�/ ��#$+��**&$ .
Recipe By: �*-�
��-$.+� �" .ý��# 24�($��' .ü�
)"- �$ )/.
Õ��0+��0// -ý�.*!/ ) �
��0+�2#$/ �.0"�-
��0+�+��& ���-*2)�.0"�-
Ö� "".
�/ �.+**).�1�)$''�� 3/-��/
��0+.��''�+0-+*. �!'*0-
��/ �.+**)���&$)"�.*��
Ö�/ �.+**).�#*/�2�/ -
Õ!Ö�/ �.+**)�.�'/
��0+.�. ($.2  /��#*�*'�/ ��#$+.
��0+��#*++ ��2�')0/.
�$- �/$*).
�- # �/�*1 )�/*�×ÙÔ�� "-  .����ÕÛÙ�� "-  .���ü
�- �(�/*" /# -�/# ��0// -ý�2#$/ �.0"�-ý��)���-*2)�.0"�-�0)/$'�.(**/#ü�� �/�$)�/# � "".�*) ��/���/$( ý
/# )�./$-�$)�/# �1�)$''�ü��$..*'1 ���&$)"�.*���$)�#*/�2�/ -ü�����/*���// -�'*)"�2$/#�.�'/ü��/$-�$)�!'*0-ý
�#*�*'�/ ��#$+.ý��)��)0/.ü��-*+��4�'�-" �.+**)!0'.�*)/*�0)"- �. ��+�).ü
��& �!*-���*0/�ÕÔ�($)0/ .�$)�/# �+- # �/ ��*1 )ý�*-�0)/$'� �" .��- �)$� '4��-*2) �ü
�
��
�	������������*�ÖÔÕÙ��''- �$+ .ü�*(�
�-$)/ ���-*(��''- �$+ .ü�*(�Ý!ÖÙ!ÖÔÕÙ
Õ
Ö
×
��-" /
ÝÔØÔ��*' -�$)��1 
�
��
����
ý��	�ØÙÖÙÕ
�+*).*- �
Market Pantry All-
Purpose Flour - 5 lbs�
ÖüÔÝ�
��������
�
�#//+þ!!222ü/�-" /ü�*(!+!(�-& /�
+�)/-4��''�+0-+*. �
!'*0-��'�.!�!��
Õ×ØÛØÛÜ×��
������
������
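
To put a number on how much of an extraction is damaged, one option is to count the U+FFFD replacement characters in the extracted text. The sketch below assumes the text is UTF-8 and arrives on stdin; the function name count_replacement_chars is made up for illustration.

```shell
#!/bin/sh
# Count U+FFFD replacement characters in UTF-8 text read from stdin.
# Hypothetical helper for comparing extractions, not part of Tika or Elasticsearch.
count_replacement_chars() {
    # U+FFFD is encoded in UTF-8 as the bytes EF BF BD (octal 357 277 275).
    fffd=$(printf '\357\277\275')
    # -a: treat the input as text even if it looks binary
    # -o: print each match on its own line, so wc -l counts occurrences
    # -F: match the bytes as a fixed string, not a regular expression
    LC_ALL=C grep -a -o -F "$fffd" | wc -l | tr -d ' '
}
```

Assuming a Tika app jar is available locally (the file name tika-app.jar here is a placeholder), the plain-text output could be piped in like this: java -jar tika-app.jar --text not_found.pdf | count_replacement_chars. Comparing the count for a PDF that indexes well against one that does not gives a rough measure of the problem.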
