Some PDFs can't be indexed

Thierry_Pasqualini · September 24, 2018, 12:27pm

Hello,
With some PDFs, indexing via FSCrawler produces a document whose content field is empty (see below).

Any idea why?
Thanks.

{
  "_index": "lesdocsmv",
  "_type": "_doc",
  "_id": "b3d35554-eb3a-4bea-947a-b998ebf4f387",
  "_version": 1,
  "_score": 1,
  "_source": {
    "content": " ",
    "meta": {
      "format": "application/pdf; version=1.4",
      "creator_tool": "Canon iR-ADV C5235 ",
      "created": "2018-09-06T07:11:55.000+0000",
      "raw": {
        "pdf:PDFVersion": "1.4",
        "xmp:CreatorTool": "Canon iR-ADV C5235 ",
        "access_permission:modify_annotations": "true",
        "access_permission:can_print_degraded": "true",
        "dcterms:created": "2018-09-06T07:11:55Z",
        "dc:format": "application/pdf; version=1.4",
        "xmpMM:DocumentID": "uuid:42d3905b-0000-8887-177f-b25700000000",
        "pdf:docinfo:creator_tool": "Canon iR-ADV C5235 ",

dadoonet · September 24, 2018, 12:41pm

Could you share your PDF document? You can DM it to me.

dadoonet · September 24, 2018, 1:21pm

@Thierry_Pasqualini So the document is an image and does not contain any text, which is why the content field comes back empty.
The only way to extract text from it is to configure OCR.

Have a look at https://fscrawler.readthedocs.io/en/fscrawler-2.5/user/tips.html?highlight=ocr#ocr-integration
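
For anyone who wants to find out which already-indexed PDFs are affected, here is a minimal sketch. It is not part of FSCrawler; it only assumes each search hit is shaped like the document posted above (`_source.content` and `_source.meta.format`). The names `needs_ocr` and `sample_hit` are made up for illustration.

```python
def needs_ocr(hit: dict) -> bool:
    """Return True when a hit is a PDF whose extracted content is blank.

    Assumes the hit layout shown earlier in this thread:
    hit["_source"]["content"] and hit["_source"]["meta"]["format"].
    """
    source = hit.get("_source", {})
    content = source.get("content") or ""
    fmt = source.get("meta", {}).get("format", "")
    # A content field holding only whitespace (like " ") counts as empty.
    return fmt.startswith("application/pdf") and not content.strip()


# Hypothetical hit, trimmed down from the one posted above.
sample_hit = {
    "_index": "lesdocsmv",
    "_source": {
        "content": " ",
        "meta": {
            "format": "application/pdf; version=1.4",
            "creator_tool": "Canon iR-ADV C5235 ",
        },
    },
}

print(needs_ocr(sample_hit))
```

Running something like this over the hits of a search gives the list of files that probably need to be crawled again once OCR is set up.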

HTH

