Some PDFs can't be indexed


(Pasqualini Thierry) #1

Hello,
With some PDFs, indexing via FSCrawler produces an empty content field (see below).

Any idea?
Thanks.

{
  "_index": "lesdocsmv",
  "_type": "_doc",
  "_id": "b3d35554-eb3a-4bea-947a-b998ebf4f387",
  "_version": 1,
  "_score": 1,
  "_source": {
    "content": " ",
    "meta": {
      "format": "application/pdf; version=1.4",
      "creator_tool": "Canon iR-ADV C5235 ",
      "created": "2018-09-06T07:11:55.000+0000",
      "raw": {
        "pdf:PDFVersion": "1.4",
        "xmp:CreatorTool": "Canon iR-ADV C5235 ",
        "access_permission:modify_annotations": "true",
        "access_permission:can_print_degraded": "true",
        "dcterms:created": "2018-09-06T07:11:55Z",
        "dc:format": "application/pdf; version=1.4",
        "xmpMM:DocumentID": "uuid:42d3905b-0000-8887-177f-b25700000000",
        "pdf:docinfo:creator_tool": "Canon iR-ADV C5235 ",

(David Pilato) #2

Could you share your PDF document? You can DM it to me.


(David Pilato) #4

@Thierry_Pasqualini So the document is an image and does not contain text.
The only way to extract text from it is to configure OCR.

Have a look at https://fscrawler.readthedocs.io/en/fscrawler-2.5/user/tips.html?highlight=ocr#ocr-integration
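
If it helps to find out which other documents are affected before setting up OCR, here is a minimal Python sketch that flags search hits shaped like the one posted above: a PDF format with a blank content field. The helper name `needs_ocr` and the trimmed sample hit are made up for illustration and are not part of FSCrawler. It only assumes that hits carry `_source.content` and `_source.meta.format` as in the output above.

```python
from typing import Any, Dict


def needs_ocr(hit: Dict[str, Any]) -> bool:
    """Return True when a search hit looks like a PDF with no extracted text.

    Hypothetical helper: it only inspects the fields visible in the hit
    posted in this thread (_source.content and _source.meta.format).
    """
    source = hit.get("_source") or {}
    content = source.get("content") or ""
    fmt = (source.get("meta") or {}).get("format") or ""
    # A scanned PDF is indexed with whitespace-only content.
    return fmt.startswith("application/pdf") and not content.strip()


# Trimmed-down hit modeled on the one posted above.
scanned = {
    "_id": "b3d35554-eb3a-4bea-947a-b998ebf4f387",
    "_source": {
        "content": " ",
        "meta": {"format": "application/pdf; version=1.4"},
    },
}

print(needs_ocr(scanned))  # True
```

Running this over the hits returned by a search on the index gives a list of files worth re-indexing once OCR is configured.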

HTH

