FSCrawler OCR question

neergttocsdivad · September 23, 2019, 10:22am

  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false

If I enable OCR in _settings.yaml, will all files be OCR'd, even those that already contain indexable text, or only those that currently have no indexable text?

Also, does "follow_symlinks" mean that URLs will be hyperlinked and made clickable?

dadoonet · September 25, 2019, 11:19am

It will extract both the regular text and the text found in images.
This test shows it:

https://github.com/dadoonet/fscrawler/blob/ff0310ba3a6b39abe0a580a3ba23c225350a1a67/tika/src/test/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaDocParserTest.java#L665-L667

doc = extractFromFile("test-ocr.pdf");
assertThat(doc.getContent(), containsString("This file contains some words."));
assertThat(doc.getContent(), containsString("This file also contains text."));

It reads a PDF document which has an image plus some text. Both are extracted.
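
For reference, here is a minimal sketch of how the fragment from the question might look with OCR switched on. It assumes the fragment sits under the `fs` key, as in FSCrawler's documented settings layout. The job name and `url` are placeholders, not values from this thread.

```yaml
# Hypothetical _settings.yaml sketch; "my_job" and the url are placeholders.
name: "my_job"
fs:
  url: "/path/to/docs"
  # Controls whether the crawler follows filesystem symbolic links.
  # It is not related to hyperlinks in the indexed content.
  follow_symlinks: false
  ocr:
    enabled: true
    language: "eng"
    # With this strategy, PDFs get OCR on their images and their
    # embedded text is extracted as well, as the test above shows.
    pdf_strategy: "ocr_and_text"
```

If OCR on PDFs that already carry text is not wanted, the FSCrawler documentation lists other `pdf_strategy` values (such as `no_ocr` and `ocr_only`). Which ones are available depends on the version, so check the docs for the release in use.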

