FScrawler not parsing jpg in PDF

Hi @dadoonet,
I'm using FSCrawler to index my documents into ES, and I find that it can't parse JPG images embedded in a PDF even though pdf_strategy: "ocr_and_text" has been set. My OCR settings are listed below:

  ocr:
    language: "eng+chi_sim"
    enabled: true
    path: "F:\\Tesseract-OCR"
    data_path: "F:\\Tesseract-OCR\\tessdata"
    output_type: "txt"
    pdf_strategy: "ocr_and_text"

I have tried parsing and indexing a standalone JPG file and it works fine, which indicates that the OCR function is enabled.
I also noticed a warning message when I run fscrawler from the command line:

10:43:14,998 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Could this warning be responsible for the problem?

Any help will be appreciated.

Welcome!

Did you see https://fscrawler.readthedocs.io/en/latest/index.html#incompatible-3rd-party-library-licenses?

But that may not be the problem.
Which exact version did you download?

ES version: 7.3.0
FScrawler: fscrawler-es7-2.7
OS: Win 10

I saw the doc you mentioned and gave it a try.
It seems that the link to the JPEG2000 support API in the doc has some problems. I downloaded it from here and added it to lib; however, it made no difference and the same warning still showed up.

Do you know which exact build you downloaded?

fscrawler-es7-2.7-20200224.065405-79.zip

Thanks.
Could you share your pdf document so I can test it locally?

Here is one of my sample PDFs

I removed all the pages but this one to run a simple test:

You can see some text in the image.

Here's what has been extracted:

4 > 阿里巴巴云原生实践 15 讲

PaaS 平台的运维流程,给 PaaS 带来更强的面向终态的自动化能力。最后把运行

环境等传统重模式改成原生容器与 pod 的轻量模式,同时将 PaaS 能力完全移交给

Kubernetes controller,从而形成一个完全云原生的架构体系。

如何解决云原生的关键难点

阿里巴巴云原生的探索,起步于自研容器和调度系统,到如今拥抱开源的标准化

技术。对于当下开发者的建议是:如果想构建云原生架构,建议直接从 Kubernetes

入手即可。一方面,Kubernetes 为平台建设者而生,已经成为云原生生态的中流

砥柱,它不仅向下屏蔽了底层细节,而且向上支撑各种周边业务生态;另一方面,更

重要的是社区中有着越来越多围绕 Kubernetes 构建的开源项目,比如 Service 

Mesh、Kubeflow。

那么作为过来人,阿里有哪些“避坑指南”呢?

 

云原生技术架构演进中最为艰难的挑战,其实来自于 Kubernetes 本身的管

理。因为 Kubernetes 相对年轻,其自身的运维管理系统生态尚不完善。对于阿里

而言,数以万计的集群管理至关重要,我们探索并总结了四个方法:Kubernetes on 

Kubernetes,利用 K8s 来管理 K8s 自身;节点发布回滚策略,按规则要求灰度发

4 > WBBEARELEE 15 it

PaaS FAia#ive, 4 PaaS PRERVMALANA MH. BIST
VRS CREAM RESES pod HEMRSs, MINK PaaS RNREBRE
Kubernetes controller, Mmi#ak—*SEEBRENRMAR

WTR REA

WMSECEZREWRR, KOTFRRARIBERS, BMRA RAE
RA. WFAPA RENE: MRM ARESM, IMEI Kubernetes
AFH. —AH, Kubernetess AFRBRAME, CPBKABDRELSH PR
IME, CMMATRRT READ, MAALUSSHAOWSES; SAA, B
BEH2HK ps AwM RMS BS Kubernetes PLHNARMA, tee Service
Mesh, Kubeflow.

BAEAIRA, BAM “EER” 1?

LE Es: oss LE eo ba
ES Sura ca

pet se ea CdSe
Ea

Cloud native works for Alibaba,

PaO CR lca Cle pele

Sd
CNCF TOC

 

BRERARGBH PR ARN, ASK AF Kubernetes ASHE
32, AA Kubernetes BUFR, RBSHBRSBRRESSTRS. WIS
TS, MUAH RHSBSAEE, HRAH SAT OK: Kubernetes on
Kubernetes, #/Fa K8s KBE K8s AS; TRARAR RAM, RMUBKKER


	lRubZ
	pPhhR
	_GoBack
	_GoBack
	_GoBack
	_GoBack
	_GoBack
	_GoBack
	_GoBack
	be3d7aa1
	b317506b
	2cab1bdc

As you can see, "Cloud native works for Alibaba" has been extracted.
Maybe not all of the text, but at least some parts of it.

Could you share a smaller example that actually does not work at all?

Maybe you are also hitting the default indexed_chars limit? It's 100000 by default. See https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#extracted-characters
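
To rule that limit out, something like the following could be tried. This is only a minimal sketch: the job name "my_job" is hypothetical, the OCR values are copied from your post, and every other setting (fs.url, elasticsearch, and so on) is left out and should stay as you already have it.

```yaml
# _settings.yaml for a hypothetical job named "my_job"
name: "my_job"
fs:
  # "-1" asks FSCrawler to extract the full content instead of
  # stopping at the default character limit
  indexed_chars: "-1"
  ocr:
    language: "eng+chi_sim"
    pdf_strategy: "ocr_and_text"
```

If the missing text shows up after reindexing with this setting, the limit was the cause; if not, the problem is more likely in the OCR step itself.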
