Hello
I guess @dadoonet would know best about this, but if someone else can answer, that would be great.
I am currently trying to index PDFs into Elasticsearch. I installed the 'Ingest Attachment Processor Plugin' and downloaded fscrawler.zip. I unpacked it, ran bin/fscrawler testjob, set the url in the generated _settings.yaml to the directory containing my PDFs, and restarted the job.
This was the output:
10:48:02,119 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [235.9mb/3.8gb=6.01%], RAM [223.6mb/15.3gb=1.43%], Swap [940.8mb/1.9gb=45.94%].
10:48:02,316 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:48:02,316 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
10:48:02,662 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 7.17.3
10:48:02,706 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 7.17.3
10:48:05,030 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [testjob] for [/home/administrator/Downloads/Pdfs] every [15m]
10:48:05,344 INFO [f.p.e.c.f.t.TikaInstance] OCR is disabled.
10:48:06,529 WARN [o.a.p.p.f.FileSystemFontProvider] New fonts found, font cache will be re-built
10:48:06,529 WARN [o.a.p.p.f.FileSystemFontProvider] Building on-disk font cache, this may take a while
10:48:11,649 WARN [o.a.p.p.f.FileSystemFontProvider] Finished building on-disk font cache, found 288 fonts
10:48:11,776 WARN [o.a.p.p.f.PDType1Font] Using fallback font LiberationSans for base font Symbol
10:48:11,777 WARN [o.a.p.p.f.PDType1Font] Using fallback font LiberationSans for base font ZapfDingbats
The documents were indexed, so far so good: I can see all 3 PDFs in the index, along with most of their content. However, I ran into a problem: for one PDF, only 59 of its 2435 pages were indexed (I used a PDF of the Bible just for testing).
I don't know what the limiting factor is. Is Elasticsearch only allowing a certain number of characters, or do I have to change some FSCrawler setting?
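In case it helps narrow this down, here is a minimal sketch for checking how many characters actually ended up in each indexed document. It is only a diagnostic idea, not something taken from the FSCrawler documentation. The assumptions are: the node is reachable at http://localhost:9200 without authentication, the index is named after the job (testjob), the extracted text is stored in the `content` field, and the file name is stored in `file.filename`. The function names are made up for this example.

```python
import json
import urllib.request


def content_lengths(search_response):
    """Map each indexed file name to the character count of its `content` field."""
    lengths = {}
    for hit in search_response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        # Fall back to the document id if no file name is stored (assumed layout).
        name = source.get("file", {}).get("filename", hit.get("_id"))
        lengths[name] = len(source.get("content") or "")
    return lengths


def fetch_search_response(base_url="http://localhost:9200", index="testjob", size=10):
    """Fetch up to `size` documents, returning only the fields needed here.

    base_url and index are assumptions; adjust them to the actual setup.
    """
    url = f"{base_url}/{index}/_search?size={size}&_source=content,file.filename"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Running `content_lengths(fetch_search_response())` against the live node should give one character count per file. If the large PDF stops at a suspiciously round number, that would point to a configured character limit rather than a parsing failure.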
Thanks for any help