This n00b is thoroughly confused.
I have several quite large PDF files that I want to be able to full-text search. They consist of raw text and images containing text.
So I want to index the text in the pdf, OCR the images and index the result as well.
To this end I have set up an Ubuntu box with Elasticsearch and Kibana, and installed FSCrawler.
I get the JSON result from Elasticsearch, and the Kibana website is functioning without issues.
I have put all the PDFs in a folder and created an FSCrawler job to index them.
The job runs, but it appears to be doing nothing at all. It just sits there.
In Kibana I can see an index, but it appears to be empty.
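For anyone reproducing this, the "index appears to be empty" observation can be confirmed outside Kibana with the Elasticsearch `_count` endpoint. The following is only a minimal sketch: the helper names (`count_url`, `parse_count`, `fetch_count`) are made up for illustration, and the node URL and index names are assumed to be the ones from the job settings and log in this post.

```python
import json
import urllib.request


def count_url(base_url: str, index: str) -> str:
    """Build the URL of the _count endpoint for one index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def parse_count(body: str) -> int:
    """Extract the document count from a _count response body."""
    return int(json.loads(body)["count"])


def fetch_count(base_url: str, index: str, timeout: float = 5.0) -> int:
    """Ask the node how many documents the index currently holds.

    Assumes the node is reachable without authentication, as the
    plain http:// URL in the job settings suggests.
    """
    with urllib.request.urlopen(count_url(base_url, index), timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Calling `fetch_count("http://192.168.1.126:8000", "resumes")` and the same for `"resumes_folder"` would show whether either index has received any documents; a result of 0 for both would match what Kibana shows.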
Here is the FSCrawler job:
---
name: "resumes"
fs:
  url: "/root/data/"
  update_rate: "1m"
  includes:
  - "*/*.pdf"
  json_support: true
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: true
  store_source: true
  index_content: true
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: true
elasticsearch:
  nodes:
  - url: "http://192.168.1.126:8000"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false
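
Since the job points at `/root/data/`, one basic thing worth ruling out is whether the account running the crawler can see and read the PDFs there at all. A minimal sketch of such a check follows; `find_pdfs` is a hypothetical helper, not part of FSCrawler, and it deliberately does not try to reproduce how FSCrawler evaluates the `includes` pattern.

```python
import os


def find_pdfs(root: str) -> list[tuple[str, bool]]:
    """Return (path, readable) for every .pdf file below root, sorted by path.

    os.walk silently skips directories it cannot list, so an empty
    result can also mean that root itself is missing or not accessible
    to the current user.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                # os.access checks read permission for the current user.
                found.append((path, os.access(path, os.R_OK)))
    return sorted(found)
```

Run as the same user that starts the crawler, `find_pdfs("/root/data/")` returning an empty list, or entries flagged as unreadable, would point at a path or permission problem rather than at the crawler settings.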
This is what I get when the job runs. After some time, it just prints more of the same:
[f.console] (FSCrawler ASCII-art startup banner, version 2.10-SNAPSHOT)
You know, for Files!
Made from France with Love
Source: https://github.com/dadoonet/fscrawler/
Documentation: https://fscrawler.readthedocs.io/
15:37:07,840 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [34.3mb/949.3mb=3.62%], RAM [194mb/3.8gb=4.94%], Swap [0b/0b=0.0].
15:37:09,229 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:37:09,234 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:37:11,898 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.0
15:37:12,380 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.0
15:37:13,156 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [resumes] for [/root/data/] every [1m]
15:37:13,318 WARN [o.e.c.RestClient] request [POST http://192.168.1.126:8000/resumes/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.0-bee86328705acaa9a6daede7140defd4d9ec56bd "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
15:37:13,365 WARN [o.e.c.RestClient] request [POST http://192.168.1.126:8000/resumes_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.0-bee86328705acaa9a6daede7140defd4d9ec56bd "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
I cannot seem to get more verbose output. I would have expected some output like:
"found file input.pdf"
"scanning and indexing text"
"found image, applying ocr"
....
or something like that. What am I doing wrong?
I used the latest versions of everything.