Hey,
I was trying to do some text extraction/OCR stuff with a bunch of PDFs, DOCX, and images. I stumbled across fscrawler and it's been working pretty nicely.
Here's what my setup looks like.
This is the fscrawler service in my docker-compose. I have elasticsearch and kibana running as services through the same compose file.
# fscrawler
fscrawler:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler
  ports:
    - 8080:8080
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs:/usr/share/fscrawler/logs
    - ./Files/:/tmp/es:ro
  command: fscrawler case_1 --rest
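For context, the elasticsearch service in the same compose file looks roughly like this (simplified, kibana omitted, version tag is just an example; the healthcheck is what the service_healthy condition above waits on):

elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0   # version is just an example
  container_name: elasticsearch
  environment:
    - discovery.type=single-node
    - ELASTIC_PASSWORD=changeme
    - xpack.security.http.ssl.enabled=false   # plain http, matching the job settings below
  ports:
    - 9200:9200
  healthcheck:
    # fscrawler's depends_on waits for this to report healthy
    test: ["CMD-SHELL", "curl -s http://localhost:9200 >/dev/null || exit 1"]
    interval: 10s
    timeout: 5s
    retries: 30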
The Files folder contains several subfolders, each of which I want to index into a different Elasticsearch index.
The fscrawler_config folder contains one _settings.yaml per fscrawler job, one job for each index.
fscrawler_config
├───case_1
│   └───_settings.yaml
└───case_2
    └───_settings.yaml
This is what one of the _settings.yaml files looks like; the fs.url and elasticsearch.index values differ per job, based on the folder and index I want.
name: "case_1"
fs:
  url: "/tmp/es/Case_1/"
  update_rate: "30s"
  index_folders: true
  index_content: true
  indexed_chars: -1
  lang_detect: true
  continue_on_error: false
  attributes_support: true
  raw_metadata: true
  filename_as_id: false
  json_support: false
  xml_support: false
  store_source: false
  add_as_inner_object: false
  checksum: "SHA-1"
  ocr:
    enabled: true
    language: "eng"
    pdf_strategy: "ocr_and_text"
elasticsearch:
  index: "case_1"
  nodes:
    - url: "http://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080/fscrawler"
  enable_cors: true
Now, how do I run multiple jobs at once using docker-compose up -d?
I tried feeding multiple job names into the command, like command: fscrawler case_1 case_2 --rest, but that just runs the first job.
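The only workaround I've come up with so far is to duplicate the service once per job, roughly like this (untested sketch; I'd rather avoid it if there's a built-in way):

# one fscrawler container per job?
fscrawler_case_2:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler_case_2
  ports:
    - 8081:8080            # different host port so both REST endpoints stay reachable
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs_case_2:/usr/share/fscrawler/logs   # separate log dir to avoid clashes
    - ./Files/:/tmp/es:ro
  command: fscrawler case_2 --rest

(I assume case_2's rest.url would then also have to point at fscrawler_case_2 instead of fscrawler, but I haven't verified that.)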
Is there any way to make fscrawler watch for config changes, so that I can dynamically create new _settings.yaml files and have it start new jobs on the fly?
Or is the recommended way here to just run fscrawler as a REST service and then manually invoke the _document API, specifying the index name for each file I want it to process?
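If that REST route is the way to go, I imagine each upload would look roughly like this, with a placeholder file path (untested; how or whether the target index can be chosen per request is exactly the part I'm unsure about):

# upload one file to the running fscrawler REST service (case_1 job, mapped to host port 8080)
curl -F "file=@./Files/Case_1/some_scan.pdf" "http://localhost:8080/fscrawler/_document"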