Hey,
I was trying to do some text extraction/OCR stuff with a bunch of PDFs, DOCX, and images. I stumbled across fscrawler and it's been working pretty nicely.
Here's what my setup looks like.
This is the fscrawler service in my docker-compose. I have elasticsearch and kibana running as services through the same compose file.
# fscrawler
fscrawler:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler
  ports:
    - 8080:8080
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs:/usr/share/fscrawler/logs
    - ./Files/:/tmp/es:ro
  command: fscrawler case_1 --rest
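For context, the elasticsearch service in the same compose file looks roughly like this (simplified, kibana omitted, version tag is just an example; the healthcheck is what the service_healthy condition above waits on):

elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0   # version is just an example
  container_name: elasticsearch
  environment:
    - discovery.type=single-node
    - ELASTIC_PASSWORD=changeme
    - xpack.security.http.ssl.enabled=false   # plain http, matching the job settings below
  ports:
    - 9200:9200
  healthcheck:
    # fscrawler's depends_on waits for this to report healthy
    test: ["CMD-SHELL", "curl -s http://localhost:9200 >/dev/null || exit 1"]
    interval: 10s
    timeout: 5s
    retries: 30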
The Files folder contains several subfolders, each of which I want to index into a different Elasticsearch index.
The fscrawler_config folder contains one _settings.yaml per fscrawler job, one job for each index.
fscrawler_config
├───case_1
│   └───_settings.yaml
└───case_2
    └───_settings.yaml
This is what one of the _settings.yaml files looks like; the fs.url and elasticsearch.index values differ per job, based on the folder and index I want.
name: "case_1"
fs:
  url: "/tmp/es/Case_1/"
  update_rate: "30s"
  index_folders: true
  index_content: true
  indexed_chars: -1
  lang_detect: true
  continue_on_error: false
  attributes_support: true
  raw_metadata: true
  filename_as_id: false
  json_support: false
  xml_support: false
  store_source: false
  add_as_inner_object: false
  checksum: "SHA-1"
  ocr:
    enabled: true
    language: "eng"
    pdf_strategy: "ocr_and_text"
elasticsearch:
  index: "case_1"
  nodes:
    - url: "http://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080/fscrawler"
  enable_cors: true
Now, how do I run multiple jobs at once using docker-compose up -d?
I tried feeding multiple job names into the command, like command: fscrawler case_1 case_2 --rest, but that just runs the first job.
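The only workaround I've come up with so far is to duplicate the service once per job, roughly like this (untested sketch; I'd rather avoid it if there's a built-in way):

# one fscrawler container per job?
fscrawler_case_2:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler_case_2
  ports:
    - 8081:8080            # different host port so both REST endpoints stay reachable
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs_case_2:/usr/share/fscrawler/logs   # separate log dir to avoid clashes
    - ./Files/:/tmp/es:ro
  command: fscrawler case_2 --rest

(I assume case_2's rest.url would then also have to point at fscrawler_case_2 instead of fscrawler, but I haven't verified that.)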
Is there any way to make fscrawler watch for config changes, so that I can dynamically create new _settings.yaml files and have it start new jobs on the fly?
Or is the recommended way here to just run fscrawler as a REST service and then manually invoke the _document API, specifying the index name for each file I want it to process?
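If that REST route is the way to go, I imagine each upload would look roughly like this, with a placeholder file path (untested; how or whether the target index can be chosen per request is exactly the part I'm unsure about):

# upload one file to the running fscrawler REST service (case_1 job, mapped to host port 8080)
curl -F "file=@./Files/Case_1/some_scan.pdf" "http://localhost:8080/fscrawler/_document"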