Trying to run multiple "jobs" in fscrawler (docker)

Hey,
I was trying to do some text extraction/OCR stuff with a bunch of PDFs, DOCX, and images. I stumbled across fscrawler and it's been working pretty nicely.

Here's what my setup looks like.

This is the fscrawler service in my docker-compose file. I have Elasticsearch and Kibana running as services in the same compose file.

# fscrawler
fscrawler:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler
  ports:
    - 8080:8080
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs:/usr/share/fscrawler/logs
    - ./Files/:/tmp/es:ro
  command: fscrawler case_1 --rest

The "Files" folder contains several subfolders, each of which I want to index into a different Elasticsearch index.
The fscrawler_config folder contains a separate _settings.yaml for each fscrawler job, one job per index.


fscrawler_config
├── case_1
│   └── _settings.yaml
└── case_2
    └── _settings.yaml

This is what one of the _settings.yaml files looks like. The fs.url and elasticsearch.index values differ for each job, based on the folder and index I want.

name: "case_1"
fs:
  url: "/tmp/es/Case_1/"
  update_rate: "30s"
  index_folders: true
  index_content: true
  indexed_chars: -1
  lang_detect: true
  continue_on_error: false
  attributes_support: true
  raw_metadata: true
  filename_as_id: false
  json_support: false
  xml_support: false
  store_source: false
  add_as_inner_object: false
  checksum: "SHA-1"
  ocr:
    enabled: true
    language: "eng"
    pdf_strategy: "ocr_and_text"
elasticsearch:
  index: "case_1"
  nodes:
    - url: "http://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080/fscrawler"
  enable_cors: true

Now, how do I run multiple jobs at once with docker-compose up -d?
I tried passing multiple job names in the command, like command: fscrawler case_1 case_2 --rest, but that only runs the first job.
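One workaround I can think of is simply defining one fscrawler service per job in the compose file, roughly like the sketch below. The fscrawler_case_1 / fscrawler_case_2 service names and the 8081 host port are my own picks, and I'd also have to point each job's rest.url at its own service instead of "fscrawler":

# one fscrawler container per job (sketch)
fscrawler_case_1:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler_case_1
  ports:
    - 8080:8080
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs:/usr/share/fscrawler/logs
    - ./Files/:/tmp/es:ro
  command: fscrawler case_1 --rest

fscrawler_case_2:
  image: dadoonet/fscrawler:latest
  container_name: fscrawler_case_2
  ports:
    - 8081:8080   # different host port so both REST endpoints stay reachable
  depends_on:
    elasticsearch:
      condition: service_healthy
  restart: always
  volumes:
    - ./fscrawler_config:/root/.fscrawler
    - ./fscrawler_logs:/usr/share/fscrawler/logs
    - ./Files/:/tmp/es:ro
  command: fscrawler case_2 --rest

That would work, but it means editing the compose file for every new case, which is what I was hoping to avoid.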
Alternatively, is there a way to make fscrawler watch for config changes, so that I can create new _settings.yaml files on the fly and have it pick them up as new jobs?

Or is the recommended way to do what I want here to just run fscrawler as a REST service, and then manually invoke the _document API with the index name specified for each file I want it to process?
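If that's the route, I'm picturing a per-file call roughly like this, based on my rest settings above. I'm assuming here that the _document endpoint accepts an index parameter to pick the target index, and report.pdf is just a placeholder file name:

# push one file to the fscrawler REST service, targeting the case_2 index
curl -F "file=@./Files/Case_2/report.pdf" \
  "http://localhost:8080/fscrawler/_document?index=case_2"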
