Open Crawler - Scheduled Crawl

Hi,

edit: crawler version 0.2.2

I tried to run a crawl based on a schedule, but I don't think the crawl is starting. It may be something I'm missing, but I wanted to check if anyone else has run into this.

If I run a normal, ad hoc crawl, things go fine:
docker exec -it crawler bin/crawler crawl config/crawler.yml

[crawl:68090789e16dda75c7e6831c] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[crawl:68090789e16dda75c7e6831c] [primary] ES connections will be authorized with configured API key
[crawl:68090789e16dda75c7e6831c] [primary] ES connections will use SSL with ca_fingerprint
[crawl:68090789e16dda75c7e6831c] [primary] Connected to ES at https://elastic:9200 - version: 8.17.2; build flavor: default
[crawl:68090789e16dda75c7e6831c] [primary] Index [contoso-search-testindex] was found!
[crawl:68090789e16dda75c7e6831c] [primary] Elasticsearch sink initialized for index [contoso-search-testindex] with pipeline [contoso-search-testindex@custom]
[crawl:68090789e16dda75c7e6831c] [primary] Starting the primary crawl with up to 1 parallel thread(s)...
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=2, pages_visited=1, urls_allowed=3, urls_denied={}, crawl_duration_msec=3216, crawling_time_msec=1967.0, avg_response_time_msec=1967.0, active_threads=1, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>1}
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=11, pages_visited=6, urls_allowed=16, urls_denied={:already_seen=>32, :domain_filter_denied=>10, :incorrect_protocol=>6}, crawl_duration_msec=13577, crawling_time_msec=3230.0, avg_response_time_msec=538.3333333333334, active_threads=0, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>6}
...
[crawl:68090789e16dda75c7e6831c] [primary] Crawl queue is empty, finishing the primary crawl
[crawl:68090789e16dda75c7e6831c] [primary] Sending bulk request with 39 items and resetting queue...
[crawl:68090789e16dda75c7e6831c] [primary] Successfully indexed 39 docs.
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl stage. Result: success; Successfully finished the primary crawl with an empty crawl queue
[crawl:68090789e16dda75c7e6831c] [primary] No documents were found for the purge crawl. Skipping purge crawl.
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl stage. Result: success; Skipped purge crawl as no outdated documents were found.
[crawl:68090789e16dda75c7e6831c] [primary] Closing the output sink before finishing the crawl...
[crawl:68090789e16dda75c7e6831c] [primary] All indexing operations completed. Successfully upserted 39 docs with a volume of 139808 bytes. Failed to index 0 docs with a volume of 0 bytes. Deleted 0 outdated docs from the index.
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=0, pages_visited=220, urls_allowed=219, urls_denied={:already_seen=>673, :domain_filter_denied=>136, :incorrect_protocol=>78, :nofollow=>102}, crawl_duration_msec=273283, crawling_time_msec=32698.0, avg_response_time_msec=148.62727272727273, active_threads=0, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>219, "404"=>1}
[crawl:68090789e16dda75c7e6831c] [primary] Crawl shutdown complete
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl. Result: success; Successfully finished the primary crawl with an empty crawl queue | Skipped purge crawl as no outdated documents were found.

If I try to run the crawl as a scheduled crawl, it seems to start its idle instance, waiting for the scheduled time to trigger the crawl:
docker exec -it crawler bin/crawler schedule config/crawler.yml

[crawl:68092445e16dda994aa81779] [primary] Crawler initialized with a cron schedule of 40 12 * * *

But it just sits there and doesn't print any progress. If I query the index after ~10-20 minutes, it remains empty. This is a very small crawl of about 40 docs and usually finishes within about 5 minutes.

Hi @alongaks

Looking at the log output, it seems 40 12 * * * is your configured cron schedule. Can you confirm this is intentional? That pattern means every day at exactly 12:40pm (at least, according to crontab.guru).

A quick way to test whether it's just the cron schedule is to put in a pattern that runs more frequently, like */5 * * * *, which should trigger a crawl once every 5 minutes.
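As a sketch, assuming the standard schedule block in your crawler.yml, that would look something like:

schedule:
   pattern: "*/5 * * * *"   # every 5 minutes, just for testing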

Hello, Navarone

Thank you for the response. Yes, this is intended. The schedule is/was set to run just a few moments after kicking off the crawl.

For example, with this test, I had the following in the crawler.yml file:

schedule:
   pattern: "40 12 * * *"

At the time the test began, the local time was 12:35pm EST. There are no other crawlers configured. The intention was to run the crawl like so:
docker exec -it crawler bin/crawler schedule config/crawler.yml

Which it does, and it confirms that it is reading the intended scheduled crawl:
[crawl:68092445e16dda994aa81779] [primary] Crawler initialized with a cron schedule of 40 12 * * *

... but when the system time reached 12:40pm, the crawl did not start. There was no other output on the command line. Presumably, since I am running it attached rather than backgrounded, it would print progress like any other similar crawl would.

@alongaks thanks for getting back to me.

I tried out a few different schedule patterns (including the one you provided) on my local machine and they worked fine. Then I tried again in a Docker container, and I think I reproduced the issue.

In my setup, it turned out my Docker container was running in a different timezone than my machine. You can confirm this by exec-ing into the Docker container and running date, e.g.:

$ date
> Mon Apr 28 06:10:11 UTC 2025
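
If the container is already running (named crawler, as in your earlier commands), checking from the host without opening a shell should also work:

$ docker exec -it crawler date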

I was able to get it working by passing an env variable with -e TZ=<timezone name> when running the Docker image. Interestingly, calling date within the container still showed the wrong time, but the timezone value was updated, which is what Crawler references. I used the timezone Asia/Tokyo, and now my date output shows Asia instead of UTC.

$ date
> Mon Apr 28 06:22:31 Asia 2025

Full command I used to run Crawler with the timezone env change:

docker run -it \
    -e TZ=Asia/Tokyo \
    -v ./crawl.yml:/crawl.yml \
    docker.elastic.co/integrations/crawler:latest \
    jruby bin/crawler schedule /crawl.yml

(^ this command is what's used in the current quickstart guide, so alter it as you need based on how you're running things).
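If you're running a long-lived container instead (your docker exec usage suggests one), the same change applies wherever the container is defined. Just as a sketch, assuming a Docker Compose service named crawler with the same image and mount as above:

services:
  crawler:
    image: docker.elastic.co/integrations/crawler:latest
    environment:
      - TZ=Asia/Tokyo   # swap in your own timezone name
    volumes:
      - ./crawl.yml:/crawl.yml
    command: jruby bin/crawler schedule /crawl.yml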

Let me know if that works or not.

@nfeekery thank you very much!

I forgot that all the scheduling in the current Enterprise Search crawler is UTC, but that got it going. Referencing the time (and making a note of it) as UTC will work.
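For reference, just a sketch of the conversion for my case: 12:40pm EST (UTC-5) is 17:40 UTC (16:40 UTC if the clock is actually on EDT), so keeping the container on UTC the pattern would be something like:

schedule:
   pattern: "40 17 * * *"   # 12:40pm EST written as UTC; 40 16 * * * for EDT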

Thanks again.