Open Crawler - Scheduled Crawl

Hi,

edit: crawler version 0.2.2

I tried to run a crawl based on a schedule, but I don't think the crawl is starting. It may be something I'm missing, but I wanted to check if anyone else has run into this.

If I run a normal, ad hoc crawl, things go fine:
docker exec -it crawler bin/crawler crawl config/crawler.yml

[crawl:68090789e16dda75c7e6831c] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[crawl:68090789e16dda75c7e6831c] [primary] ES connections will be authorized with configured API key
[crawl:68090789e16dda75c7e6831c] [primary] ES connections will use SSL with ca_fingerprint
[crawl:68090789e16dda75c7e6831c] [primary] Connected to ES at https://elastic:9200 - version: 8.17.2; build flavor: default
[crawl:68090789e16dda75c7e6831c] [primary] Index [contoso-search-testindex] was found!
[crawl:68090789e16dda75c7e6831c] [primary] Elasticsearch sink initialized for index [contoso-search-testindex] with pipeline [contoso-search-testindex@custom]
[crawl:68090789e16dda75c7e6831c] [primary] Starting the primary crawl with up to 1 parallel thread(s)...
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=2, pages_visited=1, urls_allowed=3, urls_denied={}, crawl_duration_msec=3216, crawling_time_msec=1967.0, avg_response_time_msec=1967.0, active_threads=1, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>1}
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=11, pages_visited=6, urls_allowed=16, urls_denied={:already_seen=>32, :domain_filter_denied=>10, :incorrect_protocol=>6}, crawl_duration_msec=13577, crawling_time_msec=3230.0, avg_response_time_msec=538.3333333333334, active_threads=0, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>6}
...
[crawl:68090789e16dda75c7e6831c] [primary] Crawl queue is empty, finishing the primary crawl
[crawl:68090789e16dda75c7e6831c] [primary] Sending bulk request with 39 items and resetting queue...
[crawl:68090789e16dda75c7e6831c] [primary] Successfully indexed 39 docs.
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl stage. Result: success; Successfully finished the primary crawl with an empty crawl queue
[crawl:68090789e16dda75c7e6831c] [primary] No documents were found for the purge crawl. Skipping purge crawl.
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl stage. Result: success; Skipped purge crawl as no outdated documents were found.
[crawl:68090789e16dda75c7e6831c] [primary] Closing the output sink before finishing the crawl...
[crawl:68090789e16dda75c7e6831c] [primary] All indexing operations completed. Successfully upserted 39 docs with a volume of 139808 bytes. Failed to index 0 docs with a volume of 0 bytes. Deleted 0 outdated docs from the index.
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=0, pages_visited=220, urls_allowed=219, urls_denied={:already_seen=>673, :domain_filter_denied=>136, :incorrect_protocol=>78, :nofollow=>102}, crawl_duration_msec=273283, crawling_time_msec=32698.0, avg_response_time_msec=148.62727272727273, active_threads=0, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>219, "404"=>1}
[crawl:68090789e16dda75c7e6831c] [primary] Crawl shutdown complete
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl. Result: success; Successfully finished the primary crawl with an empty crawl queue | Skipped purge crawl as no outdated documents were found.

If I try to run the crawl as a scheduled crawl, it seems to start its idle instance, waiting for the scheduled time to trigger the crawl:
docker exec -it crawler bin/crawler schedule config/crawler.yml

[crawl:68092445e16dda994aa81779] [primary] Crawler initialized with a cron schedule of 40 12 * * *

But it just sits there and doesn't print any progress. If I query the index after ~10-20 minutes, it remains empty. This is a very small crawl of about 40 docs and usually finishes within about 5 minutes.

Hi @alongaks

Looking at the log output, it seems 40 12 * * * is your configured cron schedule. Can you confirm this is intentional? That pattern means every day at exactly 12:40pm (at least, according to crontab.guru).

A quick way to test whether it's just the cron schedule is to put in a pattern that runs more frequently, like */5 * * * *, which should trigger a crawl once every 5 minutes.
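As a sketch, assuming the standard schedule block in your crawler.yml, that would look something like:

schedule:
   pattern: "*/5 * * * *"   # every 5 minutes, just for testing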

Hello, Navarone

Thank you for the response. Yes, this is intended. The schedule is/was set to run just a few moments after kicking off the crawl.

For example, with this test, I had the following in the crawler.yml file:

schedule:
   pattern: "40 12 * * *"

At the time the test began, the local time was 12:35pm EST. There are no other crawlers configured. The intention was to run the crawl like so:
docker exec -it crawler bin/crawler schedule config/crawler.yml

Which it does, and it confirms that it is reading the intended scheduled crawl:
[crawl:68092445e16dda994aa81779] [primary] Crawler initialized with a cron schedule of 40 12 * * *

... but when the system time reached 12:40pm, the crawl did not start. There was no other output on the command line. Presumably, since I am running it attached rather than backgrounded, it would print progress like any other similar crawl would.

@alongaks thanks for getting back to me.

I tried out a few different schedule patterns (including the one you provided) on my local machine and they worked fine. Then I tried again in a Docker container, and I think I reproduced the issue.

In my setup, it turned out my Docker container was running in a different timezone than my machine. You can confirm this by exec-ing into the Docker container and running date, e.g.:

$ date
> Mon Apr 28 06:10:11 UTC 2025
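
If the container is already running (named crawler, as in your earlier commands), checking from the host without opening a shell should also work:

$ docker exec -it crawler date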

I was able to get it working by passing an env variable with -e TZ=<timezone name> when running the Docker image. Interestingly, calling date within the container still showed the wrong time, but the timezone value was updated, which is what Crawler references. I used the timezone Asia/Tokyo, and now my date output shows Asia instead of UTC.

$ date
> Mon Apr 28 06:22:31 Asia 2025

Full command I used to run Crawler with the timezone env change:

docker run -it \
    -e TZ=Asia/Tokyo \
    -v ./crawl.yml:/crawl.yml \
    docker.elastic.co/integrations/crawler:latest \
    jruby bin/crawler schedule /crawl.yml

(^ this command is what's used in the current quickstart guide, so alter it as you need based on how you're running things).
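If you're running a long-lived container instead (your docker exec usage suggests one), the same change applies wherever the container is defined. Just as a sketch, assuming a Docker Compose service named crawler with the same image and mount as above:

services:
  crawler:
    image: docker.elastic.co/integrations/crawler:latest
    environment:
      - TZ=Asia/Tokyo   # swap in your own timezone name
    volumes:
      - ./crawl.yml:/crawl.yml
    command: jruby bin/crawler schedule /crawl.yml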

Let me know if that works or not.

@nfeekery thank you very much!

I forgot that all the scheduling in the current Enterprise Search crawler is UTC, but that got it going. Referencing the time (and making a note of it) as UTC will work.
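For reference, just a sketch of the conversion for my case: 12:40pm EST (UTC-5) is 17:40 UTC (16:40 UTC if the clock is actually on EDT), so keeping the container on UTC the pattern would be something like:

schedule:
   pattern: "40 17 * * *"   # 12:40pm EST written as UTC; 40 16 * * * for EDT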

Thanks again.