Hi,
edit: crawler version 0.2.2
I tried to run a crawl on a schedule, but the crawl never seems to start. It may be something I'm missing, but I wanted to check whether anyone else has run into this.
If I run a normal ad hoc crawl, things go fine:
docker exec -it crawler bin/crawler crawl config/crawler.yml
[crawl:68090789e16dda75c7e6831c] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[crawl:68090789e16dda75c7e6831c] [primary] ES connections will be authorized with configured API key
[crawl:68090789e16dda75c7e6831c] [primary] ES connections will use SSL with ca_fingerprint
[crawl:68090789e16dda75c7e6831c] [primary] Connected to ES at https://elastic:9200 - version: 8.17.2; build flavor: default
[crawl:68090789e16dda75c7e6831c] [primary] Index [contoso-search-testindex] was found!
[crawl:68090789e16dda75c7e6831c] [primary] Elasticsearch sink initialized for index [contoso-search-testindex] with pipeline [contoso-search-testindex@custom]
[crawl:68090789e16dda75c7e6831c] [primary] Starting the primary crawl with up to 1 parallel thread(s)...
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=2, pages_visited=1, urls_allowed=3, urls_denied={}, crawl_duration_msec=3216, crawling_time_msec=1967.0, avg_response_time_msec=1967.0, active_threads=1, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>1}
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=11, pages_visited=6, urls_allowed=16, urls_denied={:already_seen=>32, :domain_filter_denied=>10, :incorrect_protocol=>6}, crawl_duration_msec=13577, crawling_time_msec=3230.0, avg_response_time_msec=538.3333333333334, active_threads=0, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>6}
...
[crawl:68090789e16dda75c7e6831c] [primary] Crawl queue is empty, finishing the primary crawl
[crawl:68090789e16dda75c7e6831c] [primary] Sending bulk request with 39 items and resetting queue...
[crawl:68090789e16dda75c7e6831c] [primary] Successfully indexed 39 docs.
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl stage. Result: success; Successfully finished the primary crawl with an empty crawl queue
[crawl:68090789e16dda75c7e6831c] [primary] No documents were found for the purge crawl. Skipping purge crawl.
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl stage. Result: success; Skipped purge crawl as no outdated documents were found.
[crawl:68090789e16dda75c7e6831c] [primary] Closing the output sink before finishing the crawl...
[crawl:68090789e16dda75c7e6831c] [primary] All indexing operations completed. Successfully upserted 39 docs with a volume of 139808 bytes. Failed to index 0 docs with a volume of 0 bytes. Deleted 0 outdated docs from the index.
[crawl:68090789e16dda75c7e6831c] [primary] Crawl status: queue_size=0, pages_visited=220, urls_allowed=219, urls_denied={:already_seen=>673, :domain_filter_denied=>136, :incorrect_protocol=>78, :nofollow=>102}, crawl_duration_msec=273283, crawling_time_msec=32698.0, avg_response_time_msec=148.62727272727273, active_threads=0, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"200"=>219, "404"=>1}
[crawl:68090789e16dda75c7e6831c] [primary] Crawl shutdown complete
[crawl:68090789e16dda75c7e6831c] [primary] Finished a crawl. Result: success; Successfully finished the primary crawl with an empty crawl queue | Skipped purge crawl as no outdated documents were found.
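For context, the crawler.yml I'm passing in looks roughly like this (domain and credentials redacted, other values matching what the log above shows; the key names are my reading of the crawler's bundled example config, so treat them as approximate):

domains:
  - url: https://www.example.com   # real site redacted
output_sink: elasticsearch
output_index: contoso-search-testindex
elasticsearch:
  host: https://elastic
  port: 9200
  api_key: <redacted>
  ca_fingerprint: <redacted>
  pipeline: contoso-search-testindex@custom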
If I run it as a scheduled crawl instead, it seems to start up and sit idle, waiting for the scheduled time to trigger the crawl:
docker exec -it crawler bin/crawler schedule config/crawler.yml
[crawl:68092445e16dda994aa81779] [primary] Crawler initialized with a cron schedule of 40 12 * * *
But when the scheduled time passes it just sits there and never prints any progress, and if I query the index 10-20 minutes later it is still empty. This is a very small crawl of about 40 documents that normally finishes within roughly 5 minutes.
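For the scheduled run, the only change is a schedule block added to the same config (the pattern key is what I understood from the example config, so correct me if the name is different):

schedule:
  pattern: "40 12 * * *"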