For one of our clients we use App Search for our search functionality, and we use the out-of-the-box App Search crawler to crawl our websites.
We observed that crawling takes quite a long time, which concerns us because we have a large number of pages. We would like to set up a scheduler and run the crawler every day so that any changes to our pages are picked up, so we need to be sure the crawler finishes quickly enough.
As a test we crawled a few locales of our website; the initial crawl generated about 42,000 documents and took about 9 hours to complete.
Afterwards we re-crawled the websites and the crawl completed in about 6 hours, so faster, but it still takes quite a long time.
So I am trying to understand how re-crawling works in the background, and how to optimize the crawler so that it runs efficiently. I tried to find this information in the documentation, but did not find what I was looking for.
Does the crawler check if the pages are modified before crawling them?
Also, if a page becomes inactive and we no longer have it in the sitemap, will its document be removed when we re-crawl the website?
Hi Andi,
Have you tried tweaking your configuration a bit? There are a few options that could be helpful in your case:
crawler.crawl.threads.limit
crawler.crawl.url_queue.url_count.limit
connector.crawler.http.head_requests.enabled
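These are set in the Enterprise Search configuration file. A minimal sketch, assuming the standard `enterprise-search.yml` location; the values below are illustrative starting points, not tuned recommendations:

```yaml
# enterprise-search.yml — illustrative values only; check the defaults
# and limits documented for your Enterprise Search version.
crawler.crawl.threads.limit: 20                   # parallel crawl threads
crawler.crawl.url_queue.url_count.limit: 100000   # max URLs held in the crawl queue
connector.crawler.http.head_requests.enabled: true  # issue HEAD requests before fetching
```

After changing these, the Enterprise Search instance needs a restart for the settings to take effect.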
The last one, if activated, makes the crawler issue HEAD requests to check whether a resource has changed before it goes ahead and indexes it again. If you have large resources, and depending on the web server you use, this could speed up the crawls.
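The idea behind those HEAD requests is standard HTTP revalidation: compare cheap response headers (`ETag`, `Last-Modified`) against what was recorded at the last crawl, and only re-download and re-index when something changed. A minimal sketch of that decision logic — the function name and header priority here are illustrative, not the crawler's actual internals:

```python
def needs_reindex(stored_headers: dict, head_headers: dict) -> bool:
    """Return True when a resource looks changed since the last crawl.

    stored_headers: headers recorded when the page was last indexed.
    head_headers:   headers from a fresh HEAD request.
    """
    # Check validators in order of reliability; the first one present on
    # both sides decides the outcome.
    for key in ("ETag", "Last-Modified", "Content-Length"):
        old, new = stored_headers.get(key), head_headers.get(key)
        if old is not None and new is not None:
            return old != new
    # No usable validator (server sends none): re-crawl to be safe.
    return True
```

Note the fallback: if your web server sends neither `ETag` nor `Last-Modified`, this kind of check cannot skip anything, so enabling HEAD requests would add a round trip per page without saving any re-indexing. It is worth verifying what headers your server actually returns before relying on this option.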
We tried to find a page on how to optimize/improve crawling and what the root cause of low performance could be, but guidance on this is a bit lacking. After creating this discuss topic we noticed that the average response time is high, so our test on a lower tier does not seem representative of production either. We need to retest on our side, but if anyone has additional details, they would still be helpful.