Dear Elastic community,
For one of our clients we are using AppSearch for our search functionality, and we are using the OOTB AppSearch crawler to crawl our websites.
We observed that crawling takes quite a long time, and we are a bit concerned that this could become an issue because of our large number of pages. We would like to set up a scheduler and run the crawler every day so that any changes to our pages are picked up, so we need to be sure that the crawler finishes fast enough.
As a test we crawled a few locales of our website, and the crawl generated about 42,000 documents. This first crawl of the website took about 9 hours to complete.
Afterwards we re-crawled the websites and the crawl completed in about 6 hours, which is faster but still quite long.
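As a rough back-of-the-envelope check, the figures above can be turned into a throughput number. The sketch below uses only the approximate values quoted in this post (42,000 documents, 9 and 6 hours); the function name `throughput` is made up for illustration.

```python
# Approximate figures reported in this post.
documents = 42_000
first_crawl_hours = 9
recrawl_hours = 6


def throughput(docs: int, hours: float) -> tuple[float, float]:
    """Return (documents per second, seconds per document)."""
    seconds = hours * 3600
    return docs / seconds, seconds / docs


for label, hours in (("first crawl", first_crawl_hours), ("re-crawl", recrawl_hours)):
    per_second, per_doc = throughput(documents, hours)
    print(f"{label}: {per_second:.2f} docs/s, {per_doc:.2f} s per document")
```

That works out to roughly 1.3 documents per second for the first crawl and about 1.9 per second for the re-crawl.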
So, I am trying to understand how the re-crawling works in the background, and how to optimize the crawler so that it runs efficiently. I tried to find this information in the documentation, but did not find what I was looking for.
Does the crawler check if the pages are modified before crawling them?
Also, if a page becomes inactive and we do not have it in the sitemap anymore, will its document be removed if we re-crawl the website?
Thanks in advance for any input,
Have you tried tweaking your configuration a bit? We have a few options that could be helpful in your case:
The last one, if activated, makes HEAD calls to check whether the resource has changed before the crawler goes ahead and indexes it again. If you have large resources, and depending on the web server you use, this could speed up the crawls.
Scratch that last option, it's not for the AppSearch crawler. Sorry for the confusion.
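For readers wondering what a HEAD-based change check looks like in general, here is a minimal sketch in Python. It is a generic illustration of the idea, not how any Elastic crawler implements it, and it is not an AppSearch crawler feature. The function names and the shape of the stored record are assumptions made for this example, and it only helps if the web server actually sends `ETag` or `Last-Modified` headers.

```python
import urllib.request


def fetch_validators(url: str, timeout: float = 10.0) -> dict:
    """Send a HEAD request and return the ETag / Last-Modified headers, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
        }


def is_unchanged(stored: dict, current: dict) -> bool:
    """True only when a validator is present on both sides and matches.

    `stored` is the (hypothetical) record saved during the previous crawl.
    """
    for key in ("etag", "last_modified"):
        if stored.get(key) and current.get(key):
            return stored[key] == current[key]
    # No comparable validator: treat the page as changed and fetch it again.
    return False
```

A crawler built this way would call `fetch_validators(url)` first and only download and index the page when `is_unchanged(...)` returns `False`.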
Thank you still for the information. There are similar configuration options for AppSearch a bit further down on that page: Configuration | Enterprise Search documentation [master] | Elastic. Details on removal: Web crawler reference | App Search documentation [8.8] | Elastic.
We tried to find a page that explains how to optimize crawling and what the root cause of low performance could be, but guidance on this is a bit lacking. After creating this Discuss topic we noticed that the average response time is high, so our test in a lower tier does not seem to be representative of production. We need to retest on our side, but if someone has any additional details, they would still be helpful.
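To check whether the web server itself is the bottleneck, one option is to time a small sample of pages outside the crawler. The sketch below is only an illustration: the function names are made up, the sample URLs are whatever pages you choose, and the estimate assumes crawl time is dominated by page response time, which may not hold.

```python
import statistics
import time
import urllib.request


def fetch(url: str, timeout: float = 30.0) -> None:
    """Download one page and discard the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def average_response_seconds(urls, fetch_page=fetch) -> float:
    """Time each request sequentially and return the mean duration in seconds."""
    durations = []
    for url in urls:
        started = time.perf_counter()
        fetch_page(url)
        durations.append(time.perf_counter() - started)
    return statistics.mean(durations)


def estimated_crawl_hours(pages: int, avg_seconds: float, parallel_requests: int = 1) -> float:
    """Naive estimate: total request time spread evenly across parallel requests."""
    return pages * avg_seconds / parallel_requests / 3600
```

For example, `estimated_crawl_hours(42_000, 0.5)` gives about 5.8 hours for strictly sequential requests. Comparing such an estimate between the lower tier and production should show how much of the crawl duration comes from slow responses.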
Thank you very much for the suggestions.
We will try to do some tests after changing the mentioned configs.