For one of our clients we use App Search for our search functionality, and we use the out-of-the-box App Search crawler to crawl our websites.
We observed that crawling takes quite a long time, which concerns us because we have a large number of pages. We would like to set up a scheduler and run the crawler every day so that any changes to our pages are picked up, so we need to be sure the crawler finishes quickly enough.
As a test we crawled a few locales of our website; the initial crawl generated about 42,000 documents and took about 9 hours to complete.
Afterwards we re-crawled the websites and the crawl completed in about 6 hours, so faster, but it still takes quite a long time.
So I am trying to understand how re-crawling works in the background, and how to optimize the crawler so that it runs efficiently. I tried to find this information in the documentation, but did not find what I was looking for.
Does the crawler check if the pages are modified before crawling them?
Also, if a page becomes inactive and we no longer have it in the sitemap, will its document be removed when we re-crawl the website?
Hi Andi,
Have you tried tweaking your configuration a bit? There are a few options that could be helpful in your case:
crawler.crawl.threads.limit
crawler.crawl.url_queue.url_count.limit
connector.crawler.http.head_requests.enabled
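These are set in the Enterprise Search configuration file. A minimal sketch, assuming the standard `enterprise-search.yml` location; the values below are illustrative starting points, not tuned recommendations:

```yaml
# enterprise-search.yml — illustrative values only; check the defaults
# and limits documented for your Enterprise Search version.
crawler.crawl.threads.limit: 20                   # parallel crawl threads
crawler.crawl.url_queue.url_count.limit: 100000   # max URLs held in the crawl queue
connector.crawler.http.head_requests.enabled: true  # issue HEAD requests before fetching
```

After changing these, the Enterprise Search instance needs a restart for the settings to take effect.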
The last one, if activated, makes the crawler issue HEAD requests to check whether a resource has changed before it goes ahead and indexes it again. If you have large resources, and depending on the web server you use, this could speed up the crawls.
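The idea behind those HEAD requests is standard HTTP revalidation: compare cheap response headers (`ETag`, `Last-Modified`) against what was recorded at the last crawl, and only re-download and re-index when something changed. A minimal sketch of that decision logic — the function name and header priority here are illustrative, not the crawler's actual internals:

```python
def needs_reindex(stored_headers: dict, head_headers: dict) -> bool:
    """Return True when a resource looks changed since the last crawl.

    stored_headers: headers recorded when the page was last indexed.
    head_headers:   headers from a fresh HEAD request.
    """
    # Check validators in order of reliability; the first one present on
    # both sides decides the outcome.
    for key in ("ETag", "Last-Modified", "Content-Length"):
        old, new = stored_headers.get(key), head_headers.get(key)
        if old is not None and new is not None:
            return old != new
    # No usable validator (server sends none): re-crawl to be safe.
    return True
```

Note the fallback: if your web server sends neither `ETag` nor `Last-Modified`, this kind of check cannot skip anything, so enabling HEAD requests would add a round trip per page without saving any re-indexing. It is worth verifying what headers your server actually returns before relying on this option.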
We tried to find a page on how to optimize/improve crawling and what the root cause of low performance could be, but guidance on this is a bit lacking. After creating this discuss topic we noticed that the average response time is high, so our test on a lower tier does not seem representative of production either. We need to retest on our side, but if anyone has additional details, they would still be helpful.