This part does seem odd. I like to believe that the crawler is fast, but 2764 documents in 1ms seems too fast. Where are you getting the timing from? I'd suggest looking at your crawler event logs to see when a crawl starts and finishes.
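If it's easier, you can also pull those events straight out of Elasticsearch. Here's a minimal sketch, assuming your crawler event logs are shipped to an Elasticsearch index and use the crawler's ECS-style fields (`event.action` values like `crawl-start`/`crawl-end` and a `crawl.id`); the index pattern below is a placeholder, so point it at wherever your deployment actually stores the event logs:

```python
# Sketch: list recent crawl-start / crawl-end events so you can compare their
# timestamps. Index pattern, credentials, and field names are assumptions --
# adjust them to your deployment before running this.
import requests

ES_URL = "http://localhost:9200"
AUTH = ("elastic", "changeme")
EVENT_LOG_INDEX = "logs-crawler-*"  # placeholder index/data stream name

query = {
    "size": 50,
    "sort": [{"@timestamp": "desc"}],
    "query": {"terms": {"event.action": ["crawl-start", "crawl-end"]}},
    "_source": ["@timestamp", "event.action", "crawl.id"],
}

resp = requests.post(f"{ES_URL}/{EVENT_LOG_INDEX}/_search", json=query, auth=AUTH)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["@timestamp"], src["event"]["action"], src.get("crawl", {}).get("id"))
```

The gap between the `crawl-start` and `crawl-end` timestamps for the same `crawl.id` is the real wall-clock duration of the crawl.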
Mostly, yes, spikes in CPU and network are normal. The App Search and Elastic Web Crawlers both store a lot of state about the crawl in Elasticsearch indices. This makes the crawls very resilient: a node can shut down mid-crawl and another node can pick up where it left off, without rework. It comes at a cost, though, of a LOT of network traffic between the Enterprise Search App Server (which runs the crawlers) and Elasticsearch as all that state is persisted, updated, and fetched.
You may be interested to read about the recently announced Open Web Crawler, which has some significant performance boosts in these areas, as it does away with that state persistence. See: Open Crawler released for tech-preview — Elastic Search Labs.
Not directly, but you may want to try changing:

```yaml
crawler.crawl.threads.limit: 10
```

Lowering this may slow down your crawls, but it will also limit how many threads the JVM tries to use for a given crawl.
There aren't a lot of great options for you here, unfortunately.
- You could self-manage enterprise-search nodes, so that you have more control over the resources they get. The crawler, for instance, can't really benefit from multiple Enterprise Search nodes, since a given crawl only runs on a single node. But you may want more (but smaller) nodes to balance search requests.
- You could pivot to using the new Open Crawler. You could use App Search Elasticsearch Engines instead, and probably very little would change on the search side. But you'd need to self-manage the crawler, you'd have to accept the performance-for-resiliency tradeoff mentioned above, and the Open Crawler is not yet GA.
- You can use more resources. This can obviously get quite spendy.
- You can perform fewer/smaller crawls. If you're not already using them, Partial Crawls are a powerful tool to limit the size and scope of a given crawl. If you can specify `depth=1` and enumerate all the pages that have changed as entrypoints, your CPU utilization and the overall crawl time should go way down (see the sketch below).
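To make that last bullet concrete: if you're on the App Search crawler, a partial crawl can be requested through the crawl requests API with overrides. This is only a sketch, and the endpoint path, override field names (`max_crawl_depth`, `seed_urls`), engine name, and URLs are assumptions from my reading of the App Search crawler API, so verify them against the API reference for your version:

```python
# Sketch: kick off a partial crawl limited to a few changed pages at depth 1.
# Endpoint, field names, and values are assumptions -- verify against the
# App Search crawler API reference for your version.
import requests

ENTERPRISE_SEARCH_URL = "http://localhost:3002"   # your Enterprise Search base URL
API_KEY = "private-xxxxxxxxxxxxxxxx"              # an App Search private API key
ENGINE = "my-engine"                              # hypothetical engine name

payload = {
    "overrides": {
        "max_crawl_depth": 1,                     # depth=1: don't follow links off the seeds
        "seed_urls": [                            # enumerate only the pages that changed
            "https://www.example.com/docs/updated-page",
            "https://www.example.com/blog/new-post",
        ],
    }
}

resp = requests.post(
    f"{ENTERPRISE_SEARCH_URL}/api/as/v1/engines/{ENGINE}/crawler/crawl_requests",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # should echo back the new crawl request's id and status
```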
Hope this helps!