This part does seem odd. I like to believe that the crawler is fast, but 2764 documents in 1ms seems too fast. Where are you getting the timing from? I'd suggest looking at your crawler event logs to see when a crawl starts and finishes.
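If it's easier, you can also pull those events straight out of Elasticsearch. Here's a minimal sketch, assuming your crawler event logs are shipped to an Elasticsearch index and use the crawler's ECS-style fields (`event.action` values like `crawl-start`/`crawl-end` and a `crawl.id`); the index pattern below is a placeholder, so point it at wherever your deployment actually stores the event logs:

```python
# Sketch: list recent crawl-start / crawl-end events so you can compare their
# timestamps. Index pattern, credentials, and field names are assumptions --
# adjust them to your deployment before running this.
import requests

ES_URL = "http://localhost:9200"
AUTH = ("elastic", "changeme")
EVENT_LOG_INDEX = "logs-crawler-*"  # placeholder index/data stream name

query = {
    "size": 50,
    "sort": [{"@timestamp": "desc"}],
    "query": {"terms": {"event.action": ["crawl-start", "crawl-end"]}},
    "_source": ["@timestamp", "event.action", "crawl.id"],
}

resp = requests.post(f"{ES_URL}/{EVENT_LOG_INDEX}/_search", json=query, auth=AUTH)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["@timestamp"], src["event"]["action"], src.get("crawl", {}).get("id"))
```

The gap between the `crawl-start` and `crawl-end` timestamps for the same `crawl.id` is the real wall-clock duration of the crawl.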
Mostly, yes, spikes in CPU and network are normal. The App Search and Elastic Web Crawlers both store a lot of state about the crawl in Elasticsearch indices. This makes the crawls very resilient: a node can shut down mid-crawl and another node can pick up where it left off, without rework. It comes at a cost, though, of a LOT of network traffic between the Enterprise Search App Server (which runs the crawlers) and Elasticsearch as all that state is persisted, updated, and fetched.
You may be interested to read about the recently announced Open Web Crawler, which has some significant performance boosts in these areas, as it does away with that state persistence. See: Open Crawler released for tech-preview — Elastic Search Labs.
Not directly, but you may want to try changing:

```yaml
crawler.crawl.threads.limit: 10
```

Lowering this may slow down your crawls, but it will also limit how many threads the JVM tries to use for a given crawl.
There aren't a lot of great options for you here, unfortunately.
- You could self-manage enterprise-search nodes, so that you have more control over the resources they get. The crawler, for instance, can't really benefit from multiple Enterprise Search nodes, since a given crawl only runs on a single node. But you may want more (but smaller) nodes to balance search requests.
- You could pivot to using the new Open Crawler. You could use App Search Elasticsearch Engines instead, and probably very little would change on the search side. But you'd need to self-manage the crawler, you'd have to accept the performance-for-resiliency tradeoff mentioned above, and the Open Crawler is not yet GA.
- You can use more resources. This can obviously get quite spendy.
- You can perform fewer/smaller crawls. If you're not already using them, Partial Crawls are a powerful tool to limit the size and scope of a given crawl. If you can specify `depth=1` and enumerate all the pages that have changed as entrypoints, your CPU utilization and the overall crawl time should go way down (see the sketch below).
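To make that last bullet concrete: if you're on the App Search crawler, a partial crawl can be requested through the crawl requests API with overrides. This is only a sketch, and the endpoint path, override field names (`max_crawl_depth`, `seed_urls`), engine name, and URLs are assumptions from my reading of the App Search crawler API, so verify them against the API reference for your version:

```python
# Sketch: kick off a partial crawl limited to a few changed pages at depth 1.
# Endpoint, field names, and values are assumptions -- verify against the
# App Search crawler API reference for your version.
import requests

ENTERPRISE_SEARCH_URL = "http://localhost:3002"   # your Enterprise Search base URL
API_KEY = "private-xxxxxxxxxxxxxxxx"              # an App Search private API key
ENGINE = "my-engine"                              # hypothetical engine name

payload = {
    "overrides": {
        "max_crawl_depth": 1,                     # depth=1: don't follow links off the seeds
        "seed_urls": [                            # enumerate only the pages that changed
            "https://www.example.com/docs/updated-page",
            "https://www.example.com/blog/new-post",
        ],
    }
}

resp = requests.post(
    f"{ENTERPRISE_SEARCH_URL}/api/as/v1/engines/{ENGINE}/crawler/crawl_requests",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # should echo back the new crawl request's id and status
```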
Hope this helps!