We’re reaching out for help with our Elastic Enterprise Search setup, which we use to provide page search on our website by indexing pages with the App Search web crawler. We run App Search on an Elastic Cloud deployment.
Our Elastic Cloud service configuration is designed with autoscaling to dynamically adjust capacity as needed. Despite this, we've observed some challenges with resource utilization that we'd like to address:
On average, our CPU usage is around 10%, with memory pressure at 19%. However, we see significant spikes in resource usage during web crawls. During these spikes, CPU usage rises to 200% and the number of requests goes from 400 to 200,000, pushing the system to use CPU credits. The surge lasts around 30 minutes, even though the crawls report taking 1 ms.
On the other hand, our storage and memory seem to be underutilized, with 14% of storage space in use and memory usage peaking at only 27%. This raises the question of why CPU becomes the bottleneck for our current setup during web crawling.
We wanted to know whether the spikes in CPU usage during web crawling are normal, or whether something in our configuration is wrong and causing them (although the App Search web interface for web crawling doesn’t have many configuration options, and I can't seem to find other Elastic crawler interfaces). Are there any ways to optimize this process?
Our Elastic configuration is:
Using Enterprise Search App Search
Elastic Deployment Version is v8.3.3
Using an App search engine with 2764 indexed documents and 18 fields.
Using automatic web crawling every 5 hours
The Web crawler is set up in the App search Engine Interface, there is no external config or implementation.
Duplicate document Handling has the following fields configured
This part does seem odd. I like to believe that the crawler is fast, but 2764 documents in 1ms seems too fast. Where are you getting the timing from? I'd suggest looking at your crawler event logs to see when a crawl starts and finishes.
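As a rough illustration of that check, here is a minimal Python sketch that pairs crawl start and end events and computes how long each crawl really took. The field names (`crawl.id`, `event.action`, `@timestamp`), the `crawl-start`/`crawl-end` action values, and the sample events are assumptions for illustration, so compare them with what your own crawler event logs actually contain before relying on this.

```python
from datetime import datetime


def crawl_durations(events):
    """Return {crawl_id: seconds between its start and end events}.

    The field names and action values used here are assumptions;
    adjust them to match your crawler event logs.
    """
    starts, ends = {}, {}
    for ev in events:
        crawl_id = ev["crawl.id"]
        ts = datetime.fromisoformat(ev["@timestamp"])
        if ev["event.action"] == "crawl-start":
            starts[crawl_id] = ts
        elif ev["event.action"] == "crawl-end":
            ends[crawl_id] = ts
    # Only report crawls for which both a start and an end were seen.
    return {
        cid: (ends[cid] - starts[cid]).total_seconds()
        for cid in starts
        if cid in ends
    }


# Made-up sample events, not real log output.
sample = [
    {"@timestamp": "2024-01-01T10:00:00+00:00", "crawl.id": "abc",
     "event.action": "crawl-start"},
    {"@timestamp": "2024-01-01T10:31:30+00:00", "crawl.id": "abc",
     "event.action": "crawl-end"},
]

durations = crawl_durations(sample)
print(durations)
```

If the durations computed this way are close to the 30 minute CPU surge, the 1 ms figure is probably measuring something other than the full crawl.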
Mostly, yes, spikes in CPU and network are normal. The App Search and Elastic Web Crawlers both store a lot of state about the crawl in Elasticsearch indices. This makes the crawls very resilient: a node can shut down mid-crawl, and another node can pick it up where it left off, without rework. This comes at a cost, though: a LOT of network traffic between the Enterprise Search App Server (which runs the crawlers) and Elasticsearch, as all that state is persisted, updated, and fetched.
You may be interested to read about the recently announced Open Web Crawler, which has some significant performance boosts in these areas, as it does away with that state persistence. See: "Open Crawler released for tech-preview" on Elastic Search Labs.
Not directly, but you may want to try changing:
crawler.crawl.threads.limit: 10
Lowering this may slow down your crawls, but will also limit how many threads the JVM tries to use for a given crawl.
There aren't a lot of great options for you here, unfortunately.
You could self-manage Enterprise Search nodes, so that you have more control over the resources they get. The crawler, for instance, can't really benefit from having multiple Enterprise Search nodes, since a given crawl only runs on a single node. But you may want more (but smaller) nodes to balance search requests.
You could pivot to using the new Open Crawler. You could use App Search Elasticsearch Engines instead, and probably very little would change on the search side. But you'd need to self-manage the crawler. You'd also have to accept the tradeoff of performance for resiliency. And the Open Crawler is not yet GA.
You can use more resources. This can obviously get quite spendy.
You can perform fewer/smaller crawls. If you're not already using them, Partial Crawls are a powerful tool to limit the size and scope of a given crawl. If you can specify depth=1 and enumerate all the pages that have changed as entrypoints, your CPU utilization and the overall crawl time should go way down.
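A minimal Python sketch of what requesting such a partial crawl might look like is below. The endpoint path and the `overrides` fields (`max_crawl_depth`, `seed_urls`) are written from memory of the App Search web crawler API, and `base_url`, `engine`, `api_key`, and both function names are hypothetical, so check all of them against the API reference for your version before using this.

```python
import json
import urllib.request


def build_partial_crawl_body(changed_urls, max_depth=1):
    """Build the request body for a partial crawl of the given pages.

    The "overrides" field names are assumptions to verify in the docs.
    """
    if not changed_urls:
        raise ValueError("a partial crawl needs at least one entry point")
    return {
        "overrides": {
            "max_crawl_depth": max_depth,    # depth=1: only the listed pages
            "seed_urls": list(changed_urls)  # the pages that changed
        }
    }


def request_partial_crawl(base_url, engine, api_key, changed_urls):
    """POST the partial crawl request (endpoint path is an assumption)."""
    body = json.dumps(build_partial_crawl_body(changed_urls)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/as/v1/engines/{engine}/crawler/crawl_requests",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example body for two changed pages (URLs are placeholders).
body = build_partial_crawl_body(
    ["https://example.com/news", "https://example.com/pricing"]
)
print(json.dumps(body, indent=2))
```

The body builder is kept separate from the HTTP call so the list of changed pages can be inspected or logged before any crawl is actually started.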
The Open Crawler sure seems interesting, I'm liking those CPU decreases shown in the preview!
I do have one question that I'm hoping you can help with: you mention configuring the crawler's thread limit.
Are these options only available if we use the web crawler API? Our current use case only uses the UI in App Search, so I assume that to use these features we would have to extend our development for the web crawler. (I assume that for partial crawls we would just have to run them manually in the interface.)