We are using App Search, currently have 4 engines, and plan to add many more. Documents are indexed into the engines by the App Search web crawlers.
We have noticed that our search service is often unavailable. After some research we found out that the crawlers produce a lot of load and cause Elasticsearch to restart. The more engines we add (more crawlers), the worse this problem becomes.
Every crawler goes through all our sites (around 10k URLs) once a day to update the data for the search engine. We can see that whenever a crawler is running, our search service becomes unavailable. I don't think increasing the size of the Elasticsearch cluster just for the crawlers is the correct solution. It's already a big cluster (cost: 1k USD per month), and when the crawlers are not running there is barely any load.
Do you have other ideas to fix our issue? Is there any way to slow the crawler down when we reach high load, to avoid crashing?
There's definitely a balancing act to be performed here. You need to make some tradeoff decisions between dimensions of:
how fast do you need changes picked up and searchable?
how many documents can you afford to ingest?
how much query traffic can you afford to support?
can you afford periodic downtime?
Definitely, you can't expect to have an infinitely large number of documents constantly being ingested from the crawler, on the smallest Elasticsearch cluster, without it impacting your search user experience.
Things you may want to look into:
Do you need multiple engines all crawling the same sites? Typically, you'd have different data in different engines. Try to make sure you're not redundantly crawling the same pages multiple times. Make sure you've read about Meta Engines (there's a rough API sketch for creating one further down).
Could you schedule your crawls to run during low-traffic search periods? For example, crawling during early morning hours if your peak search traffic is during afternoons/evenings. You can do this with a cron job to trigger crawls (see the sketch further down), or you can look at using the Elastic Crawler (instead of the App Search Crawler) to get specific-time scheduling.
Can you decrease the frequency and/or scope of your crawls? Do you need to fully recrawl your site each day, or can you use "partial crawls" to pick up just new or edited pages? This is very dependent on your site's structure, but many frequently updated sites organize pages by date, so you can structure partial crawls to use date-specific entrypoints and limited crawl depth to only grab the newest content (there's a partial-crawl sketch further down). Full crawls are likely still periodically necessary, but making these weekly or monthly may help you a lot.
Is the default crawl depth necessary? By default, the crawler will go down 10 levels. But if your site's sitemap is complete, or you provide an extensive list of entrypoints, everything past the first few levels might be redundant. You can analyze your crawler's event logs to understand if you're spending a lot of cycles re-evaluating URLs for pages that you've already seen previously. Since Elasticsearch stores the state of your crawls, any work you can eliminate from your crawl will result in reduced load on Elasticsearch.
You can also look at configuring your crawler to use fewer resources (crawl will take longer, but put less load on ES) or more resources (you'll need bigger nodes, and ES will be under even heavier load, but crawls may complete faster). See the crawler configuration reference.
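On the Meta Engines point: if you end up keeping one regular engine per site and searching across them through a meta engine, here's a rough sketch of creating one via the App Search API, using Python and the requests library. The host, API key, and engine names are placeholders, and the request shape can vary by version, so check the App Search API reference for your deployment.

```python
import requests

# Placeholder host, private API key, and engine names; replace with your own.
BASE_URL = "https://my-deployment.ent.example.com/api/as/v1"
HEADERS = {"Authorization": "Bearer private-xxxxxxxxxxxx"}

# A meta engine searches across existing engines without re-crawling
# or re-indexing their documents.
resp = requests.post(
    f"{BASE_URL}/engines",
    headers=HEADERS,
    json={
        "name": "all-sites-meta",
        "type": "meta",
        "source_engines": ["site-a", "site-b", "site-c"],
    },
)
resp.raise_for_status()
print(resp.json())
```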
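For the cron-based scheduling idea, here's a minimal sketch of a script that kicks off a crawl through the App Search web crawler API so you can run it during your quiet hours. Host, key, and engine name are again placeholders; verify the endpoint against the web crawler API reference for your version.

```python
import requests

# Placeholder host, private API key, and engine name; replace with your own.
BASE_URL = "https://my-deployment.ent.example.com/api/as/v1"
HEADERS = {"Authorization": "Bearer private-xxxxxxxxxxxx"}
ENGINE = "site-a"

# Request a crawl for this engine. Schedule it from cron during a
# low-traffic window, e.g.:
#   0 3 * * * /usr/bin/python3 /opt/scripts/trigger_crawl.py
resp = requests.post(
    f"{BASE_URL}/engines/{ENGINE}/crawler/crawl_requests",
    headers=HEADERS,
)
resp.raise_for_status()
print("Crawl request created:", resp.json())
```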
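And for the partial-crawl / crawl-depth ideas, the same crawl-request call can (depending on your version) take per-request overrides such as seed URLs and a reduced maximum depth. The override field names below are my assumption of how the web crawler API exposes this, so double-check them against your version's documentation, and adjust the entrypoint to however your site organizes new content.

```python
import requests

# Placeholder host, private API key, and engine name; replace with your own.
BASE_URL = "https://my-deployment.ent.example.com/api/as/v1"
HEADERS = {"Authorization": "Bearer private-xxxxxxxxxxxx"}
ENGINE = "site-a"

# Partial crawl: start only from a date-specific entrypoint and stay
# shallow, instead of walking the whole site to the default depth of 10.
# The "overrides" field names are assumptions; confirm them for your version.
resp = requests.post(
    f"{BASE_URL}/engines/{ENGINE}/crawler/crawl_requests",
    headers=HEADERS,
    json={
        "overrides": {
            "seed_urls": ["https://www.example.com/news/2024/"],
            "max_crawl_depth": 2,
            "sitemap_discovery_disabled": True,
        }
    },
)
resp.raise_for_status()
print("Partial crawl request created:", resp.json())
```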
Is there any way to slow the crawler down when we reach high load, to avoid crashing?
This isn't really something the Crawler can do. If Elasticsearch starts sending it 429s, it'll wait and retry, but it's not going to automatically detect that Elasticsearch is at 80% query capacity and pre-emptively throttle itself to leave a buffer for other Elasticsearch clients.
I don't think increasing the size of the Elasticsearch cluster just for the crawlers is the correct solution.
It depends.
Scaling Elasticsearch to meet your ingestion needs is perfectly valid. If you determine that you NEED to ingest a ton of data, constantly, then you have to scale ES to meet that need.