Problem:
We are running into a problem using the ES web crawler -> we set a crawl rule so that when it hits a specific web page, e.g. [disallow] [contains] [healthtopics], it should not crawl that page or any of its "child" pages.
Observation:
Currently, we are experiencing:
When ES starts crawling the domain/seed URL -> https://domain_name.com/
It reaches the URL https://domain_name.com/healthtopics.html and correctly does NOT crawl that "healthtopics.html" page.
But that page links to child pages that don't have "healthtopics" in their URLs -> so the web crawler crawls all of those child pages anyway.
E.g., here are some example child URLs under the parent https://domain_name.com/healthtopics.html page:
https://domain_name.com/bloodheartandcirculation.html
https://domain_name.com/brainandnerves.html
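Our working assumption (not confirmed in the ES docs we've read) is that crawl rules are evaluated against each URL independently, so a [disallow] [contains] [healthtopics] rule only blocks URLs whose own address contains the substring. A quick sketch of that per-URL matching, with our real example URLs:

```python
def url_disallowed(url: str, patterns: list[str]) -> bool:
    # Mimic a "disallow / contains" crawl rule: block a URL only if
    # one of the patterns appears somewhere in the URL itself.
    return any(p in url for p in patterns)

rules = ["healthtopics"]

# The parent page matches the rule and is skipped...
print(url_disallowed("https://domain_name.com/healthtopics.html", rules))              # True

# ...but the child URLs do not contain the substring, so they pass the rule
# and get crawled if the crawler discovers them through any link.
print(url_disallowed("https://domain_name.com/bloodheartandcirculation.html", rules))  # False
print(url_disallowed("https://domain_name.com/brainandnerves.html", rules))            # False
```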
ASK:
Do you know a way, through the ES web crawler -> "Crawl rules" tab, to tell the crawler: once you hit a URL that's on the DO NOT crawl rules list -> STOP crawling that branch and go back to the preceding URL?
E.g., when the web crawler hits this URL -> https://domain_name.com/healthtopics.html -> STOP!! Go back.
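To make the ask concrete, here is a minimal sketch (this is NOT an ES feature, just an illustration of the desired behavior) of a crawler that, on hitting a disallowed URL, skips the page and never enqueues its children, so the whole subtree is pruned. The link graph below is a made-up stand-in for the real site:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages and their outlinks.
LINKS = {
    "https://domain_name.com/": [
        "https://domain_name.com/healthtopics.html",
        "https://domain_name.com/about.html",
    ],
    "https://domain_name.com/healthtopics.html": [
        "https://domain_name.com/bloodheartandcirculation.html",
        "https://domain_name.com/brainandnerves.html",
    ],
}

def crawl(seed: str, disallow_substrings: list[str]) -> list[str]:
    crawled, queue, seen = [], deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        # Desired behavior: skip the disallowed page AND never enqueue
        # its children, pruning the branch instead of just the page.
        if any(p in url for p in disallow_substrings):
            continue
        crawled.append(url)
        for child in LINKS.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return crawled

print(crawl("https://domain_name.com/", ["healthtopics"]))
# -> ['https://domain_name.com/', 'https://domain_name.com/about.html']
```

The child pages never enter the queue, because the only path to them goes through the disallowed parent. Note this only works if the children are not reachable from any other (allowed) page or a sitemap, which may be what is happening in our crawl.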