How do you tell the ES web crawler to stop crawling a parent's child webpages whose URLs don't include the parent's name

Problem:
We are running into a problem using the ES web crawler, where we set a rule so that when it hits a specific webpage, e.g. [disallow] [contains] [healthtopics], it doesn't crawl that page or any of its “child” pages.

Observation:
Currently, we are experiencing the following:
When ES starts crawling the domain/seed URL -> https://domain_name.com/
It reaches the URL https://domain_name.com/healthtopics.html and, as expected, does NOT crawl the “healthtopics.html” page.
But the healthtopics.html page links to child web pages whose URLs don't contain “healthtopics” – so the web crawler crawls all of those child pages.
E.g., here are some example child URLs under the parent https://domain_name.com/healthtopics.html:
https://domain_name.com/bloodheartandcirculation.html
https://domain_name.com/brainandnerves.html
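
To illustrate why we think this happens: a [contains] rule appears to be a plain substring check on the URL, so the child URLs simply never match. A minimal Python sketch of that matching logic (our understanding, not Elastic's actual implementation):

```python
def is_disallowed(url: str, pattern: str = "healthtopics") -> bool:
    """Mimics a [disallow] [contains] [pattern] crawl rule."""
    return pattern in url

urls = [
    "https://domain_name.com/healthtopics.html",
    "https://domain_name.com/bloodheartandcirculation.html",
    "https://domain_name.com/brainandnerves.html",
]
for url in urls:
    print(url, "->", "disallowed" if is_disallowed(url) else "allowed")

# https://domain_name.com/healthtopics.html -> disallowed
# https://domain_name.com/bloodheartandcirculation.html -> allowed
# https://domain_name.com/brainandnerves.html -> allowed
```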

ASK:
Do you know a way, through the ES web crawler “Crawl rules” tab, to tell the crawler: once you hit a URL that's in the DO NOT crawl rules list – STOP crawling and go back to the preceding URL?
E.g., when the web crawler hits this URL -> https://domain_name.com/healthtopics.html STOP!! Go back.

Hi @langelel,

If you have set up a crawl rule for [disallow] [contains] [healthtopics], then Crawler won't attempt to crawl any URL links on the healthtopics page, because it never ingests that page in the first place.
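
To make the mechanics concrete, here is a minimal sketch of a frontier-based crawl loop, assuming deny-before-fetch semantics (not Elastic's code; `fetch_links` and `is_disallowed` are hypothetical stand-ins):

```python
from collections import deque

def crawl(seed, fetch_links, is_disallowed):
    seen, queue, crawled = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        if is_disallowed(url):
            # denied page is never fetched, so its outlinks are never
            # discovered *through* it
            continue
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

# Demo with a fake link graph standing in for the site:
site = {
    "https://domain_name.com/": ["https://domain_name.com/healthtopics.html"],
    "https://domain_name.com/healthtopics.html": [
        "https://domain_name.com/bloodheartandcirculation.html",
    ],
}
print(crawl("https://domain_name.com/",
            lambda u: site.get(u, []),
            lambda u: "healthtopics" in u))
# ['https://domain_name.com/'] -- the child is never reached via the denied
# page; but if any *allowed* page also linked to it, it would still be crawled.
```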

If these child pages of healthtopics are still being ingested, then they are still "discoverable" by Crawler. There are a few ways this could happen:

  1. Other pages in the site link to those pages
  2. They are included in the sitemap.xml (you can check this with the sketch after this list)
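
To check the sitemap route, you can look for the child URLs in sitemap.xml directly. A quick standard-library sketch, assuming the sitemap lives at the conventional /sitemap.xml path (domain_name.com is the placeholder from your post):

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = "https://domain_name.com/sitemap.xml"
CHILDREN = {
    "https://domain_name.com/bloodheartandcirculation.html",
    "https://domain_name.com/brainandnerves.html",
}

with urllib.request.urlopen(SITEMAP) as resp:
    tree = ET.parse(resp)

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
listed = {loc.text.strip() for loc in tree.iter(NS + "loc") if loc.text}

for url in CHILDREN:
    print(url, "->", "in sitemap" if url in listed else "not in sitemap")
```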

So my recommendation is to either investigate how Crawler is discovering these pages, or write additional crawl rules that cover them, e.g. [disallow] [contains] [bloodheartandcirculation] and [disallow] [contains] [brainandnerves].
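
For the first option, a small diagnostic script can walk the site and report which pages link to the problem children. A sketch using only the Python standard library (SEED and CHILDREN are the placeholders from this thread; MAX_PAGES is an arbitrary safety cap):

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

SEED = "https://domain_name.com/"
CHILDREN = {
    "https://domain_name.com/bloodheartandcirculation.html",
    "https://domain_name.com/brainandnerves.html",
}
MAX_PAGES = 500

class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href> tags."""
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

seen, queue, fetched = {SEED}, deque([SEED]), 0
while queue and fetched < MAX_PAGES:
    page = queue.popleft()
    try:
        with urllib.request.urlopen(page) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        continue
    fetched += 1
    parser = LinkParser(page)
    parser.feed(html)
    for link in parser.links:
        if link in CHILDREN:
            print(f"{page} links to {link}")
        # stay on the same domain when expanding the walk
        if urlparse(link).netloc == urlparse(SEED).netloc and link not in seen:
            seen.add(link)
            queue.append(link)
```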