Problem:
We are running into a problem using the ES web crawler -> we set a crawl rule so that when it hits a specific web page, e.g. [disallow] [contains] [healthtopics], it should not crawl that page or any of its "child" pages.
Observation:
Currently, we are experiencing:
When ES starts crawling the domain/seed URL -> https://domain_name.com/
It reaches the URL https://domain_name.com/healthtopics.html and correctly does NOT crawl that "healthtopics.html" page.
But that page links to child pages that don't have "healthtopics" in their URLs -> so the web crawler crawls all of those child pages anyway.
E.g., here are some example child URLs under the parent https://domain_name.com/healthtopics.html page:
https://domain_name.com/bloodheartandcirculation.html
https://domain_name.com/brainandnerves.html
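Our working assumption (not confirmed in the ES docs we've read) is that crawl rules are evaluated against each URL independently, so a [disallow] [contains] [healthtopics] rule only blocks URLs whose own address contains the substring. A quick sketch of that per-URL matching, with our real example URLs:

```python
def url_disallowed(url: str, patterns: list[str]) -> bool:
    # Mimic a "disallow / contains" crawl rule: block a URL only if
    # one of the patterns appears somewhere in the URL itself.
    return any(p in url for p in patterns)

rules = ["healthtopics"]

# The parent page matches the rule and is skipped...
print(url_disallowed("https://domain_name.com/healthtopics.html", rules))              # True

# ...but the child URLs do not contain the substring, so they pass the rule
# and get crawled if the crawler discovers them through any link.
print(url_disallowed("https://domain_name.com/bloodheartandcirculation.html", rules))  # False
print(url_disallowed("https://domain_name.com/brainandnerves.html", rules))            # False
```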
ASK:
Do you know a way, through the ES web crawler -> "Crawl rules" tab, to tell the crawler: once you hit a URL that's on the DO NOT crawl rules list -> STOP crawling that branch and go back to the preceding URL?
E.g., when the web crawler hits this URL -> https://domain_name.com/healthtopics.html -> STOP!! Go back.
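To make the ask concrete, here is a minimal sketch (this is NOT an ES feature, just an illustration of the desired behavior) of a crawler that, on hitting a disallowed URL, skips the page and never enqueues its children, so the whole subtree is pruned. The link graph below is a made-up stand-in for the real site:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages and their outlinks.
LINKS = {
    "https://domain_name.com/": [
        "https://domain_name.com/healthtopics.html",
        "https://domain_name.com/about.html",
    ],
    "https://domain_name.com/healthtopics.html": [
        "https://domain_name.com/bloodheartandcirculation.html",
        "https://domain_name.com/brainandnerves.html",
    ],
}

def crawl(seed: str, disallow_substrings: list[str]) -> list[str]:
    crawled, queue, seen = [], deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        # Desired behavior: skip the disallowed page AND never enqueue
        # its children, pruning the branch instead of just the page.
        if any(p in url for p in disallow_substrings):
            continue
        crawled.append(url)
        for child in LINKS.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return crawled

print(crawl("https://domain_name.com/", ["healthtopics"]))
# -> ['https://domain_name.com/', 'https://domain_name.com/about.html']
```

The child pages never enter the queue, because the only path to them goes through the disallowed parent. Note this only works if the children are not reachable from any other (allowed) page or a sitemap, which may be what is happening in our crawl.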