Is there a way to ignore the meta robots nofollow / noindex directives in the Elastic crawler settings? If not, is there some other way to filter these meta fields out? We have a development site with these robots meta tags set on every page as a safeguard against Google, Bing, etc. indexing the site if it were accidentally made public.
No, you cannot bypass nofollow and robots.txt directives. These are in place to prevent our crawler from being used to abuse sites that do not belong to the person running it.
For your case, perhaps you could run the crawler against a version of the site that isn't reachable from the public internet (so Google and Bing can't see it) but is accessible inside a VPN where the crawler also runs?
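If your site templates can vary by environment, one way to combine the safeguard with the VPN approach is to emit the restrictive tag everywhere except the deployment the crawler is allowed to index. A minimal sketch — the `APP_ENV` variable name and the `"crawlable"` environment label are hypothetical, not part of any Elastic setting:

```python
import os

def robots_meta_tag(env=None):
    """Return a robots meta tag for the given environment.

    Every environment except the one explicitly marked "crawlable"
    (hypothetical label) gets noindex/nofollow, so a dev site stays
    out of public search indexes even if accidentally exposed, while
    the VPN-only instance the Elastic crawler hits is left open.
    """
    # APP_ENV is an assumed variable name; adapt to your deployment.
    env = env or os.environ.get("APP_ENV", "development")
    if env == "crawlable":
        return '<meta name="robots" content="index, follow">'
    return '<meta name="robots" content="noindex, nofollow">'
```

The default is deliberately restrictive: forgetting to set the variable keeps the safeguard in place.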
Correct. Ingest pipelines run after the crawler, once documents have been sent to Elasticsearch but before they are stored in an index. The decision of whether to process a page and follow its links happens in the crawler's own logic, well before that.
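To make the ordering concrete: an ingest pipeline can still reshape documents the crawler has already produced, for example dropping a field before indexing, but it has no influence on which pages get fetched or which links get followed. A minimal sketch, where the pipeline name and the `meta_description` field are purely illustrative:

```console
PUT _ingest/pipeline/drop-meta-description
{
  "description": "Runs after the crawl, before indexing; cannot affect crawl decisions",
  "processors": [
    { "remove": { "field": "meta_description", "ignore_missing": true } }
  ]
}
```

So a pipeline like this could strip crawled metadata from the stored documents, but it cannot make the crawler ignore noindex/nofollow in the first place.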