Ignoring robots noindex / nofollow in Elastic crawler

Hello!

Is there a way to ignore the meta robots nofollow / noindex tags in the Elastic crawler settings? If not, is there some other way to filter these meta fields out? We have a development site with these robots meta tags set on every page as a safeguard against Google, Bing, etc. indexing the site if it were accidentally made public.
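For reference, these are the standard robots meta elements in each page's `<head>`, e.g. `<meta name="robots" content="noindex, nofollow">`.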

Thanks

Imran

Hi @pngworkforce ,

No, you cannot bypass nofollow / noindex or robots.txt directives. These are in place to prevent our crawler from being used to abuse sites that do not belong to the person using the crawler.

For your case, perhaps you could test the crawler against a version of the site that isn't accessible from the public internet but can be reached over a VPN that the crawler is also on? Since network isolation would then be the safeguard against Google, Bing, etc., that copy could serve its pages without the meta robots tags.

Thanks, Sean. I guessed this might be the case.

Just to confirm: there is also no way to filter this out during or after the crawl using ingest pipelines, etc.?

Imran

Correct. Ingest pipelines run after the crawler, once docs have been sent to Elasticsearch but before they are stored in an Elasticsearch index. The decision of whether or not to process a page and follow its links happens in the crawler's logic, well before that.
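To make the ordering concrete, here is a minimal sketch of an ingest pipeline using the elasticsearch-py client. The pipeline id and the `meta_robots` field are hypothetical, purely for illustration: a pipeline like this can only act on documents the crawler has already decided to send, so a page the crawler skipped because of a noindex / nofollow tag never reaches it at all.

```python
# Illustrative sketch only -- the pipeline id and the "meta_robots" field
# are hypothetical, assuming a local cluster and the elasticsearch-py client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# An ingest pipeline can drop or transform documents it receives...
es.ingest.put_pipeline(
    id="drop-noindex-docs",  # hypothetical pipeline name
    description="Drop docs flagged noindex (illustrative only)",
    processors=[
        {
            "drop": {
                # 'ctx' is the in-flight document; this condition can only
                # fire on fields the crawler actually sent to Elasticsearch.
                "if": "ctx.meta_robots != null && ctx.meta_robots.contains('noindex')"
            }
        }
    ],
)
# ...but a page skipped by the crawler for its noindex / nofollow meta tag
# is never sent to Elasticsearch, so no processor in this pipeline ever
# gets a chance to see it, let alone "filter it back in".
```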