Is there a way to ignore the meta robots nofollow / noindex directives in the Elastic crawler settings? If not, is there some other way to filter these meta fields out? We have a development site with these robots meta tags set on every page as a safeguard against Google, Bing, etc. indexing the site if it were accidentally made public.
No, you cannot bypass nofollow and robots.txt directives. These are in place to prevent our crawler from being used to abuse sites that do not belong to the person running it.
For your case, perhaps you could run the crawler against a version of the site that isn't reachable from the public internet (so Google and Bing can't see it) but is accessible inside a VPN where the crawler also runs?
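If your site templates can vary by environment, one way to combine the safeguard with the VPN approach is to emit the restrictive tag everywhere except the deployment the crawler is allowed to index. A minimal sketch — the `APP_ENV` variable name and the `"crawlable"` environment label are hypothetical, not part of any Elastic setting:

```python
import os

def robots_meta_tag(env=None):
    """Return a robots meta tag for the given environment.

    Every environment except the one explicitly marked "crawlable"
    (hypothetical label) gets noindex/nofollow, so a dev site stays
    out of public search indexes even if accidentally exposed, while
    the VPN-only instance the Elastic crawler hits is left open.
    """
    # APP_ENV is an assumed variable name; adapt to your deployment.
    env = env or os.environ.get("APP_ENV", "development")
    if env == "crawlable":
        return '<meta name="robots" content="index, follow">'
    return '<meta name="robots" content="noindex, nofollow">'
```

The default is deliberately restrictive: forgetting to set the variable keeps the safeguard in place.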
Correct. Ingest pipelines run after the crawler, once documents have been sent to Elasticsearch but before they are stored in an index. The decision of whether to process a page and follow its links happens in the crawler's own logic, well before that.
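To make the ordering concrete: an ingest pipeline can still reshape documents the crawler has already produced, for example dropping a field before indexing, but it has no influence on which pages get fetched or which links get followed. A minimal sketch, where the pipeline name and the `meta_description` field are purely illustrative:

```console
PUT _ingest/pipeline/drop-meta-description
{
  "description": "Runs after the crawl, before indexing; cannot affect crawl decisions",
  "processors": [
    { "remove": { "field": "meta_description", "ignore_missing": true } }
  ]
}
```

So a pipeline like this could strip crawled metadata from the stored documents, but it cannot make the crawler ignore noindex/nofollow in the first place.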