Thanks. We found out that noindex and nofollow are working fine. We've also found out that the Crawler does not delete a page from the index if an already indexed page changes from INDEX to NOINDEX.
How can we force or achieve the deletion of a page in that case?
We've also found out that the Crawler does not delete a page from the index if an already indexed page changes from INDEX to NOINDEX.
That makes sense. The crawler only deletes pages that result in 404s on re-crawl. This is to ensure that legacy documents that simply drop off a link tree don't drop out of your search capabilities.
You can manually delete these documents from the index, and they will not be picked up again, due to the NOINDEX.
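As a rough illustration of the manual deletion mentioned above, here is a minimal Python sketch that builds a request for Elasticsearch's `_delete_by_query` API. It assumes the crawled documents store the page address in a keyword field named `url`. The host, the index name `search-my-site`, and the helper name are hypothetical. Check your index mapping before relying on the `term` query. The request is only constructed here, not sent.

```python
import json
import urllib.request
from urllib.parse import quote


def build_delete_by_url_request(es_base_url, index_name, page_url, api_key=None):
    """Build (but do not send) a _delete_by_query request for one crawled page.

    Assumption: the index has a keyword field called `url` holding the page URL.
    """
    body = {"query": {"term": {"url": page_url}}}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Elasticsearch API keys are passed as "ApiKey <base64 value>".
        headers["Authorization"] = f"ApiKey {api_key}"
    endpoint = f"{es_base_url.rstrip('/')}/{quote(index_name, safe='')}/_delete_by_query"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical host, index name, and page URL, for illustration only.
request = build_delete_by_url_request(
    "https://localhost:9200",
    "search-my-site",
    "https://www.example.com/old-page",
)
# To actually run it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the query before anything is deleted.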
We have to crawl your page in order to see <meta> tags and HTTP headers. If a page is still appearing in results, it's probably because we haven't crawled the page since you added the noindex rule.
I think we should not ignore the NOINDEX if a webmaster or editor/content provider decides to set this tag.
Could you please re-check whether it is possible to implement that feature?
By the way: does the crawler support X-Robots-Tag: noindex?
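To see which signal a given page is actually serving, a small check like the following can help. It is only a sketch using the Python standard library, and it says nothing about how the crawler itself interprets these signals. The class and function names are made up for illustration. It looks for `noindex` (or `none`, which implies `noindex`) in a `<meta name="robots">` tag and in an `X-Robots-Tag` response header, and it ignores user-agent-scoped values such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def header_directives(headers):
    """Collects directives from an X-Robots-Tag header (name is case-insensitive)."""
    directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            for token in value.split(","):
                token = token.strip().lower()
                if token:
                    directives.add(token)
    return directives


def is_noindex(html, headers):
    """True if the meta tag or the header asks for the page not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives | header_directives(headers)
    return "noindex" in found or "none" in found


# A page that switched from index to noindex via its meta tag.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
flagged = is_noindex(page, {})
```

The same `is_noindex` call covers the header case by passing the response headers as a dict, for example `{"X-Robots-Tag": "noindex"}`.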
I can see how you might disagree with the interpretation, but our crawler has been intentionally built in a way that biases towards keeping data searchable, as opposed to only being able to search whatever still matches the current crawl configs. For example, if your crawl depth is 2, and a page's link depth goes from 2 to 3, that page remains in the index and is not dropped unless manually deleted.
If I have a look at the Google developer docs
It is not a goal of ours to maintain behavior parity with Google.
Could you please re-check whether it is possible to implement that feature?
If you have a support relationship with Elastic, I suggest you work with your support representative to file an Enhancement Request. This is typically how we capture and prioritize ideas from the community.
Does the crawler support X-Robots-Tag: noindex?
Thanks for your explanations. I opened a support request for the feature request, as you suggested.
I may have found that this behavior (or a very similar one) was fixed for the App Search Web Crawler in 8.3.0. There is a knowledge base entry that describes the switch from index to noindex. Elastic Support Hub
Maybe you have time to look at the knowledge base article. Maybe it will help to change the behavior of the Elastic Web Crawler or to offer both variants.
@sebastianboelling I owe you an apology. From that Support Hub page, I was able to track down where this behavior was changed in the App Search Crawler, and from there was able to find tests that imply that this behavior should actually work as you expected in the Elastic Crawler. That is my mistake/misunderstanding, and I'm sorry for pushing against this bug report earlier.
Can you confirm that you've run a full crawl on this index, and not just a "reapply crawl rules" crawl since adding the noindex meta tag?
No worries. We can investigate a bit deeper. But what is the expected behavior for the Elastic Web Crawler from your point of view? In which cases are pages deleted?
case 1: page with index meta tag is in index -> page switches to noindex -> full re-crawl -> page deleted from index: yes vs. no
case 2: page with index meta tag is in index -> page switches to noindex -> partial crawl -> page deleted from index: yes vs. no
case 3: page with index meta tag is in index -> page is deleted -> HTTP 404 -> full re-crawl -> page deleted from index: yes vs. no
case 4: page with index meta tag is in index -> page is deleted -> HTTP 404 -> partial crawl -> page deleted from index: yes vs. no
Other cases for deleting?
We just need to understand the behavior, and can then investigate and decide how to delete or force deletion of pages from our indices.