Does the Elastic Web Crawler support the noindex and nofollow directives?

Hi all,

Does the Elastic Web Crawler support the noindex and nofollow directives? I've found this feature only in the App Search Web Crawler reference (Web crawler reference | App Search documentation [8.10] | Elastic) and not in the Elastic Web Crawler documentation.

Best regards

Sebastian

Yes, the noindex and nofollow tags are also supported in the Elastic Web Crawler. These are documented here: Optimizing web content for the web crawler | Enterprise Search documentation [8.10] | Elastic

Hi @Sean_Story,

thanks. We found out that noindex and nofollow work fine. We've also found out that the Crawler does not delete a page from the index if an already indexed page changes from INDEX to NOINDEX.

How can we force or achieve the deletion of a page in that case?

Best regards

Sebastian

We've also found out that the Crawler does not delete a page from the index if an already indexed page changes from INDEX to NOINDEX.

That makes sense. The Crawler only deletes pages that result in 404s on re-crawl. This is to ensure that legacy documents that simply drop off a link tree don't drop out of your search capabilities.

You can manually delete these documents from the index, and they will not be picked up again, due to the NOINDEX.
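
For anyone following along, here is a minimal sketch of what such a manual deletion could look like using Elasticsearch's `_delete_by_query` API. It assumes that the crawler stores each page's address in an exact-match `url` field; the host, index name, page URL, and API key below are hypothetical placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_delete_request(es_url, index, page_url, api_key):
    """Build (but do not send) a _delete_by_query request for one crawled page.

    Assumption: crawled documents carry the page address in an exact-match
    `url` field. Check your index mapping before relying on this.
    """
    body = {"query": {"term": {"url": page_url}}}
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}/_delete_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey {api_key}",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
req = build_delete_request(
    "https://localhost:9200",
    "search-my-site",
    "https://example.com/old-page",
    "<api-key>",
)
# To actually run the deletion: urllib.request.urlopen(req)
```

Running the query as a search first is a cheap way to confirm that it matches only the documents you intend to remove.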

Hi @Sean_Story,

from our perspective this might be the wrong interpretation of the NOINDEX specification and meaning.

If a page says NOINDEX, it should not be indexed, and it should be removed from the index if it is already there. Looking at the Google developer docs, that is how they handle it:

Block Search Indexing with noindex | Google Search Central | Documentation | Google for Developers

We have to crawl your page in order to see <meta> tags and HTTP headers. If a page is still appearing in results, it's probably because we haven't crawled the page since you added the noindex rule.

I think we should not ignore NOINDEX if a webmaster or editor/content provider decides to set this tag.

Could you please recheck whether it is possible to implement that feature?

By the way: does the crawler support X-Robots-Tag: noindex?

Regards

Sebastian

Hi @sebastianboelling ,

I can see how you might disagree with the interpretation, but our crawler has been intentionally built in a way that biases towards keeping data searchable, as opposed to only being able to search whatever still matches the current crawl configs. For example, if your crawl depth is 2, and a page's link depth goes from 2 to 3, that page remains in the index and is not dropped unless manually deleted.

Looking at the Google developer docs

It is not a goal of ours to maintain behavior parity with Google.

Could you please recheck whether it is possible to implement that feature?

If you have a support relationship with Elastic, I suggest you work with your support representative to file an Enhancement Request. This is typically how we capture and prioritize ideas from the community.

Does the crawler support X-Robots-Tag: noindex?

No. The robots meta tags that we support are documented here: Optimizing web content for the web crawler | Enterprise Search documentation [8.11] | Elastic
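
Since only the meta tags are honored, it can help to check whether a page signals noindex in its HTML or only via the response header. Below is a rough sketch using Python's standard `html.parser`. The function names are made up for illustration, and the parsing is deliberately simple; it is not how the crawler itself evaluates pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def meta_robots_directives(html):
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def header_only_noindex(html, headers):
    """True if noindex appears in X-Robots-Tag but not in a robots meta tag.

    Per the answer above, such a page would still be indexed by the crawler.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    header_tokens = [t.strip().lower() for t in lowered.get("x-robots-tag", "").split(",")]
    return "noindex" in header_tokens and "noindex" not in meta_robots_directives(html)


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(meta_robots_directives(page))
print(header_only_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))
```

Pages flagged by `header_only_noindex` would need the equivalent robots meta tag added to be skipped by the crawler.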

Hi @Sean_Story ,

thanks for your explanations. I opened a support request for a feature request, as you suggested.

I may have found that this behavior (or a very similar one) was fixed for the App Search Web Crawler in 8.3.0. There is a knowledge base entry which describes the switch from index to noindex: Elastic Support Hub

Maybe you have time to look at the knowledge base article. Maybe it will help to change the behavior of the Elastic Web Crawler or to offer both variants.

Regards

Sebastian

@sebastianboelling I owe you an apology. From that Support Hub page, I was able to track down where this behavior was changed in the App Search Crawler, and from there was able to find tests that imply that this behavior should actually work as you expected in the Elastic Crawler. That is my mistake/misunderstanding, and I'm sorry for pushing against this bug report earlier.

Can you confirm that you've run a full crawl on this index, and not just a "reapply crawl rules" crawl since adding the noindex meta tag?

Hi @Sean_Story,

no worries. We can investigate a bit deeper. But what is the expected behavior for the Elastic Web Crawler from your point of view? In which cases are pages deleted?

case 1: page with index meta tag is in index -> page switches to noindex -> full re-crawl -> page deleted from index: yes vs. no

case 2: page with index meta tag is in index -> page switches to noindex -> partial crawl -> page deleted from index: yes vs. no

case 3: page with index meta tag is in index -> page is deleted -> HTTP 404 -> full re-crawl -> page deleted from index: yes vs. no

case 4: page with index meta tag is in index -> page is deleted -> HTTP 404 -> partial crawl -> page deleted from index: yes vs. no

Other cases for deleting?

We just need to understand this, and can then investigate and decide how to delete or force deletion of pages from our indices.

Best regards

Sebastian

case 1: page with index meta tag is in index -> page switches to noindex -> full re-crawl -> page deleted from index

Yes. Or at least, it should, based on what I'm seeing in the code. I believe you're reporting that this is not working?

case 2: page with index meta tag is in index -> page switches to noindex -> partial crawl -> page deleted from index

No. The purge phase is not run on partial crawls.

case 3: page with index meta tag is in index -> page is deleted -> HTTP 404 -> full re-crawl -> page deleted from index

Yes.

case 4: page with index meta tag is in index -> page is deleted -> HTTP 404 -> partial crawl -> page deleted from index

No. The purge phase is not run on partial crawls.

Other cases for deleting?

I don't believe so.
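
To summarize the four cases above in one place, here is a tiny sketch of the expected behavior as described in this thread. The function name and the string values are made up for illustration; this is not crawler code.

```python
def removed_on_crawl(crawl_type, page_state):
    """Expected purge behavior per the answers in this thread.

    crawl_type: "full" or "partial"
    page_state: "noindex" (page switched to noindex) or "404" (page is gone)
    """
    if crawl_type not in ("full", "partial"):
        raise ValueError(f"unknown crawl type: {crawl_type!r}")
    if page_state not in ("noindex", "404"):
        raise ValueError(f"unknown page state: {page_state!r}")
    # The purge phase only runs on full crawls, for both page states.
    return crawl_type == "full"


for state in ("noindex", "404"):
    for crawl in ("full", "partial"):
        print(f"{state:8} {crawl:8} deleted: {removed_on_crawl(crawl, state)}")
```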
