We are looking at how we can delete crawled "pages" (documents) from an index when they become outdated, change their robots meta tag from index to noindex, or match some other metadata condition.
We experimented with ingest pipelines, but they have no processors for deleting documents from an index (of course, they are about "ingest"). Painless also has no functions for deleting documents based on their metadata, e.g. when the publishing date is older than a specified date.
What could be an approach to run a scheduled "cleaning" process for documents in an index that are not deleted by the web crawler because no 4xx or 5xx is returned?
Of course we could run delete_by_query manually, but we are looking for a bit more automation or integration into pipelines.
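One way to get that automation without a dedicated processor is to schedule a small script that posts a `_delete_by_query` to the crawler index. This is only a sketch: the index name `search-my-crawler` and the date field `last_crawled_at` are assumptions here, not crawler defaults, so adjust them to your actual mapping.

```python
import json

# Assumed names -- adjust to your crawler index and mapping (hypothetical).
INDEX = "search-my-crawler"
TIMESTAMP_FIELD = "last_crawled_at"

def build_cleanup_query(max_age_days: int) -> dict:
    """Build a _delete_by_query body matching documents whose
    timestamp field is older than max_age_days days (ES date math)."""
    return {
        "query": {
            "range": {
                TIMESTAMP_FIELD: {"lt": f"now-{max_age_days}d/d"}
            }
        }
    }

if __name__ == "__main__":
    body = build_cleanup_query(30)
    print(json.dumps(body, indent=2))
    # To actually run it against a cluster (needs the `requests` package):
    # requests.post(f"http://localhost:9200/{INDEX}/_delete_by_query",
    #               json=body, headers={"Content-Type": "application/json"})
```

Scheduled via cron (or a Watcher/ILM-adjacent job), this would sweep out documents the crawler never revisits because the pages still return 200.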
There is no dedicated "cleaning" process. However, the Elastic Web Crawler deletes pages that return response codes >= 300 && <= 599 during a full crawl.
The crawler should also remove pages from the index that have a "noindex" tag, but I need to double-check this. I'll keep you posted.
Does your configured crawler delete pages during full crawls? If not, could you please provide some crawler logs that include the pages that were not deleted?
Sorry, my first sentence was a bit weird. What I meant was:
- A page was crawled and indexed with
  `<meta name="robots" content="index">`
- The page is re-crawled with
  `<meta name="robots" content="noindex">`
Will the page be deleted from the index?
Did you have the opportunity to check it out?
@sebastianboelling if a full crawl is run, pages that change to
noindex should be deleted. See our discussion on another forum post: Does Elastic Web Crawler supports noindex and nofollow directive - #9 by sebastianboelling
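While double-checking the crawler's behavior, one can also verify a page's current directive independently. Below is a minimal stdlib sketch (not part of the crawler, purely illustrative) that detects a noindex directive in fetched HTML; combined with the scheduled delete_by_query idea above, it could drive a cleanup decision per URL.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content values of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if (a.get("name") or "").lower() == "robots":
                self.directives.append((a.get("content") or "").lower())

def is_noindex(html: str) -> bool:
    """True if any robots meta tag in the document contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.directives)

if __name__ == "__main__":
    print(is_noindex('<meta name="robots" content="noindex, nofollow">'))  # True
    print(is_noindex('<meta name="robots" content="index, follow">'))      # False
```

Note this only covers the HTML meta tag; a noindex can also arrive via the `X-Robots-Tag` HTTP header, which a real cleanup job would need to check as well.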