We are looking at how we can delete crawled "pages" (documents) from an index when they become outdated, change their robots meta tag from index to noindex, or match some other metadata condition.
We experimented with ingest pipelines, but they have no processors for deleting documents from an index (of course, they are about "ingest"). Painless also has no functions for deleting documents based on their metadata, e.g. when the publishing date is older than a specified date.
What could be an approach to run a scheduled "cleaning" process for documents in an index that are not deleted by the web crawler because no 4xx or 5xx is returned?
Of course we could run delete_by_query manually, but we are looking for a bit more automation or integration into pipelines.
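One way to get that automation without a dedicated processor is to schedule a small script that posts a `_delete_by_query` to the crawler index. This is only a sketch: the index name `search-my-crawler` and the date field `last_crawled_at` are assumptions here, not crawler defaults, so adjust them to your actual mapping.

```python
import json

# Assumed names -- adjust to your crawler index and mapping (hypothetical).
INDEX = "search-my-crawler"
TIMESTAMP_FIELD = "last_crawled_at"

def build_cleanup_query(max_age_days: int) -> dict:
    """Build a _delete_by_query body matching documents whose
    timestamp field is older than max_age_days days (ES date math)."""
    return {
        "query": {
            "range": {
                TIMESTAMP_FIELD: {"lt": f"now-{max_age_days}d/d"}
            }
        }
    }

if __name__ == "__main__":
    body = build_cleanup_query(30)
    print(json.dumps(body, indent=2))
    # To actually run it against a cluster (needs the `requests` package):
    # requests.post(f"http://localhost:9200/{INDEX}/_delete_by_query",
    #               json=body, headers={"Content-Type": "application/json"})
```

Scheduled via cron (or a Watcher/ILM-adjacent job), this would sweep out documents the crawler never revisits because the pages still return 200.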
There is no dedicated "cleaning" process. However, the Elastic Web Crawler deletes pages that return response codes >= 300 && <= 599 during a full crawl.
The crawler should also remove pages from the index that have a "noindex" tag, but I need to double-check this. I'll keep you posted.
Does your configured crawler delete pages during full crawls? If not, could you please provide some crawler logs that include the pages that were not deleted?
Sorry, my first sentence was a bit weird. What I meant was:
- A page was crawled and indexed with
  `<meta name="robots" content="index">`
- The page is re-crawled with
  `<meta name="robots" content="noindex">`
Will the page be deleted from the index?
Did you have the opportunity to check it out?
@sebastianboelling if a full crawl is run, pages that change to
noindex should be deleted. See our discussion on another forum post: Does Elastic Web Crawler supports noindex and nofollow directive - #9 by sebastianboelling
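While double-checking the crawler's behavior, one can also verify a page's current directive independently. Below is a minimal stdlib sketch (not part of the crawler, purely illustrative) that detects a noindex directive in fetched HTML; combined with the scheduled delete_by_query idea above, it could drive a cleanup decision per URL.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content values of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if (a.get("name") or "").lower() == "robots":
                self.directives.append((a.get("content") or "").lower())

def is_noindex(html: str) -> bool:
    """True if any robots meta tag in the document contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.directives)

if __name__ == "__main__":
    print(is_noindex('<meta name="robots" content="noindex, nofollow">'))  # True
    print(is_noindex('<meta name="robots" content="index, follow">'))      # False
```

Note this only covers the HTML meta tag; a noindex can also arrive via the `X-Robots-Tag` HTTP header, which a real cleanup job would need to check as well.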