We are looking into how we can delete crawled "pages" (documents) from an index when they are outdated, when their robots meta directive changes from index to noindex, or when they match some other metadata.
We experimented with ingest pipelines, but they do not have processors for deleting documents from an index (of course, they are about "ingest"). Painless also has no functions for deleting documents based on their metadata, e.g. when the publishing date is older than a specified date.
What could be an approach to run a scheduled "cleaning" process for documents in an index that are not deleted by the web crawler because no 4xx or 5xx is returned?
Of course we could run delete_by_query manually, but we are looking for a bit more automation or integration into pipelines. A sketch of what we mean is below.
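To make the intent concrete, here is a minimal sketch of the kind of cleanup we would script today and run on a schedule (e.g. from cron). The index name and the field names `meta.robots` and `publish_date` are assumptions for illustration and would need to match the actual crawler mapping; the client is elasticsearch-py 8.x.

```python
# Minimal sketch of a scheduled cleanup via delete_by_query.
# Assumptions: the crawler stores the robots directive in "meta.robots"
# and the publish date in "publish_date" -- adjust to your mapping.
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

INDEX = "search-crawled-pages"  # hypothetical index name

# Delete documents whose robots meta switched to noindex, or whose
# publish date is older than 90 days.
resp = es.delete_by_query(
    index=INDEX,
    query={
        "bool": {
            "should": [
                {"term": {"meta.robots": "noindex"}},
                {"range": {"publish_date": {"lt": "now-90d/d"}}},
            ],
            "minimum_should_match": 1,
        }
    },
    conflicts="proceed",  # skip docs the crawler updates mid-run
)
print(f"Deleted {resp['deleted']} outdated documents")
```

Something like this could be triggered by cron, or possibly by a Watcher with a webhook action calling the `_delete_by_query` endpoint, but we are wondering whether there is a more built-in way.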