What happens to existing documents during a crawl?

For the current version of Enterprise Search (8.17), I can't find any documentation on how already-indexed documents are managed in a web crawler index, so here are my questions about that.

So far I've been looking at the logs to answer my own questions, but I'd like to know whether the behavior I observed is intended.

  • If crawl rules are updated so that some already-indexed documents are now excluded from new crawls, are those documents automatically deleted when the domain is crawled again? (In my tests, the existing document was not automatically deleted, even though the logs showed the document was denied from being indexed in the new crawl.)

  • If a domain is crawled again and the content of some pages hasn't changed, are those pages indexed again, or does the crawler know to skip a page whose content hasn't changed? (In my tests, all pages seemed to be indexed again even when the content hadn't changed. One way to test this is to run two crawls back-to-back and check whether the logs of the second crawl indicate that a document wasn't indexed again; an index-side check is sketched below.)
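For reference, here's the kind of index-side check I mean: a minimal sketch that snapshots each URL's last_crawled_at value before and after a crawl and diffs the two. The connection details, the index name search-my-site, and the url/last_crawled_at fields are assumptions on my part; adjust them for your deployment and mapping.

```python
from elasticsearch import Elasticsearch

# Placeholder connection details and index name; adjust for your deployment.
es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))
INDEX = "search-my-site"

def snapshot(index: str) -> dict:
    """Map each crawled URL to its last_crawled_at timestamp."""
    resp = es.search(
        index=index,
        query={"match_all": {}},
        size=1000,  # fine for small sites; page with search_after for larger ones
        source=["url", "last_crawled_at"],
    )
    return {
        hit["_source"]["url"]: hit["_source"].get("last_crawled_at")
        for hit in resp["hits"]["hits"]
    }

before = snapshot(INDEX)
input("Run the crawl, then press Enter... ")
after = snapshot(INDEX)

# URLs that disappeared (purged) and URLs whose timestamp changed (re-indexed).
print("removed:   ", sorted(set(before) - set(after)))
print("re-indexed:", sorted(u for u in before.keys() & after.keys() if before[u] != after[u]))
```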

Hi @jkyeusun

If crawl rules are updated so that some already-indexed documents are now excluded from new crawls, are those documents automatically deleted when the domain is crawled again? (In my tests, the existing document was not automatically deleted, even though the logs showed the document was denied from being indexed in the new crawl.)

If the crawl rule is configured correctly, the document should be deleted. Make sure you're running a full crawl and not a partial crawl: partial crawls don't purge documents, and they are started by selecting Crawl -> Crawl with custom settings.
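If you want to verify the purge without combing through the logs, a quick count against the content index works. This is only a sketch: the index name, credentials, and the assumption that url can be matched with a term query (depending on your mapping you may need url.keyword or a match query) are all placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))

# Hypothetical URL that the updated crawl rules now exclude.
EXCLUDED_URL = "https://example.com/now-excluded-page"

# Depending on your mapping, you may need url.keyword or a match query instead.
resp = es.count(index="search-my-site", query={"term": {"url": EXCLUDED_URL}})

# After a *full* crawl completes, this should print 0 if the document was purged.
print("matching documents:", resp["count"])
```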

If a domain is crawled again and the content of some pages hasn't changed, are those pages indexed again, or does the crawler know to skip a page whose content hasn't changed? (In my tests, all pages seemed to be indexed again even when the content hadn't changed. One way to test this is to run two crawls back-to-back and check whether the logs of the second crawl indicate that a document wasn't indexed again.)

The pages are indexed again. The crawler doesn't check whether a page has changed; it simply re-indexes everything it finds.
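One way to see this from the index side is to watch a document's _version across two back-to-back crawls: if the crawler rewrote an unchanged page, the version still increments. Again just a sketch; the index name, credentials, URL, and field assumptions are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))
INDEX = "search-my-site"                         # placeholder index name
PAGE_URL = "https://example.com/unchanged-page"  # placeholder page URL

def page_version() -> int | None:
    """Return the _version of the document for PAGE_URL, if present."""
    resp = es.search(
        index=INDEX,
        query={"term": {"url": PAGE_URL}},  # may need url.keyword, depending on mapping
        version=True,                       # ask Elasticsearch to return _version per hit
        size=1,
    )
    hits = resp["hits"]["hits"]
    return hits[0]["_version"] if hits else None

v1 = page_version()
input("Run a second crawl, then press Enter... ")
v2 = page_version()

# If the crawler re-indexed the unchanged page, v2 will be greater than v1.
print(f"version before: {v1}, after: {v2}")
```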