What happens to existing documents during a crawl?

For the current version of Enterprise Search (8.17), I can't find any documentation on how already-indexed documents are managed in a web crawler index, so here are my questions about that.

So far I've been looking at the logs to answer my own questions, but I'd like to know whether the behavior I observed is intended.

  • If crawl rules are updated so that some already-indexed documents are now excluded from new crawls, are those documents automatically deleted when the domain is crawled again? (In my tests, the existing document was not automatically deleted, even though the logs showed the document was denied from being indexed in the new crawl.)

  • If a domain is crawled again and the content of some pages hasn't changed, are those pages indexed again, or does the crawler know to skip a page whose content hasn't changed? (In my tests, all pages seemed to be indexed again even when the content hadn't changed. One way to test this is to run two crawls back-to-back and check whether the logs of the second crawl indicate that a document wasn't indexed again; an index-side check is sketched below.)
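For reference, here's the kind of index-side check I mean: a minimal sketch that snapshots each URL's last_crawled_at value before and after a crawl and diffs the two. The connection details, the index name search-my-site, and the url/last_crawled_at fields are assumptions on my part; adjust them for your deployment and mapping.

```python
from elasticsearch import Elasticsearch

# Placeholder connection details and index name; adjust for your deployment.
es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))
INDEX = "search-my-site"

def snapshot(index: str) -> dict:
    """Map each crawled URL to its last_crawled_at timestamp."""
    resp = es.search(
        index=index,
        query={"match_all": {}},
        size=1000,  # fine for small sites; page with search_after for larger ones
        source=["url", "last_crawled_at"],
    )
    return {
        hit["_source"]["url"]: hit["_source"].get("last_crawled_at")
        for hit in resp["hits"]["hits"]
    }

before = snapshot(INDEX)
input("Run the crawl, then press Enter... ")
after = snapshot(INDEX)

# URLs that disappeared (purged) and URLs whose timestamp changed (re-indexed).
print("removed:   ", sorted(set(before) - set(after)))
print("re-indexed:", sorted(u for u in before.keys() & after.keys() if before[u] != after[u]))
```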

Hi @jkyeusun

If crawl rules are updated so that some already-indexed documents are now excluded from new crawls, are those documents automatically deleted when the domain is crawled again? (In my tests, the existing document was not automatically deleted, even though the logs showed the document was denied from being indexed in the new crawl.)

If the crawl rule is configured correctly, the document should be deleted. Make sure you're running a full crawl and not a partial crawl: partial crawls don't purge documents, and they are started by selecting Crawl -> Crawl with custom settings.
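If you want to verify the purge without combing through the logs, a quick count against the content index works. This is only a sketch: the index name, credentials, and the assumption that url can be matched with a term query (depending on your mapping you may need url.keyword or a match query) are all placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))

# Hypothetical URL that the updated crawl rules now exclude.
EXCLUDED_URL = "https://example.com/now-excluded-page"

# Depending on your mapping, you may need url.keyword or a match query instead.
resp = es.count(index="search-my-site", query={"term": {"url": EXCLUDED_URL}})

# After a *full* crawl completes, this should print 0 if the document was purged.
print("matching documents:", resp["count"])
```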

If a domain is crawled again and the content of some pages hasn't changed, are those pages indexed again, or does the crawler know to skip a page whose content hasn't changed? (In my tests, all pages seemed to be indexed again even when the content hadn't changed. One way to test this is to run two crawls back-to-back and check whether the logs of the second crawl indicate that a document wasn't indexed again.)

The pages are indexed again. The crawler doesn't check whether a page has changed; it simply re-indexes everything it finds.
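One way to see this from the index side is to watch a document's _version across two back-to-back crawls: if the crawler rewrote an unchanged page, the version still increments. Again just a sketch; the index name, credentials, URL, and field assumptions are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))
INDEX = "search-my-site"                         # placeholder index name
PAGE_URL = "https://example.com/unchanged-page"  # placeholder page URL

def page_version() -> int | None:
    """Return the _version of the document for PAGE_URL, if present."""
    resp = es.search(
        index=INDEX,
        query={"term": {"url": PAGE_URL}},  # may need url.keyword, depending on mapping
        version=True,                       # ask Elasticsearch to return _version per hit
        size=1,
    )
    hits = resp["hits"]["hits"]
    return hits[0]["_version"] if hits else None

v1 = page_version()
input("Run a second crawl, then press Enter... ")
v2 = page_version()

# If the crawler re-indexed the unchanged page, v2 will be greater than v1.
print(f"version before: {v1}, after: {v2}")
```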