Is there a way to force a full recrawl of all documents in the Elastic Cloud UI? We have added some mapped fields, but they are not applied to documents that were indexed before the fields were added.
I read that a reindex could fix this, but I assume this wouldn't be possible, as the reindex would create another index with no crawler attached to it. Similarly, I assume that if I created a new index with a crawler in the Elastic Cloud UI and applied the mappings, I would still have to manually set up the crawler and extraction rules again.
Is there an easy way to do this? Can I just empty the index out somehow and force a fresh recrawl of all documents?
A full recrawl should happen just by using that "crawl all domains in this index" button. Is this not the behavior you're seeing?
This can be complicated if you have URLs that are no longer discovered during a crawl but do still exist on your site. In that case, you might not see those documents changed, but you won't see them deleted either.
Reapplying crawl rules will not execute a new crawl, though. That just runs through what you've already ingested and drops documents that no longer match your rules. Could that be what you were trying to do?
I read that a reindex could fix this, however I assume this wouldn't be possible as the reindex would create another index but there would not be a crawler attached to it.
Correct, it's difficult to do this today. You could do it in a multi-step process like:
1. Your data is in index-a.
2. Reindex your data to index-b (no crawler attached).
3. Delete index-a (without deleting the crawler! just DELETE index-a).
4. Reindex your data from index-b back to index-a.
5. The crawler should keep working.
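In Kibana Dev Tools console syntax, that dance would look roughly like the sketch below. index-a and index-b are just placeholder names, and you'd most likely want to recreate index-a with your updated mappings before step 3, since otherwise the reindex back will auto-create it with dynamic mappings:

# 1. copy everything out of the crawler's index into a scratch index
POST _reindex
{
  "source": { "index": "index-a" },
  "dest": { "index": "index-b" }
}

# 2. drop the index itself, leaving the crawler configuration in place
DELETE index-a

# (recreate index-a here with the mappings you want, e.g. PUT index-a { ... })

# 3. copy the data back into the index the crawler is attached to
POST _reindex
{
  "source": { "index": "index-b" },
  "dest": { "index": "index-a" }
}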
But this feels more convoluted than just deleting your docs with a match-all delete-by-query, so I wouldn't recommend this approach for your situation.
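For reference, emptying the index in place looks something like this (again console syntax, with index-a standing in for your crawler's index). The mappings and the attached crawler are left untouched, and the next full crawl repopulates the documents:

# delete every document, keep the index, mappings, and crawler
POST index-a/_delete_by_query
{
  "query": { "match_all": {} }
}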
It does seem to be iterating over all URLs found in the sitemap; however, it isn't re-downloading files that already exist in the index (presumably because the page hasn't changed recently). Is there a way to force a re-download?
This might be part of the issue. If a URL is not discovered in future full crawls, shouldn't it be assumed to be removed and deleted? If not, is there any way to achieve this behavior?
however it isn't re-downloading files that already exist in the index (presumably because the page hasn't changed recently)
When running a crawl, we do hash a subset of fields in order to detect "duplicate" pages (the same logical page, but perhaps with different URLs). But I'm pretty sure that hash metadata is unique to a given crawl, and that we don't check on subsequent crawls whether the page has been updated since. It should be re-downloading every URL it discovers during a full crawl.
You could look in the crawl event logs to try to get better insight into what's happening. Or, if the site is public, tell me what domain you're crawling and I can take a look and see if I can reproduce it.
If a URL is not discovered in future full crawls, shouldn't it be assumed to be removed and deleted?
We don't think so. Our policy has been to err on the side of data retention: having more data than you need is a better problem than losing data that you need. Some customers are crawling sites where old content gets buried deeper and deeper. They still want their historical archives searchable, but they don't want their crawls to need a depth of 1000 to be able to pull legacy pages that haven't changed anyway.
The way the crawler works is to start with the sitemaps/entrypoints and spider down to the configured depth. Then it takes any URLs in the index that weren't discovered during that spidering and checks whether the site returns a 404 for them. If it determines a page still exists on the site, it'll be left in the index.
If that's not the behavior you want, it's easy to write a delete-by-query request against your index to remove any documents that have a last_crawled_at date older than the most recent full crawl.
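A rough sketch of that request, assuming console syntax, index-a as the placeholder index name, and an example cutoff timestamp that you'd replace with the start time of your latest full crawl:

# remove anything the most recent full crawl didn't touch
POST index-a/_delete_by_query
{
  "query": {
    "range": {
      "last_crawled_at": { "lt": "2024-01-01T00:00:00Z" }
    }
  }
}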