Strategy for removing invalid documents

Timotius_Pamungkas · November 22, 2018, 1:10am

I use Elasticsearch to index a filesystem and provide a search engine for all of its PDF documents (around 3 million of them).
My current workflow:

I created a small Java program that runs every day at midnight and crawls the filesystem for all PDF files.
At the beginning of each run, the Java program deletes all documents under my-index.
For each PDF file found, I save a reference to it in Elasticsearch, under my-index. The JSON is simple and contains only path_to_file, filename, last_modified_date, and size_kb.
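
For illustration, the per-file reference described above could be built roughly like this. This is only a sketch: the field names come from the post, but the class name `PdfReference`, the ISO-8601 date format, and the rounding of size_kb down to whole kilobytes are assumptions. Sending the map to Elasticsearch is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class PdfReference {
    // Builds the field map for one PDF file. Field names follow the post;
    // the date format (ISO-8601 instant) and integer KB rounding are assumptions.
    static Map<String, Object> toDocument(Path pdf) throws IOException {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("path_to_file", pdf.toAbsolutePath().toString());
        doc.put("filename", pdf.getFileName().toString());
        doc.put("last_modified_date", Files.getLastModifiedTime(pdf).toInstant().toString());
        doc.put("size_kb", Files.size(pdf) / 1024); // integer division, rounds down
        return doc;
    }
}
```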

The data keeps changing. Sometimes a PDF file is renamed or deleted.

The drawback of my approach: the crawler takes almost three hours to complete, so during that interval some PDFs cannot be found in ES. I'd like to keep the existing documents and delete only those whose PDF no longer exists on the filesystem (because it was renamed or deleted).

This is my strategy. Is it good practice? Please advise.

Create a new index, my-another-index.
Create a new Java program that runs at midnight. This one does not delete data from my-another-index, but keeps my-index intact.
Crawl the filesystem for PDF files and put references to them into my-another-index.
By the end of the crawl, my-another-index will have up-to-date contents.
Handle deleted PDFs: compare my-index with my-another-index and remove from my-index all documents that do not exist in my-another-index.
Handle new PDFs: modify the original crawler so that it no longer deletes documents from my-index and only crawls for new files.
Document IDs are the same in both indexes: the hashCode of the file path.
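
A small sketch of the ID scheme in the last step, with one caveat worth checking. `String.hashCode()` is a 32-bit value, so two different paths can produce the same ID ("Aa" and "BB" are a well-known colliding pair), and with around 3 million files some collisions are plausible. The `sha256Id` method below is a swapped-in alternative, not part of the original plan, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocumentIds {
    // Mirrors the post: the ID is the hashCode of the file path, as a string.
    static String hashCodeId(String pathToFile) {
        return Integer.toString(pathToFile.hashCode());
    }

    // Alternative (assumption, not from the post): hex-encoded SHA-256 of the path.
    static String sha256Id(String pathToFile) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(pathToFile.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a standard algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, `hashCodeId("Aa")` and `hashCodeId("BB")` return the same string, while the two `sha256Id` values differ.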

My questions are:

Is this the correct way?
What is an efficient way to compare two indexes, basically computing my-index minus my-another-index? The result is the set of documents to delete from my-index.
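
To make the "my-index minus my-another-index" question concrete, here is a minimal in-memory sketch of that subtraction. It assumes the document IDs of both indexes have already been read into sets (how they are fetched from Elasticsearch is left out), and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class IndexDiff {
    // Returns the IDs that are in oldIds (my-index) but not in freshIds
    // (my-another-index). These are the candidates for deletion from my-index.
    static Set<String> staleIds(Set<String> oldIds, Set<String> freshIds) {
        Set<String> stale = new HashSet<>(oldIds); // copy, so the inputs stay untouched
        stale.removeAll(freshIds);
        return stale;
    }
}
```

A few million short ID strings should normally fit in memory, so the cost here is mostly in reading the IDs out of the two indexes rather than in the subtraction itself.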

Thanks

dadoonet · November 22, 2018, 2:10am

This looks similar to the FSCrawler project, for which I'm using dates, but that does not always work very well.

I'm now considering other implementations, such as using an rsync method (https://github.com/dadoonet/fscrawler/issues/377) or a WatchService implementation (https://github.com/dadoonet/fscrawler/issues/399).
Or using something similar to what Filebeat is doing, and maybe rewriting some crawler agents in Golang...
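
To give an idea of what the WatchService option looks like, here is a rough sketch based on the standard `java.nio.file.WatchService` API. It is not FSCrawler code; the class name `PdfWatcher` and both method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PdfWatcher {
    // Registers one directory for create, delete, and modify events.
    static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        return watcher;
    }

    // Waits up to timeoutSeconds for one batch of events and returns
    // "KIND filename" strings for PDF files only.
    static List<String> pollPdfEvents(WatchService watcher, long timeoutSeconds)
            throws InterruptedException {
        List<String> events = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return events; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                continue; // events were lost; a real crawler would need a rescan
            }
            Path name = (Path) event.context(); // path relative to the watched directory
            if (name.toString().toLowerCase().endsWith(".pdf")) {
                events.add(event.kind().name() + " " + name);
            }
        }
        key.reset(); // required to keep receiving events for this directory
        return events;
    }
}
```

Note that a registration covers a single directory, not its subdirectories, so a tree has to be registered directory by directory. There is also no rename event kind: a rename shows up as a delete plus a create.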

My 2 cents

