I use Elasticsearch to index a filesystem and provide a search engine for all of its PDF documents (around 3 million of them).
My current workflow:
- I wrote a small Java program that runs every day at midnight and crawls the filesystem for all PDF files.
- At the beginning of each run, the Java program deletes all documents under my-index.
- For each PDF file found, I save a reference to it in Elasticsearch, under my-index. The JSON is simple and only contains path_to_file, filename, last_modified_date, and size_kb.
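A minimal sketch of the crawl step described above, using only the Java standard library. The class and method names (PdfCrawlSketch, crawl, toDocument) are hypothetical, the date format is an assumption, and the actual call that sends each document to Elasticsearch is left out:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical sketch: collects the four fields stored per PDF under my-index.
class PdfCrawlSketch {

    // Builds one document for a single file; field names follow the list above.
    static Map<String, Object> toDocument(Path file) throws IOException {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("path_to_file", file.toAbsolutePath().toString());
        doc.put("filename", file.getFileName().toString());
        // Assumption: the date is stored as an ISO-8601 instant string.
        doc.put("last_modified_date", Files.getLastModifiedTime(file).toInstant().toString());
        doc.put("size_kb", Files.size(file) / 1024);
        return doc;
    }

    // Walks the tree under root and returns one document per regular file ending in .pdf.
    static List<Map<String, Object>> crawl(Path root) throws IOException {
        List<Map<String, Object>> docs = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)
                        && p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    docs.add(toDocument(p));
                }
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        crawl(root).forEach(System.out::println);
    }
}
```

In the real program each returned map would be indexed into my-index instead of printed.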
The data keeps changing: PDF files are sometimes renamed or deleted.
The drawback of my approach is that the crawler takes almost three hours to complete, so during that window some PDFs cannot be found in ES. I'd like to keep the existing documents and delete only those whose PDF no longer exists on the filesystem (because it was renamed or deleted).
This is my strategy. Is it good practice? Please advise.
- Create a new index, my-another-index.
- Create a new Java program that runs at midnight. This one does not delete data from my-another-index, but keeps my-index intact.
- Crawl the filesystem for PDF files and put their references into my-another-index.
- By the end of the crawl, my-another-index will have up-to-date contents.
- Handle deleted PDFs: compare my-index with my-another-index, and remove from my-index all documents that do not exist in my-another-index.
- Handle new PDFs: modify the original crawler so that it no longer deletes documents from my-index and only crawls for new files.
- Document IDs are the same in both indexes: the hashCode of the file path.
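A minimal sketch of the ID scheme in the last bullet, assuming the path is held as a String and its String.hashCode() is used as-is. The names DocIdSketch and docId are hypothetical. Note that String.hashCode() is a 32-bit value, so two different paths can produce the same ID:

```java
// Hypothetical sketch: derives the document ID used in both indexes from the file path.
class DocIdSketch {

    // The same path always yields the same ID, so both indexes agree on it.
    static String docId(String pathToFile) {
        return Integer.toString(pathToFile.hashCode());
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : "/data/example.pdf"; // example path only
        System.out.println(path + " -> " + docId(path));
    }
}
```

A renamed file gets a different path and, in general, a different ID, which is why the old document has to be removed separately.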
My questions are:
- Is this the correct approach?
- What is an efficient way to compare the two indexes, basically computing my-index minus my-another-index? Those are the documents to be deleted from my-index.
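To make the "subtraction" in the last question concrete, here is a minimal sketch of what I mean. It assumes the document IDs of both indexes have already been read into memory as sets of strings; how they are fetched from Elasticsearch is left out, and the names IndexDiffSketch and staleIds are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: computes my-index minus my-another-index on document IDs.
class IndexDiffSketch {

    // Returns the IDs present in oldIds (my-index) but absent from freshIds (my-another-index).
    static Set<String> staleIds(Set<String> oldIds, Set<String> freshIds) {
        Set<String> stale = new HashSet<>(oldIds); // copy so the input set is not modified
        stale.removeAll(freshIds);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> myIndex = new HashSet<>(Arrays.asList("1", "2", "3"));
        Set<String> myAnotherIndex = new HashSet<>(Arrays.asList("2", "3", "4"));
        // ID "1" exists only in my-index, so it is the one to delete there.
        System.out.println(staleIds(myIndex, myAnotherIndex));
    }
}
```

The IDs returned are the ones I would then delete from my-index.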