I have relatively big sets of documents that we load into Elasticsearch via the Java REST client. Every day at midnight we do a batch load of data (basically our full data set), and then throughout the day we load delta files, which are changes to the existing data set.
For the full data set we load it into an index and then switch alias from old index to a new one and it works fine and fast.
Now the problem is with the delta files. There are basically two fields that identify a document: ID and Version. I see several approaches for loading delta files:
- As we read each record from a file, check whether its version is greater than the version in Elastic, put it into a batch for update, and then execute the update. This works fine for small files, but large files take a very long time since they trigger hundreds of thousands of reads from Elastic.
- Do the same as #1 but read in batches - i.e. read 1000 records, request the ID and Version from Elastic for those 1000, compare against what we have from the file in memory, and then batch-commit to Elastic. This would probably work OK; we need to measure how much it improves performance.
- Read the whole set of IDs and Versions into memory and, as we read the file, compare against that in-memory copy. This works OK for large files, but is not ideal for small files, since we would load a large amount of IDs and Versions for little benefit (though we could keep two code paths, one for large and one for small delta files).
- Use Elasticsearch's version control with an externally defined version (`version_type=external`). That should work, but we had some issues implementing it: the error messages were not consistent, and we would need to parse the errors from the bulk insert to distinguish a benign "lower version" conflict from a real failure, and handle each accordingly.
- Load the whole delta file into Elastic and do some kind of join, if that is even possible (I'm not sure it's a viable option).
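To make option #2 concrete, here is a rough sketch of the compare step only (plain Java, no client calls; `DeltaRecord`, `VersionFilter`, and `newerThan` are names I made up for illustration). The `remoteVersions` map would come from a batched read of the 1000 IDs, e.g. a multi-get, and the surviving records would then go into a single bulk request:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical record type for one line of a delta file.
record DeltaRecord(String id, long version) {}

class VersionFilter {
    /**
     * Keep only the delta records whose version is strictly greater than the
     * version currently stored in Elasticsearch; a record whose ID is absent
     * from the remote map is new and is always kept.
     */
    static List<DeltaRecord> newerThan(Map<String, Long> remoteVersions,
                                       List<DeltaRecord> batch) {
        List<DeltaRecord> toIndex = new ArrayList<>();
        for (DeltaRecord r : batch) {
            Long remote = remoteVersions.get(r.id());
            if (remote == null || r.version() > remote) {
                toIndex.add(r); // new document, or strictly newer version
            }
        }
        return toIndex;
    }
}
```

The same helper works for option #3 as well; the only difference is whether `remoteVersions` holds one batch's worth of IDs or the whole index.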
What do you think is the best approach? Or is there a better way that I'm missing?
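For context on option #4, this is roughly the error classification we were after: with `version_type=external`, a stale update in a bulk response fails with the type `version_conflict_engine_exception`, which for our purposes just means "a newer version is already indexed" and can be skipped, while any other failure is a real error. A minimal sketch (the classifier name is mine; in the real code the failure type would be read from each bulk item's failure):

```java
// Hypothetical classifier for bulk item failures under external versioning.
// A version conflict means our delta record was stale and can be skipped;
// any other failure type should be surfaced or retried.
class BulkFailureClassifier {
    static final String VERSION_CONFLICT = "version_conflict_engine_exception";

    /** True if the failure only means a newer version is already indexed. */
    static boolean isBenignVersionConflict(String failureType) {
        return VERSION_CONFLICT.equals(failureType);
    }
}
```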