I need to audit a list of billions of filenames from 2 sources that "should" contain the same number of files and the same filenames. I originally thought of Elasticsearch, but I'm beginning to think this job would be better suited to an RDBMS.
There are very few fields but a very large list, where an RDBMS may be more efficient with proper indices.
At the moment I'm only concerned with verifying that the filenames exist in both datasets and reporting on any that are missing.
If I had two lists and wanted to find the differences between them, I'd sort them and then walk them in parallel, comparing as I go. This is fairly common in RDBMSes, where it's called a merge join. I'd reach for PostgreSQL personally. Or write a script. But whatever RDBMS you are comfortable with would do the job. I think MySQL would too, though I'm not 100% sure on that one; years ago it didn't have the best query planning capabilities, but I expect it is much better now.
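
As a rough illustration of the "sort then walk in parallel" idea, here's a minimal Python sketch. It assumes each source has already been dumped to a plain text file with one filename per line and sorted in byte order (e.g. with `LC_ALL=C sort`); the file paths are just placeholders. It streams both lists, so memory use stays flat regardless of list size.

```python
import sys

def next_name(f):
    """Return the next filename from an open file, or None at EOF."""
    line = f.readline()
    return line.rstrip("\n") if line else None

def merge_diff(path_a, path_b):
    """Walk two sorted filename lists in parallel (merge-join style)
    and report names that appear in only one of them."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = next_name(fa), next_name(fb)
        while a is not None or b is not None:
            if b is None or (a is not None and a < b):
                print(f"only in {path_a}: {a}")
                a = next_name(fa)
            elif a is None or b < a:
                print(f"only in {path_b}: {b}")
                b = next_name(fb)
            else:
                # Present in both lists: advance both sides.
                a, b = next_name(fa), next_name(fb)

if __name__ == "__main__":
    merge_diff(sys.argv[1], sys.argv[2])
```

The RDBMS version of the same thing would be loading each list into its own indexed table and doing a FULL OUTER JOIN on the filename, keeping rows where either side is NULL; with both sides indexed, PostgreSQL can satisfy that with a merge join.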