Hello! I have a question about Elasticsearch database design. My data consists of 10 different collections spread across tapes. Each tape can hold one or more collections, and every tape has 100K+ unique files. In Elasticsearch I would like to store only metadata: tapeName, collectionName, fileName, fileInfo, fileSize, fileDate, fileHash. Searches would mainly be collection-specific, but sometimes across all collections. Physical layout:
Yes, every document always has fileName, fileSize, fileDate and fileHash filled in. fileInfo is filled in very rarely. All documents are associated with a tapeName and a collectionName.
fileDate is the file modification date from the filesystem. If there is more than one file with the same fileName but a different fileHash, I can compare fileDate to make a decision.
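To make the field list concrete, here is a minimal sketch of how the mapping might look. It assumes Elasticsearch 7-style mappings without mapping types, and the field types (keyword for exact-match names and hashes, text for the rarely filled fileInfo, long for sizes, date for fileDate) are my assumptions, not something confirmed in this thread. The helper name build_tape_mapping is hypothetical.

```python
import json


def build_tape_mapping():
    """Return a create-index request body for the file metadata fields.

    All field types below are assumptions chosen for illustration.
    """
    return {
        "mappings": {
            "properties": {
                "tapeName": {"type": "keyword"},        # exact match / filtering
                "collectionName": {"type": "keyword"},  # exact match / filtering
                "fileName": {"type": "keyword"},        # main lookup field
                "fileInfo": {"type": "text"},           # rarely filled free text
                "fileSize": {"type": "long"},           # size in bytes
                "fileDate": {"type": "date"},           # filesystem modification date
                "fileHash": {"type": "keyword"},        # compared for equality only
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent when creating an index.
    print(json.dumps(build_tape_mapping(), indent=2))
```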
Thank you very much for your suggestion! I will seriously consider it!
Besides brand-new files, old data is rotated, i.e. 5-8 year old tapes are deleted and rewritten to new ones. If I use indices per year, it would be very easy to delete old years. Nice.
The priority is to search by fileName, and it is usually known which collection the file should be in. collectionName and tapeName are also important. The other data (info, date, size, hash) is secondary.
Do I understand correctly that if I have indices per year, I have to search through all collections to find e.g. a fileName?
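A rough sketch of how old yearly indices could be picked out for deletion. It assumes the indices are named tapes-<year> and that the retention window is 8 years (the upper end of the 5-8 years mentioned); the function name expired_indices and its parameters are hypothetical.

```python
def expired_indices(index_names, current_year, keep_years=8, prefix="tapes-"):
    """Return the yearly indices that fall outside the retention window.

    The window covers the current year plus the (keep_years - 1) years
    before it; names that do not look like '<prefix><year>' are ignored.
    """
    cutoff = current_year - keep_years + 1  # oldest year that is still kept
    expired = []
    for name in index_names:
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit() and int(suffix) < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["tapes-2010", "tapes-2011", "tapes-2012", "tapes-2019"]
    print(expired_indices(names, current_year=2019))
```

Each returned name could then be removed as a whole with the delete index API (DELETE /<index>), which avoids deleting documents one by one.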
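As far as I understand, a single search request can target several indices through a wildcard pattern, so a lookup by fileName would not need one request per year. Below is a minimal sketch of such a request, assuming indices named tapes-<year> and that fileName and collectionName are keyword fields; the helper names are hypothetical.

```python
def build_filename_search(file_name, collection_name=None):
    """Build a search body that filters on fileName and, optionally, collectionName."""
    filters = [{"term": {"fileName": file_name}}]
    if collection_name is not None:
        # Narrow the search to one collection when it is known.
        filters.append({"term": {"collectionName": collection_name}})
    return {"query": {"bool": {"filter": filters}}}


def search_path(index_pattern="tapes-*"):
    """Return the request path that searches every index matching the pattern."""
    return f"/{index_pattern}/_search"


if __name__ == "__main__":
    print("POST", search_path())
    print(build_filename_search("report.pdf", collection_name="collection01"))
```

Leaving out collection_name gives the search across all collections.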
I can think of two possibilities for yearly indices:
a) current year (tapes-2019, tapes-2020, ...), which brings the total index count to 5-8 (the time window for rewriting tapes). It's easy to discard old indices (simply delete them from the file system), but individual index size will grow gradually (new data is added constantly and old data is rewritten).
b) fileDate year (tapes-1990, ..., tapes-2019), which brings the total index count to 29 as of today. Index sizes would be more spread out, but data from discarded tapes would have to be deleted from inside the indices separately. Is that a problem?
As there are 100,000+ files per tape, should I look at the join datatype, for example with tapeName as the parent and all other fields as the child? Or collectionName as the parent and all others as the child?
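To illustrate the difference between the two options, here is a small sketch of how the target index would be chosen in each case, and of the delete-by-query request that option b) would need when a tape is discarded. The function names and the tape name TAPE0001 are hypothetical; the request targets the _delete_by_query endpoint with a term query on tapeName, which assumes tapeName is a keyword field.

```python
from datetime import date


def index_for_write_year(written_on):
    """Option a): index named after the year the tape was written."""
    return f"tapes-{written_on.year}"


def index_for_file_date(file_date):
    """Option b): index named after the year of the file's modification date."""
    return f"tapes-{file_date.year}"


def delete_tape_request(tape_name, index_pattern="tapes-*"):
    """Option b) cleanup: method, path and body that remove one tape's documents."""
    path = f"/{index_pattern}/_delete_by_query"
    body = {"query": {"term": {"tapeName": tape_name}}}
    return "POST", path, body


if __name__ == "__main__":
    # A file last modified in 1994 that sits on a tape written in 2019.
    print(index_for_write_year(date(2019, 6, 1)))   # option a)
    print(index_for_file_date(date(1994, 3, 15)))   # option b)
    print(delete_tape_request("TAPE0001"))
```

With option a) a discarded tape's documents all live in one yearly index that ages out as a whole, while with option b) they may be scattered over many indices, hence the delete-by-query across tapes-*.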
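For reference, this is roughly what the first variant (tape as parent, file as child) might look like with the join field type. It is only a sketch to make the question concrete, not a recommendation; the join field name relation, the relation names tape and file, and the helper names are all hypothetical.

```python
def build_join_mapping():
    """Mapping with a join field where a 'tape' document is the parent of 'file' documents."""
    return {
        "mappings": {
            "properties": {
                "relation": {"type": "join", "relations": {"tape": "file"}},
                "tapeName": {"type": "keyword"},
                "fileName": {"type": "keyword"},
            }
        }
    }


def build_parent_doc(tape_name):
    """Parent document: one per tape."""
    return {"tapeName": tape_name, "relation": {"name": "tape"}}


def build_child_doc(tape_id, file_name):
    """Child document: one per file, pointing at the ID of its parent tape document."""
    return {"fileName": file_name, "relation": {"name": "file", "parent": tape_id}}


if __name__ == "__main__":
    print(build_join_mapping())
    print(build_parent_doc("TAPE0001"))
    print(build_child_doc("tape-1", "report.pdf"))
```

As far as I can tell, each child document would also have to be indexed with the routing value set to its parent's ID, so that parent and children end up on the same shard.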