Simple Elasticsearch database design help

Hello! I have a question about Elasticsearch index design. My data consists of 10 different collections spread across tapes. Each tape can store one or more collections, and every tape holds 100K+ unique files. In Elasticsearch I would like to store only metadata: tapeName, collectionName, fileName, fileInfo, fileSize, fileDate, fileHash. Searches would mainly be collection-specific, but sometimes across all collections. Physical layout:

tape1
collection1
|-> file1
|-> ..

tape2
collection1
|-> file1x
|-> ..
collection2
|-> file1
|->..

Below is what I have in mind so far. Or should there be a separate index for every collection, or something else?

PUT tapes
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
         "tapeName": {"type": "text"},
         "collectionName": {"type": "keyword"},
         "fileName": {"type": "text"},
         "fileInfo": {"type": "text"},
         "fileSize": {"type": "integer"},
         "fileDate": {"type": "date"},
         "fileHash": {"type": "keyword"}
      }
    }
  }
}

Will every document have a failDate?

Yes, every document always has fileName, fileSize, fileDate and fileHash filled in. fileInfo is filled in very rarely. And all the documents are associated with a tapeName and a collectionName.

fileDate is the file modification date from the filesystem. If there is more than one file with the same fileName and they have different fileHash values, I can compare fileDate to make a decision.

Fail or file?

Sorry, yes, file :slight_smile:
Fixed above. In my native language "fail" translates to "file" :smiley:

I'd suggest you use time-based indices, even if only yearly, e.g. tapes-YYYY, and then use fileDate as the timestamp.

Otherwise what you are suggesting looks fine, and it's great that you are thinking this far ahead :slight_smile:
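
For illustration, here is a minimal Python sketch of how a client could pick the yearly index for a document from its fileDate. The helper name `index_for`, the `tapes` prefix as a default argument, and the ISO 8601 date string are assumptions made for this example, not something stated in the thread.

```python
from datetime import datetime


def index_for(file_date: str, prefix: str = "tapes") -> str:
    """Return the yearly index name (prefix-YYYY) for an ISO 8601 fileDate.

    Hypothetical helper: the thread only suggests the tapes-YYYY naming.
    """
    year = datetime.fromisoformat(file_date).year
    return f"{prefix}-{year}"


# Sample date is made up for the example.
print(index_for("2019-03-14T09:26:53"))
```

Keeping the prefix lowercase matters, because Elasticsearch index names must be lowercase.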

Thank you very much for your suggestion! I will seriously consider it!

Besides brand-new files, old data is rotated, i.e. tapes that are 5-8 years old are deleted and rewritten to new ones. If I use one index per year, it would be very easy to delete old years. Nice.

The priority is to search by fileName, and usually it is known which collection the file should be in. collectionName and tapeName are also important. The other data (info, date, size, hash) is secondary.

Do I understand correctly that if I have indices per year, then I have to search through all collections to find e.g. a fileName?

Yes, but that's not really anything worth worrying about.
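
As a rough sketch of why this is not a big concern: a single search request can target every yearly index through a wildcard pattern and still narrow the results to one collection. The Python below only builds the request path and body and does not talk to a cluster. The function name `build_search` and the `tapes-*` pattern are assumptions for this example; the field names come from the mapping in the first post.

```python
import json


def build_search(file_name, collection=None, index_pattern="tapes-*"):
    """Return (path, body) for a search across all yearly indices.

    fileName is a text field, so a match query is used; collectionName is a
    keyword field, so an exact term filter is used.
    """
    query = {"bool": {"must": [{"match": {"fileName": file_name}}]}}
    if collection is not None:
        query["bool"]["filter"] = [{"term": {"collectionName": collection}}]
    return f"/{index_pattern}/_search", {"query": query}


# Sample values are made up for the example.
path, body = build_search("file1", collection="collection1")
print("GET", path)
print(json.dumps(body, indent=2))
```

Leaving `collection` as `None` gives the search across all collections that the original question also mentions.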

Warkolm,
so you are saying the index name is tapename-yyyy?

If he has 1000 tapes, then he will have 1000 indices.
What if he uses the same index with document_id = tapes-YYYY and fileDate as the timestamp?

If you have one per tape, yes. But I suggested that they group all tapes by year.

Oh, OK, I misunderstood.
The index name is tapes-2019, then tapes-2020. Got it.

I can think of two possibilities for yearly indices:

a) current year (tapes-2019, tapes-2020, ..), which brings the total index count to 5-8 (the time window for rewriting tapes). It's easy to discard old indices (simply delete from file system), but the size of individual indices will grow gradually (new data is added constantly and old data is rewritten).

b) fileDate year (tapes-1990, .. tapes-2019), which brings the total index count to 29 as of today. Index sizes would be more evenly spread out, but the data of discarded tapes has to be deleted from inside the indices separately. Is that a problem?
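
To make option b) concrete, removing one discarded tape from all yearly indices could be done with a delete-by-query request. The Python sketch below only builds the request and does not send it. It assumes tapeName is mapped as `keyword` (the mapping above uses `text`, where an exact term match is less reliable), and the names `build_tape_delete` and `tapes-*` are made up for this example.

```python
import json


def build_tape_delete(tape_name, index_pattern="tapes-*"):
    """Return (path, body) for deleting every document of one tape.

    Assumption: tapeName is a keyword field, so the term query matches the
    exact tape name.
    """
    path = f"/{index_pattern}/_delete_by_query"
    body = {"query": {"term": {"tapeName": tape_name}}}
    return path, body


# Sample tape name is made up for the example.
path, body = build_tape_delete("tape1")
print("POST", path)
print(json.dumps(body, indent=2))
```

With option a), the same cleanup would instead be a single delete of the whole yearly index.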

As there are 100,000+ files per tape, should I look into the join datatype, for example with tapeName as the parent and all other fields as children? Or collectionName as the parent and all others as children?

Just make it flat. Each document would contain all the fields required.
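
A flat document here simply repeats tapeName and collectionName on every file entry instead of linking to a parent. A minimal sketch of building such a document in Python follows. The helper name `make_doc` and all sample values are hypothetical; the field names are the ones from the mapping in the first post.

```python
import json


def make_doc(tape, collection, name, size, date, file_hash, info=None):
    """Build one flat metadata document for a single file on a tape."""
    doc = {
        "tapeName": tape,
        "collectionName": collection,
        "fileName": name,
        "fileSize": size,
        "fileDate": date,
        "fileHash": file_hash,
    }
    # fileInfo is rarely filled in, so it is left out when empty.
    if info:
        doc["fileInfo"] = info
    return doc


# Sample values are made up for the example.
doc = make_doc("tape2", "collection1", "file1x", 1024,
               "2019-03-14T09:26:53", "d41d8cd98f00b204e9800998ecf8427e")
print(json.dumps(doc, indent=2))
```

The repetition costs some storage, but every search and filter then works on a single document without a join.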

Thanks for helping me! I appreciate it :smiley:
