Simple Elasticsearch database design help

Mart · January 20, 2019, 12:09pm

Hello! I have a question about Elasticsearch index design. My data consists of 10 different collections spread across tapes. Each tape can hold one or more collections, and every tape has 100K+ unique files. In Elasticsearch I would like to store only metadata: tapeName, collectionName, fileName, fileInfo, fileSize, fileDate, fileHash. Searches would mainly be collection-specific, but sometimes across all collections. Physical layout:

tape1
  collection1
    |-> file1
    |-> ..

tape2
  collection1
    |-> file1x
    |-> ..
  collection2
    |-> file1
    |-> ..

Below is what I have in mind so far. Or should there be a different index for every collection, or something else?

PUT tapes
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "tapeName": {"type": "text"},
        "collectionName": {"type": "keyword"},
        "fileName": {"type": "text"},
        "fileInfo": {"type": "text"},
        "fileSize": {"type": "integer"},
        "fileDate": {"type": "date"},
        "fileHash": {"type": "keyword"}
      }
    }
  }
}

warkolm · January 21, 2019, 2:51am

Will every document have a failDate?

Mart · January 21, 2019, 4:24am

Yes, every document always has fileName, fileSize, fileDate, and fileHash filled. fileInfo is filled very rarely. And all documents are associated with a tapeName and a collectionName.

fileDate is the file modification date from the filesystem. If there is more than one file with the same fileName and they have different fileHash values, I can compare fileDate to make a decision.

warkolm · January 21, 2019, 4:25am

Fail or file?

Mart · January 21, 2019, 4:27am

Sorry, yes, file.
Fixed above. In my native language, "fail" translates to "file".

warkolm · January 21, 2019, 6:32am

I'd suggest you use time-based indices, even if yearly, e.g. tapes-YYYY, and then use fileDate as the timestamp.

Otherwise what you are suggesting looks fine, and it's great that you are thinking this far ahead.
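
As a minimal sketch of the yearly naming scheme suggested above, the index name could be derived from each document's fileDate. The helper name `index_for` and the ISO 8601 date format are assumptions made for illustration; neither comes from the thread.

```python
from datetime import datetime


def index_for(file_date: str, prefix: str = "tapes") -> str:
    """Return the yearly index name for an ISO 8601 date string.

    Assumption: fileDate is stored as an ISO 8601 string such as
    "2019-01-20T12:09:00" or "2019-01-20".
    """
    year = datetime.fromisoformat(file_date).year
    return f"{prefix}-{year:04d}"


if __name__ == "__main__":
    print(index_for("2019-01-20T12:09:00"))
```

Each document would then be indexed into whatever name this returns, so all files modified in the same year end up in the same index regardless of tape.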

Mart · January 21, 2019, 11:16am

Thank you very much for your suggestion! I will seriously consider it!

Besides brand new files, old data is rotated, i.e. 5-8 year old tapes are deleted and rewritten to new ones. If I use one index per year, it would be very easy to delete old years. Nice.

The priority is to search by fileName, and usually it is known which collection the file should be in. collectionName and tapeName are also important. The other data (info, date, size, hash) is secondary.

Do I understand correctly that if I have indices per year, then I have to search through all collections to find e.g. a fileName?

warkolm · January 21, 2019, 9:14pm

Yes, but that's not really anything worth worrying about.
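
For illustration, a search over yearly indices can target all of them with a wildcard index pattern and still be narrowed to one collection with a filter. The sketch below only builds the request path and body and sends nothing; the function name `build_search` and the sample values are hypothetical.

```python
import json


def build_search(file_name, collection=None):
    """Build (path, body) for a fileName search across all yearly indices."""
    query = {"bool": {"must": [{"match": {"fileName": file_name}}]}}
    if collection is not None:
        # collectionName is mapped as keyword in the posted mapping,
        # so an exact term filter can be used here.
        query["bool"]["filter"] = [{"term": {"collectionName": collection}}]
    # "tapes-*" matches tapes-2019, tapes-2020, and so on.
    return "/tapes-*/_search", {"query": query}


if __name__ == "__main__":
    path, body = build_search("file1", collection="collection1")
    print("GET", path)
    print(json.dumps(body, indent=2))
```

Leaving out `collection` gives the search across all collections; passing it restricts the same request to one collection without needing a separate index per collection.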

elasticforme · January 21, 2019, 9:19pm

Warkolm,
so you are saying the index name is tapename-yyyy?

If he has 1000 tapes, then he will have 1000 indices.
What if he uses a single index with document_id = tapes-YYYY and fileDate as the timestamp?

warkolm · January 21, 2019, 10:05pm

If you have one per tape, yes. But I suggested that they group all tapes by year.

elasticforme · January 21, 2019, 10:17pm

Oh OK, I misunderstood.
The index name is tapes-2019, then tapes-2020. Got it.

Mart · January 22, 2019, 5:29am

I can think of two possibilities for yearly indices:

a) current year (tapes-2019, tapes-2020, ..), which makes the total index count 5-8 (the time window for rewriting tapes). It's easy to discard old indices (simply delete them from the file system), but each index will grow gradually (new data is added constantly and old data is rewritten).

b) fileDate year (tapes-1990, .. tapes-2019), which makes the total index count 29 as of today. Index sizes would be more spread out, but data from discarded tapes would have to be deleted from within the indices separately. Is that a problem?
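
The difference between the two options when data is retired can be sketched roughly as follows: under (a) a whole yearly index is dropped with the delete index API, while under (b) one tape's documents are removed from every index with a delete-by-query request. The function names are hypothetical, and the `term` query assumes tapeName is mapped as `keyword` rather than `text` as in the posted mapping, so that it matches the exact tape name.

```python
import json


def drop_year(year):
    """Option (a): removing a whole yearly index is a single DELETE request."""
    return "DELETE", f"/tapes-{year}", None


def drop_tape(tape_name):
    """Option (b): remove one tape's documents from all yearly indices.

    Assumption: tapeName is a keyword field, so term matches it exactly.
    """
    body = {"query": {"term": {"tapeName": tape_name}}}
    return "POST", "/tapes-*/_delete_by_query", body


if __name__ == "__main__":
    for method, path, body in (drop_year(2011), drop_tape("tape1")):
        print(method, path)
        if body is not None:
            print(json.dumps(body))
```

Dropping an index is a cheap operation, whereas delete-by-query has to find and mark each matching document, so option (b) trades simpler retirement for more evenly sized indices.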

As there are 100000+ files per tape, should I look at the join datatype, for example with tapeName as parent and all other fields as child? Or collectionName as parent and all others as child?

warkolm · January 22, 2019, 5:31am

Just make it flat. Each document would contain all the fields required.
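
One way to picture the flat layout: every file becomes its own document that repeats tapeName and collectionName, instead of pointing at a parent. The sketch below flattens a nested tape layout like the one in the first post; the function name `flatten` and all sample sizes, dates, and hashes are made-up placeholders.

```python
def flatten(layout):
    """Turn {tape: {collection: [file dicts]}} into one flat document per file."""
    docs = []
    for tape_name, collections in layout.items():
        for collection_name, files in collections.items():
            for file_fields in files:
                # Each document carries its own tapeName and collectionName.
                docs.append({
                    "tapeName": tape_name,
                    "collectionName": collection_name,
                    **file_fields,
                })
    return docs


if __name__ == "__main__":
    # Placeholder values for illustration only.
    layout = {
        "tape2": {
            "collection1": [
                {"fileName": "file1x", "fileSize": 1024,
                 "fileDate": "2019-01-20", "fileHash": "placeholder-hash-1"},
            ],
            "collection2": [
                {"fileName": "file1", "fileSize": 2048,
                 "fileDate": "2018-06-01", "fileHash": "placeholder-hash-2"},
            ],
        },
    }
    for doc in flatten(layout):
        print(doc)
```

Repeating the two names in every document costs a little storage, but it keeps each search a plain query on one document type with no parent/child lookup.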

Mart · January 22, 2019, 5:45am

Thanks for helping me! I appreciate it.

system · February 19, 2019, 5:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
