Hello,
I'm looking for architectural best-practice advice from someone with more experience (read: wisdom) with ES than I have.
We are testing a new ES 1.5.2 cluster with 3 nodes (8 cores / 16 GB RAM / 1.5 TB ES datastore).
We are attempting to use ES to build a distributed binary datastore. The starting dataset is about 25 million binary files averaging ~40 KB in size.
We were initially going to use Riak, but since the ES community is much more active, we wanted to see if we could build it with ES.
Also, I wasn't able to find much authoritative information on mappings or other configuration tweaks for non-indexed binary data in ES (ahem, hopefully this thread becomes that for others).
Here's a very simple first stab at mappings:
"index_name": {
  "mappings": {
    "type_name": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "field_name": {
          "type": "binary"
        }
      }
    }
  }
}
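For anyone following along, here is a minimal sketch of what a document body for this mapping could look like. ES expects the value of a `binary` field to be a Base64-encoded string. The helper names `make_binary_doc` and `read_binary_doc` and the sample payload are made up for illustration; `field_name` is the placeholder from the mapping above.

```python
import base64
import json


def make_binary_doc(payload: bytes, field: str = "field_name") -> str:
    """Return a JSON document body with `payload` Base64-encoded under `field`."""
    encoded = base64.b64encode(payload).decode("ascii")
    return json.dumps({field: encoded})


def read_binary_doc(body: str, field: str = "field_name") -> bytes:
    """Decode the Base64 value in a document body back into the original bytes."""
    return base64.b64decode(json.loads(body)[field])


# Hypothetical payload standing in for one of the ~40 KB files.
sample = b"\x00\x01binary blob\xff"
body = make_binary_doc(sample)
restored = read_binary_doc(body)
print(len(sample), "raw bytes ->", len(body), "bytes of JSON")
```

Note that Base64 inflates every payload by roughly a third before it reaches ES, which matters for the sizing questions below.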
Also, replicas=0, shards=1. For our particular use case, it is more practical to re-index on detection of loss than to have replicas.
Settings:
"number_of_replicas": "0",
"number_of_shards": "1",
The data logically partitions into about 300 segments, so as a first attempt we have created one index per segment.
Although initial tests look promising, we are looking for suggestions and tuning options. I'd be happy to report back and contribute some benchmarking results for the community in exchange for knowledge.
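To make the layout concrete, here is a rough sketch of how the per-segment indexes could be named and created with the settings and mapping posted above. The `binstore` prefix, the zero-padded naming scheme, the exact count of 300, and all function names are assumptions for illustration, not what we necessarily run. The code only builds the request pieces and does not talk to a cluster.

```python
import json

NUM_SEGMENTS = 300  # "about 300" segments; the exact count is an assumption

# Same settings and mapping as posted above, combined into one create-index body.
INDEX_BODY = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
    },
    "mappings": {
        "type_name": {
            "_all": {"enabled": False},
            "properties": {"field_name": {"type": "binary"}},
        }
    },
}


def index_for_segment(segment_id: int, prefix: str = "binstore") -> str:
    """Map a logical segment number to a (lowercase) index name."""
    if not 0 <= segment_id < NUM_SEGMENTS:
        raise ValueError("segment_id out of range: %r" % segment_id)
    return "{}-{:03d}".format(prefix, segment_id)


def create_index_request(segment_id: int):
    """Return (method, path, json_body) for creating one segment's index."""
    return "PUT", "/" + index_for_segment(segment_id), json.dumps(INDEX_BODY)


def doc_path(segment_id: int, doc_id: str, doc_type: str = "type_name") -> str:
    """Return the /index/type/id path used to store or fetch one document."""
    return "/{}/{}/{}".format(index_for_segment(segment_id), doc_type, doc_id)


method, path, body = create_index_request(7)
print(method, path)
print(doc_path(42, "abc"))
```

With this layout a lookup is a plain GET on the path returned by `doc_path`, so the client needs to know a file's segment before it can find the file.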
Questions:
- If we stay with the ~300 indexes and 1 shard apiece, will ES rebalance if a single node ends up hosting disproportionately large indexes in terms of total storage? Or does a primary stay where it is no matter what when you use a single shard?
- I've read here on the group that 50 GB is a good rule of thumb for max shard size. Since we are not attempting to use the attachment plugin or index the data in other ways, can we essentially just let one index grow indefinitely? To put it another way, is there any rule of thumb for max index size when using ES as a key/value store for binary data?
- If we want to store metadata about the binary data, would "best practice" be to add it to the same Elasticsearch document? Or would it be better to leave the index for the binary data as a pure key/value store and put the metadata in a different index?
- Are there any particular configuration tweaks that are recommended but not addressed here?
- In old, unauthoritative posts found on Google, it seemed like ES discouraged use as a distributed file store. Is this still the company's official position? If so, any elaboration on the rationale would be awesome!
- Are there any recommendations on the maximum amount of binary data each machine should be responsible for? I'm assuming not, so are there recommendations for how to benchmark binary data storage other than "wait until it breaks"? I apologize for the vagueness of this one, but I am not aware of any other discussions on the topic.
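For context on the sizing questions, here is a back-of-the-envelope estimate based on the numbers above. It assumes "40k" means 40,000 bytes, an even spread across indexes and nodes, and that only the Base64 inflation of the `binary` field matters. Actual on-disk size will differ because of stored-field compression and Lucene overhead, so treat these as rough orders of magnitude only.

```python
NUM_FILES = 25_000_000           # "about 25m binary files"
AVG_FILE_BYTES = 40_000          # "~40k", read here as 40 KB (decimal); an assumption
NUM_INDEXES = 300                # one single-shard index per segment
NUM_NODES = 3
SHARD_RULE_BYTES = 50 * 10**9    # the 50 GB rule of thumb mentioned above


def base64_len(n: int) -> int:
    """Length in bytes of the padded Base64 encoding of n raw bytes."""
    return 4 * ((n + 2) // 3)


raw_total = NUM_FILES * AVG_FILE_BYTES
encoded_total = NUM_FILES * base64_len(AVG_FILE_BYTES)
avg_index_bytes = encoded_total / NUM_INDEXES
per_node_bytes = encoded_total / NUM_NODES
# Ceiling division: how many shards if each were filled to the 50 GB rule.
shards_at_rule = -(-encoded_total // SHARD_RULE_BYTES)

print("raw data:        %.2f TB" % (raw_total / 1e12))
print("Base64-encoded:  %.2f TB" % (encoded_total / 1e12))
print("avg per index:   %.2f GB" % (avg_index_bytes / 1e9))
print("avg per node:    %.2f GB" % (per_node_bytes / 1e9))
print("shards at 50 GB:", shards_at_rule)
```

Under these assumptions the average index stays well under the 50 GB rule of thumb, so the open question is really the skewed segments rather than the average.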
Thanks in advance for any advice or knowledge around this use case. I hope this can be the start of some great documentation for others considering this path.