Sharding Best Practices for Binary Non-Indexed Data

Hello,

I'm looking for some architectural best-practice advice from someone with more ES experience (read: wisdom) than myself.

We are testing a new ES 1.5.2 cluster with 3 nodes (8 cores / 16 GB RAM / 1.5 TB ES datastore).

We are attempting to use ES to create a distributed binary datastore system. The starting dataset is about 25 million binary files averaging ~40 KB in size.

We were initially going to use Riak, but since the community is much more active around ES, we wanted to see if we could build it with ES.

Also, I wasn't able to find much authoritative material on mappings or other configuration tweaks for non-indexed binary data in ES (ahem, hopefully this thread becomes that for others).

Here's a very simple first stab at mappings:

   "index_name": {
      "mappings": {
         "type_name": {
            "_all": {
               "enabled": false
            },
            "properties": {
               "field_name": {
                  "type": "binary"
               }
            }
         }
      }
   }

Also, replicas=0, shards=1. For our particular use case, it is more practical to re-index on detection of loss than to have replicas.

Settings:

    "number_of_replicas": "0",
    "number_of_shards": "1",

The data logically partitions into about 300 segments, so we have tried using that as the initial scheme for splitting it across indexes.
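
To make this concrete, here's roughly how one of those per-segment indexes gets created and loaded through the ES 1.x REST API. This is only a sketch of our current approach; segment_0001, type_name, field_name, and the document ID are placeholders, and the payload has to be base64-encoded client-side, since the binary type stores base64 text:

    # create one of the ~300 per-segment indexes with the settings and mapping above
    curl -XPUT 'localhost:9200/segment_0001' -d '{
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "type_name": {
          "_all": { "enabled": false },
          "properties": {
            "field_name": { "type": "binary" }
          }
        }
      }
    }'

    # store one file; the value is the base64 encoding of the raw bytes
    curl -XPUT 'localhost:9200/segment_0001/type_name/file-12345' -d '{
      "field_name": "SGVsbG8gd29ybGQ="
    }'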

Although initial tests look promising, we are looking for suggestions and tuning options. I'd be happy to report back and contribute some benchmarking results for the community in exchange for knowledge.

Questions:

  1. If we stay with the ~300 indexes and 1 shard apiece, will ES rebalance if a single node becomes the host for disproportionately large indexes in terms of total storage? Or is the primary always the primary no matter what when you use a single shard?

  2. I've read here in the group discussions that 50 GB is a good rule of thumb for max shard size. Since we are not attempting to use the attachment plugin or index the data in other ways, can we essentially just use one index indefinitely? To put it another way, is there any rule of thumb for max index size when using it as a key/value store for binary data?

  3. If we want to insert metadata about the binary data, would "best practice" be to add it in the same elasticsearch document? Or would it be better to leave the index for the binary data as a pure key/value store and put the metadata in a different index?

  4. Are there any particular recommended configuration tweaks not addressed here?

  5. In old, un-authoritative posts found on Google, it seemed like ES discouraged use as a distributed file store. Is this still the company's official position? If so, any elaboration on the rationale would be awesome!

  6. Are there any recommendations on the max amount of binary data each machine should be responsible for? I'm assuming not, so are there recommendations for how to benchmark binary data storage other than "wait until it breaks"? I apologize for the vagueness of this one, but I am not aware of any other discussions on the topic.

Thanks in advance for any advice or knowledge around this use case. I hope this can be the start of some great documentation for others considering this path.

my two cents:

If we stay with the ~300 indexes and 1 shard apiece, will ES rebalance if a single node becomes the host for disproportionately large indexes in terms of total storage? Or is the primary always the primary no matter what when you use a single shard?

That's 1 shard per index, so 300 shards total spread across 3 nodes? That looks okay for 25 million documents. But you mentioned "Also, replicas=0, shards=1." What if one of the 3 nodes goes down and you have no replica? Things will start to break from there. If you are willing to re-index on loss, why choose a datastore in the first place? Usually you want a datastore because you want durability and redundancy.
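
On the rebalancing part specifically: as far as I know, ES balances shards by count rather than by byte size, so one node can still end up hosting the fatter shards. What you can do is make sure the disk-based allocation watermarks are enabled, so a node that is filling up stops receiving new shards and ES relocates some away. A rough sketch (the percentages are just example values, tune them for your disks):

    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "cluster.routing.allocation.disk.threshold_enabled": true,
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%"
      }
    }'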

I've read here in the group discussions that 50 GB is a good rule of thumb for max shard size. Since we are not attempting to use the attachment plugin or index the data in other ways, can we essentially just use one index indefinitely? To put it another way, is there any rule of thumb for max index size when using it as a key/value store for binary data?

You are using Elasticsearch to search data, right? If you just have a field of type binary, what do you intend to search? I don't think you can search on a binary field (see Field data types | Elasticsearch Guide [8.11] | Elastic). For the limits, read here: org.apache.lucene.codecs.lucene40 (Lucene 4.10.3 API).
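
Since you cannot search the binary field, retrieval is essentially get-by-id, i.e. pure key/value. Something like this, reusing the placeholder index/type/id from the example earlier in the thread (the _source parameter just limits the response to the binary field):

    curl -XGET 'localhost:9200/segment_0001/type_name/file-12345?_source=field_name'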

If we want to insert metadata about the binary data, would "best practice" be to add it in the same elasticsearch document? Or would it be better to leave the index for the binary data as a pure key/value store and put the metadata in a different index?

Same index, perhaps, assuming the metadata size is not huge. You mentioned "The starting dataset is about 25 million binary files averaging ~40 KB in size", so this should be doable. When you retrieve documents, make sure you have a limit, to prevent OOM (see Limiting Memory Usage | Elasticsearch: The Definitive Guide [master] | Elastic). But if you are fine with retrieving a document in two requests, maybe put the binary files in some other datastore and index only the metadata in ES. When you query, you query ES, and the result contains the unique ID of the file in the other datastore.
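
If you do keep the metadata in the same document, here is a sketch of what the 1.x mapping could look like; filename, content_type, and size_bytes are made-up example fields, and the second call shows how to keep the big base64 blob out of the response when you only want metadata:

    # add some hypothetical metadata fields next to the binary field
    curl -XPUT 'localhost:9200/segment_0001/_mapping/type_name' -d '{
      "type_name": {
        "properties": {
          "field_name":   { "type": "binary" },
          "filename":     { "type": "string", "index": "not_analyzed" },
          "content_type": { "type": "string", "index": "not_analyzed" },
          "size_bytes":   { "type": "long" }
        }
      }
    }'

    # metadata-only queries can exclude the heavy binary field from _source
    curl -XGET 'localhost:9200/segment_0001/type_name/_search' -d '{
      "_source": { "exclude": ["field_name"] },
      "query": { "term": { "content_type": "image/png" } }
    }'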

Are there any recommendations on the max amount of binary data each machine should be responsible for? I'm assuming not, so are there recommendations for how to benchmark binary data storage other than "wait until it breaks"? I apologize for the vagueness of this one, but I am not aware of any other discussions on the topic.

There are many monitoring tools for Elasticsearch; you should install one and monitor your cluster. From the monitoring output you get metrics that show you, or let you forecast, when things are starting to slow down or about to break.
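
For the 1.x series that would be something like Marvel, and even without a plugin the _cat and node-stats APIs are enough to watch per-node disk and document growth over time, for example:

    # per-index document counts and on-disk size
    curl -XGET 'localhost:9200/_cat/indices?v&h=index,docs.count,store.size'
    # shards and disk usage per node
    curl -XGET 'localhost:9200/_cat/allocation?v'
    # JVM heap and filesystem stats per node
    curl -XGET 'localhost:9200/_nodes/stats/jvm,fs?pretty'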

hth

jason

I feel bad saying it as a contributor to Elasticsearch, but maybe Apache Cassandra? Elasticsearch will probably have lots more write amplification than something like Cassandra, which doesn't try to do search.

OTOH, 40 KB isn't large at all: we do full-text search on much, much larger documents millions of times a day with Elasticsearch, while updating them continuously, and it works fine. So this would probably work fine too.