We're keeping 15 months of data open in ES for analytics purposes.
Now, with ES 6.x, we cannot have multiple types in a single index.
So if we break these indices down so that each type gets an index of its own, the count grows on average from 39 to around 80 indices per month.
That would be the ideal approach from a data modeling perspective.
As a workaround, we could add a custom type field, say custom_doc_type, and keep indexing the same documents into the existing indices. But ultimately that violates the one-type-per-index data modeling rule and creates sparse indices.
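To make the workaround concrete, here is a minimal sketch of what the indexing side might look like; the field name custom_doc_type comes from the post above, but the index name, helper function, and payload fields are hypothetical:

```python
# Hypothetical sketch of the "custom type field" workaround: instead of
# relying on the removed _type, every document carries its own type marker.

def make_doc(doc_type, payload):
    """Attach a custom_doc_type field so all types can share one index."""
    doc = dict(payload)
    doc["custom_doc_type"] = doc_type
    return doc

doc = make_doc("type_a", {"field_a": "some text", "@timestamp": "2018-06-01"})
# The document would then be indexed into a single shared index,
# e.g. PUT shared-2018.06/_doc/1 with this body.
print(doc["custom_doc_type"])  # type_a
```

Every document of every former type would get this extra field, which is exactly what makes the shared index sparse when the types share few fields.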
Our dilemma is: what's the ideal way to approach this problem, so we can keep the number of shards and open files in check while still adhering to the data modeling guidelines?
This data is only queried via Kibana and used for analytics purposes.
Is there a reason you have 2 replicas? Also, you can easily reduce the shard count on all of those indices to 1 primary; that will reduce the number of shards you have.
Reduce primary count, as above.
It's not clear if those are all monthly indices, as we cannot see the names, but you only have 240-ish shards now, which isn't a large number.
Even with 80 different indices, if they are all going to be well under 50GB, then you could consider quarterly batching to minimise the shard count.
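Some back-of-the-envelope math on the numbers in this thread (39 vs 80 indices per month, 15 months retained) shows why batching matters; the primary and replica counts below are assumptions for illustration:

```python
# Rough shard math for the retention window discussed above.
months = 15
primaries = 1   # one primary per index, as suggested above
replicas = 1    # assumption; the thread questions whether 2 are needed

def total_shards(indices_per_month):
    """Total shards (primaries + replicas) across the retention window."""
    return indices_per_month * months * primaries * (1 + replicas)

print(total_shards(39))       # 1170 with the current 39 indices/month
print(total_shards(80))       # 2400 after splitting one index per type
print(total_shards(80) // 3)  # 800 if the 80 indices are batched quarterly
```

So splitting per type roughly doubles the shard count, and quarterly batching claws most of that back, provided the resulting shards stay a reasonable size.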
There are a handful of fields which are common across schemas, e.g. 6-7 fields shared across 35-40 types. But the rest of the fields in each doc type are different, which will create a sparse index issue.
There are about 3-4 fields, across at most 10 types, which have data type conflicts, e.g. field_a: text in type_a vs field_a: float in type_b. With some work this can be resolved.
In Elasticsearch there have been improvements to the handling of sparse fields, so unless there are mapping conflicts I would not immediately rule out the option of storing all types in a single index and adding a separate field that indicates the type, which you can filter on.

If you still choose to go down the route of separate indices, I would recommend a single primary shard for all indices whose shard size is unlikely to exceed a few tens of GB. This means that a single index may not be spread out across all nodes, but as you have 80 indices, all nodes should still hold a good amount of data, so I do not see this necessarily being a problem.
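On the query side, filtering on such a type field is a plain term filter; a sketch of what the request body behind a Kibana filter might look like, with hypothetical field values (the custom_doc_type name comes from earlier in the thread):

```python
# Sketch of an ES query body that scopes a search to one former type
# within the shared index, plus the 15-month analytics window.
query = {
    "query": {
        "bool": {
            "filter": [
                # the "type indicator" field suggested above
                {"term": {"custom_doc_type": "type_a"}},
                # assumed retention window from the original post
                {"range": {"@timestamp": {"gte": "now-15M/M"}}},
            ]
        }
    }
}
```

Because both clauses are filters, they are cacheable and add little overhead compared to the old per-type routing.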
So, in some indices, if more than 70% of the data fields are empty, will it still be safe not to go down the splitting-into-different-indices route?
So, just to check I understand this correctly -
Add a field custom_type and copy the ES 5.x _type value into custom_type
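That copy step could be done with a scripted _reindex; a hedged sketch of the request body, where the index names are hypothetical and only the idea of copying _type into custom_type comes from the thread:

```python
# Sketch of a _reindex body that carries the old 5.x _type along as a
# regular field in the destination index. Index names are made up.
reindex_body = {
    "source": {"index": "logs-es5"},
    "dest": {"index": "logs-es6"},
    "script": {
        "lang": "painless",
        # ctx._type holds the source document's mapping type during reindex
        "source": "ctx._source.custom_type = ctx._type",
    },
}
# POST _reindex with this body would migrate the data with the type preserved.
```

After that, queries and Kibana filters reference custom_type instead of _type.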