Data modelling dilemma: Number of indices vs Sparse index

We have an ES 5.x setup in which we create around 39 indices per month.

Index distribution in the setup - https://gist.github.com/jay-dihenkar/80c9d3774b05027bd0f5db37df57b555

Cluster details -
Nodes = 3
Heap = 31g

We're keeping 15 months of data open in ES for analytics purposes.

Now, with ES 6.x, we cannot have multiple types in a single index.

So, on average, if we break these indices down so that each type gets an index of its own, the count goes from 39 to around 80 indices per month.

This would be the ideal way to model the data.

Alternatively, we could hack around this by adding a custom type field, say custom_doc_type, and keep indexing the same docs into the existing indices. But ultimately that violates the data modelling guidelines, and sparse indices will be created.

Our dilemma: what is the ideal way to approach this problem, so that we can keep the number of shards and open files in check while still adhering to the data modelling guidelines?

This data is only queried via Kibana and used for analytics purposes.

Is there a reason you have 2 replicas? Also, you can easily reduce the shard count on all of those indices to 1 primary, which will reduce the number of shards you have.

Reduce primary count, as above.
It's not clear if those are all monthly indices, as we cannot see the names, but you only have ~240 shards now, which isn't a large number.

Even with 80 different indices, if they are all going to stay well under 50GB, then you could consider moving to quarterly batching to minimise the shard count.
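For illustration, a minimal sketch of both suggestions using the Python client; the index pattern, template name, and host are assumptions, not taken from the gist:

```python
# A sketch only: index pattern, template name, and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Replica count is a dynamic setting, so it can be lowered on existing indices.
es.indices.put_settings(
    index="logs-*",
    body={"index": {"number_of_replicas": 1}},
)

# Primary shard count is fixed at index creation, so lower it via a template
# that the next monthly (or quarterly) indices will pick up.
es.indices.put_template(
    name="monthly-logs",
    body={
        "template": "logs-*",  # on ES 6.x this key becomes "index_patterns"
        "settings": {"number_of_shards": 1},
    },
)
```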

Currently we're running ES 5.x, which has an aggregations bug, so until it's fixed by the upgrade to ES 6.x we're keeping 100% of the data on all the ES nodes (hence the 2 replicas on 3 nodes).

Yes, per month we create 39 indices. So 39 indices × 15 months × 3 primary shards = 1755 open shards.

So if we split out to 80 indices per month with the current sharding: 80 indices × 15 months × 3 primary shards = 3600 open shards.

And that is without any form of batching of the data.

Does the data for the different types have the same fields, or do you have mapping conflicts across the different types of data?

  • The data schema is different for 80 unique types.
  • There are a handful of fields which are common across schemas, e.g. 6-7 fields common across 35-40 types. But the rest of the fields in each doc type are different, which will create a sparse-index issue.
  • There are 3-4 fields, across at most 10 types, which have conflicts in data type, e.g. field_a: text in type_a and field_a: float in type_b. With some work this can be resolved (one option is sketched below).
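One hypothetical way that resolution could look, if the types end up merged into a single index: rename the conflicting variant so both can coexist in one mapping. All index, type, and field names below are made up:

```python
# A sketch only: index, type, and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.indices.create(index="logs-merged-2018-01", body={
    "mappings": {
        "doc": {
            "properties": {
                "custom_doc_type": {"type": "keyword"},  # filterable type marker
                "field_a":         {"type": "text"},     # as used by type_a
                "field_a_numeric": {"type": "float"},    # renamed copy for type_b
            }
        }
    }
})
```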

In Elasticsearch 6.x there have been improvements to the handling of sparse fields, so unless there are mapping conflicts I would not immediately rule out the option of storing all types in a single index and adding a separate field that indicates the type, which you can filter on. If you still choose to go down the route of separate indices, I would recommend a single primary shard for all indices whose shard size is unlikely to exceed a few tens of GB. This means that a single index may not be spread out across all nodes, but as you have 80 indices, all nodes should still hold a good amount of data, so I do not see this necessarily being a problem.
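For illustration, filtering on such a type field could look like the sketch below; the index name, field name, and host are assumptions:

```python
# A sketch only: index name, field name, and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# The type field is mapped as keyword, so an exact term filter applies;
# in Kibana this is equivalent to a plain filter on that field.
resp = es.search(index="logs-merged-2018-01", body={
    "query": {
        "bool": {
            "filter": [{"term": {"custom_doc_type": "type_a"}}]
        }
    }
})
print(resp["hits"]["total"])
```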

So, if in some indices more than 70% of the data fields are empty, will it still be a safe option not to go down the splitting-into-different-indices route?

So, just to check that I understand this correctly:

  • Add a field custom_doc_type and copy the ES 5.x _type value into it
  • Hardcode the ES 6.x _type to "doc"
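For illustration, a minimal sketch of that migration via the reindex API; the index names and host are assumptions:

```python
# A sketch only: index names and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Copy every document into a single-type index, preserving the old
# _type in a custom_doc_type field and forcing the new _type to "doc".
es.reindex(body={
    "source": {"index": "logs-2018-01"},
    "dest":   {"index": "logs-merged-2018-01", "type": "doc"},
    "script": {
        "lang": "painless",
        "source": "ctx._source.custom_doc_type = ctx._type",  # key is "inline" on 5.x
    },
})
```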

I do not know if there is a threshold, so I would recommend trying it out.

Yes.

OK.

In case we try this out:

  • What ES metrics should we be monitoring?
  • What would be the evident signs that it's not working out well?

See how it works for your application and compare the performance to the other approach.
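As a hypothetical starting point for that comparison (not a prescription from this thread), the stats APIs expose per-index and per-node numbers that can be diffed between the two approaches; all names below are assumptions:

```python
# A sketch only: index name and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Per-index numbers: on-disk size, segment memory, and search timings.
stats = es.indices.stats(index="logs-merged-2018-01",
                         metric="store,segments,search")
primaries = stats["_all"]["primaries"]
print("store bytes:   ", primaries["store"]["size_in_bytes"])
print("segment memory:", primaries["segments"]["memory_in_bytes"])
print("query time ms: ", primaries["search"]["query_time_in_millis"])

# Node-level heap usage, for spotting memory pressure as shard counts grow.
for node in es.nodes.stats(metric="jvm")["nodes"].values():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"])
```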
