Data modelling dilemma: Number of indices vs Sparse index

We have an ES 5.x setup in which we create around 39 indices per month.

Index distribution in the setup - https://gist.github.com/jay-dihenkar/80c9d3774b05027bd0f5db37df57b555

Cluster details -
Nodes = 3
Heap = 31g

We're keeping 15 months of data open in ES for analytics purposes.

Now, with ES 6.x, we cannot have multiple types in a single index.

So, on average, if we break these indices down so that each type gets an index of its own, the count goes from 39 to around 80 indices per month.

This would be the ideal way to model the data.

Alternatively, we could hack around this by adding a custom type field, say custom_doc_type, and keep indexing the same docs into the existing indices. But ultimately that violates the data modelling guidelines, and sparse indices will be created.

Our dilemma: what is the ideal way to approach this problem, so that we can keep the number of shards and open files in check while still adhering to the data modelling guidelines?

This data is only queried via Kibana and used for analytics purposes.

Is there a reason you have 2 replicas? Also, you can easily reduce the shard count on all of those indices to 1 primary, which will reduce the number of shards you have.

Reduce primary count, as above.
It's not clear if those are all monthly indices, as we cannot see the names, but you only have ~240 shards now, which isn't a large number.

Even with 80 different indices, if they are all going to stay well under 50GB, then you could consider moving to quarterly batching to minimise the shard count.
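For illustration, a minimal sketch of both suggestions using the Python client; the index pattern, template name, and host are assumptions, not taken from the gist:

```python
# A sketch only: index pattern, template name, and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Replica count is a dynamic setting, so it can be lowered on existing indices.
es.indices.put_settings(
    index="logs-*",
    body={"index": {"number_of_replicas": 1}},
)

# Primary shard count is fixed at index creation, so lower it via a template
# that the next monthly (or quarterly) indices will pick up.
es.indices.put_template(
    name="monthly-logs",
    body={
        "template": "logs-*",  # on ES 6.x this key becomes "index_patterns"
        "settings": {"number_of_shards": 1},
    },
)
```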

Currently we're running ES 5.x, which has an aggregations bug, so until it's fixed by the upgrade to ES 6.x we're keeping 100% of the data on all the ES nodes (hence the 2 replicas on 3 nodes).

Yes, per month we create 39 indices. So 39 indices × 15 months × 3 primary shards = 1755 open shards.

So if we split out to 80 indices per month with the current sharding: 80 indices × 15 months × 3 primary shards = 3600 open shards.

And that is without any form of batching of the data.

Does the data for the different types have the same fields, or do you have mapping conflicts across the different types of data?

  • The data schema is different for 80 unique types.
  • There are a handful of fields which are common across schemas, e.g. 6-7 fields common across 35-40 types. But the rest of the fields in each doc type are different, which will create a sparse-index issue.
  • There are 3-4 fields, across at most 10 types, which have conflicts in data type, e.g. field_a: text in type_a and field_a: float in type_b. With some work this can be resolved (one option is sketched below).
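One hypothetical way that resolution could look, if the types end up merged into a single index: rename the conflicting variant so both can coexist in one mapping. All index, type, and field names below are made up:

```python
# A sketch only: index, type, and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.indices.create(index="logs-merged-2018-01", body={
    "mappings": {
        "doc": {
            "properties": {
                "custom_doc_type": {"type": "keyword"},  # filterable type marker
                "field_a":         {"type": "text"},     # as used by type_a
                "field_a_numeric": {"type": "float"},    # renamed copy for type_b
            }
        }
    }
})
```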

In Elasticsearch 6.x there have been improvements to the handling of sparse fields, so unless there are mapping conflicts I would not immediately rule out the option of storing all types in a single index and adding a separate field that indicates the type, which you can filter on. If you still choose to go down the route of separate indices, I would recommend a single primary shard for all indices whose shard size is unlikely to exceed a few tens of GB. This means that a single index may not be spread out across all nodes, but as you have 80 indices, all nodes should still hold a good amount of data, so I do not see this necessarily being a problem.
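For illustration, filtering on such a type field could look like the sketch below; the index name, field name, and host are assumptions:

```python
# A sketch only: index name, field name, and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# The type field is mapped as keyword, so an exact term filter applies;
# in Kibana this is equivalent to a plain filter on that field.
resp = es.search(index="logs-merged-2018-01", body={
    "query": {
        "bool": {
            "filter": [{"term": {"custom_doc_type": "type_a"}}]
        }
    }
})
print(resp["hits"]["total"])
```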

So, if in some indices more than 70% of the data fields are empty, will it still be a safe option not to go down the splitting-into-different-indices route?

So, just to check that I understand this correctly:

  • Add a field custom_doc_type and copy the ES 5.x _type value into it
  • Hardcode the ES 6.x _type to "doc"
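For illustration, a minimal sketch of that migration via the reindex API; the index names and host are assumptions:

```python
# A sketch only: index names and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Copy every document into a single-type index, preserving the old
# _type in a custom_doc_type field and forcing the new _type to "doc".
es.reindex(body={
    "source": {"index": "logs-2018-01"},
    "dest":   {"index": "logs-merged-2018-01", "type": "doc"},
    "script": {
        "lang": "painless",
        "source": "ctx._source.custom_doc_type = ctx._type",  # key is "inline" on 5.x
    },
})
```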

I do not know if there is a threshold, so I would recommend trying it out.

Yes.

OK.

In case we try this out:

  • What ES metrics should we be monitoring?
  • What would be the evident signs that it's not working out well?

See how it works for your application and compare the performance to the other approach.
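As a hypothetical starting point for that comparison (not a prescription from this thread), the stats APIs expose per-index and per-node numbers that can be diffed between the two approaches; all names below are assumptions:

```python
# A sketch only: index name and host are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Per-index numbers: on-disk size, segment memory, and search timings.
stats = es.indices.stats(index="logs-merged-2018-01",
                         metric="store,segments,search")
primaries = stats["_all"]["primaries"]
print("store bytes:   ", primaries["store"]["size_in_bytes"])
print("segment memory:", primaries["segments"]["memory_in_bytes"])
print("query time ms: ", primaries["search"]["query_time_in_millis"])

# Node-level heap usage, for spotting memory pressure as shard counts grow.
for node in es.nodes.stats(metric="jvm")["nodes"].values():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"])
```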
