Elasticsearch data structure & changing mapping

Hi,

We have a data visualisation application running purely on Elasticsearch, and have a question about our current data structure within Elasticsearch and how it can lead to a mapping explosion.

The data structure looks like the picture below:

As you can see in the picture, this is a multi-tenant application and each index holds the data related to a particular client.

Again, as per the picture, we have data_1, data_2, data_3, etc., which hold completely different types of data that the client uses for visualisation/reporting purposes. We don't have any control over the type of data that goes under these keys, as it is uploaded by the client and differs from client to client. So we have applied dynamic mapping to each of these indices to hold any type of data (changing key-value pairs), as you see in the pic.

As you can see, since we have five types of mapping for each of these fields under data_*, the total number of mapped fields in the index grows very large (20,000 for one of our clients with 10 data sets), and we expect this to grow a lot more for bigger clients with hundreds of data sets, which can lead to a mapping explosion.
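For illustration, a document we index today looks roughly like this (the index name and the keys under data_1/data_2 are client-supplied, so the names here are made up):

```
# Illustrative only: index and field names are examples
PUT client_1_index/_doc/1
{
  "data_1": {
    "nps_score": 9,
    "agent_name": "Alice",
    "call_date": "2019-03-01"
  },
  "data_2": {
    "revenue": 1250.50,
    "region": "EMEA"
  }
}
```

With dynamic mapping enabled, every new client-supplied key adds entries to the mapping (a string, for example, is dynamically mapped as both text and a .keyword sub-field), which is how the total grows into the tens of thousands.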

  1. What should the data structure be in a scenario like this? Should we move the data sets (data_1, data_2, etc.) into separate indices? If so, how can we apply some other logical separation for each client, as 'types' are no longer available in the latest versions of Elasticsearch?

  2. Also, what would be the maximum recommended value for the index.mapping.total_fields.limit setting? Right now we have set it to 100000000 to accommodate big data sets, as shown below.
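For reference, this is how we raise the limit at the moment (the index name is just an example):

```
# Illustrative only: index name is an example
PUT client_1_index/_settings
{
  "index.mapping.total_fields.limit": 100000000
}
```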

On top of that, we will have new features built frequently, and the associated data has to be stored in Elasticsearch as well. We cannot afford downtime every time we release a new feature just to change the mapping to include that feature's data, hence we have created a key 'features' (as in the pic) with dynamic mapping like 'data_*', so that we can query the data under this 'features' key.

  1. Is there a better approach to minimise downtime while introducing these types of new mapping requirements?

Thanks,
Arun.


The default limit of 1,000 fields per index is already quite high, so going far beyond it is not recommended. Setting it to the levels you are mentioning is almost guaranteed to cause problems, as the cluster state and mappings will get very large.

If you have an API layer in place between your tenants and Elasticsearch, I would recommend mapping client fields to a set of standard field names per data type for each data set. You can then translate as required and use this to keep the size of the mappings down. Tenant 1 could then have a field `X1` mapped to the standard field `string_field_1` for a particular data set, while the same tenant (or another) might have a different field `Z2` mapped to the same standard field for a different data set. This will affect relevancy and require more work at the API layer, but is likely to scale much better.
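To make that concrete, here is a minimal sketch of what such a fixed, per-data-set mapping could look like (the index and field names are only illustrative):

```
# Sketch only: index and field names are illustrative
PUT tenant_1_dataset_a
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "string_field_1": { "type": "keyword" },
      "string_field_2": { "type": "keyword" },
      "number_field_1": { "type": "double" },
      "date_field_1":   { "type": "date" }
    }
  }
}
```

The API layer would then keep a per-tenant, per-data-set lookup table (for example `X1 -> string_field_1` for one data set and `Z2 -> string_field_1` for another) and rewrite documents and queries accordingly.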

Changing the current scheme to instead increase the number of indices is also not likely to scale well, as each shard comes with a certain overhead.

Although this old blog post refers to a very old version of Elasticsearch, I suspect a lot of it still applies.


Hi Christian,

Really appreciate your quick and detailed response.

Are you suggesting removing the logical separation between data sets (such as data_1, data_2, etc.) and having a pre-defined set of fields (say 5,000 fields) under a generic key called 'data' that would accommodate all the different data sets applicable per document?

I picked 5,000 here because we already have a scenario where we need to store 3,500 fields for one client's data set.

Also, looking at this requirement in general, do you think we are trying to stretch the use case of Elasticsearch far beyond where it benefits our application requirements?

Thanks so much again,
Arun.

Yes. That is an approach I have seen used in the past for this type of requirement.

That is impossible to tell as I know nothing about the use case or query requirements.


Thanks Christian, appreciate it.

What I meant by the use case is a constantly changing schema and a large number of fields (5,000 to 10,000) required to store and analyse the data.

Also, the application has to be built as a multi-tenant SaaS platform with frequent updates to the data mapping as we release new features.

I don't know if I have given you enough details to think about the scenario.

Thanks,
Arun.

If you go with a standard set of fields you can create a static mapping and will not need to update mappings. If the data sets are small but the documents complex, a simple key-value store that does not impose a schema, and which you can query using map-reduce style jobs, might work quite well. It all comes down to how you need to query the data, how quickly you need the response, and the size and distribution of the data sets.

It might also be worthwhile looking into how you end up with so many fields. Are these distinct fields or do they contain auto-generated key names (e.g. dates) that might better be mapped some other way? What does data_1 in your example correspond to?


data_1 refers to a data file the client uploads (or a live stream of data from a business unit such as a call centre or user feedback), which will have thousands of operational metrics/question responses that they need to analyse together (aggregate queries). The main reason we chose Elasticsearch is that we wanted close-to-schemaless storage and fast real-time reporting for aggregations.

The number of documents under each data set could run into the hundreds of millions as well.

One thing to note here is that 95% of the stats or calculations we run are sums, averages, top counts, etc. for different combinations of filters on the data, and these are displayed as scores in the application. Text search only makes up about 5% of the total queries. The accuracy of these scores is very important to us.
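For example, a typical score query today looks roughly like the following (again using made-up field names in the style of the earlier example):

```
# Illustrative only: index and field names are examples
POST client_1_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "data_1.agent_name.keyword": "Alice" } }
      ]
    }
  },
  "aggs": {
    "avg_score": { "avg": { "field": "data_1.nps_score" } }
  }
}
```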

Thanks for your help,

Why have each question be a specific key? You would probably have far fewer fields if you created a question field with an id field underneath that contains the current field name. This will require changing the data structure, though. Elasticsearch is not schemaless, so it has problems in scenarios like this unless the format of the data can be optimised.
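As a rough sketch of that idea (index and field names are just examples), the questions could be modelled as a nested array with a small, fixed set of fields:

```
# Sketch only: index and field names are illustrative
PUT tenant_1_responses
{
  "mappings": {
    "properties": {
      "data_set_id": { "type": "keyword" },
      "questions": {
        "type": "nested",
        "properties": {
          "id":            { "type": "keyword" },
          "value_text":    { "type": "keyword" },
          "value_numeric": { "type": "double" }
        }
      }
    }
  }
}
```

Each document then holds entries like `{ "id": "Q17", "value_numeric": 9 }`, so thousands of distinct question names collapse into a handful of mapped fields, and aggregations filter on `questions.id` inside a `nested` aggregation instead of addressing one field per question.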


Thanks Christian.

Are you saying a structure like below would help?

[Image: ES_Structure_details_3]

Will there be any performance impact from the above?

In your experience, how many columns/keys can Elasticsearch handle with documents numbering up to 100 million?

Thanks again,

If you denormalize your data, you may want to split each document into a document per part instead of having large nested arrays.
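For example (again with illustrative names), each question response could become its own small document instead of an entry in a large nested array:

```
# Sketch only: index and field names are illustrative
POST tenant_1_responses/_doc
{
  "response_id": "r-123",
  "data_set_id": "data_1",
  "question_id": "Q17",
  "value_numeric": 9
}
```

This keeps individual documents small and lets you aggregate with plain term filters on `question_id`, at the cost of repeating the shared identifiers on every document.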


Thanks Christian.

I need to ask you one more thing regarding the number of column mappings. At present, there is an index with 9,766 field mappings and another index with 8,870 (checked using the index/_mapping/field/* API), and this will most likely grow in the near future.

Is it okay to have this many mappings? Could you also comment on the performance impact?

Thank you.

I do not know, as I tend to stay within the default limits.
