I am new to elasticsearch so forgive me if this question is stupid. We have millions of json documents in Azure blob storage that we would like to index in Elastic. After importing just a thousand we've reached the default field limit of 1000. I have noticed that we have a lot of objects that are of the same type, for example IDs. Is there a way to avoid creating new fields for every ID in a document?
You should use the dynamic mapping setting for that. During dev you can perhaps increase the total fields limit setting just so you have the index mapping written out for you, but you really want to avoid a mapping explosion.
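A minimal sketch of what that looks like (my-index and the identifiers object are placeholders; dynamic accepts true, false, strict or runtime):

```
PUT my-index
{
  "mappings": {
    "dynamic": true,
    "properties": {
      "identifiers": {
        "type": "object",
        "dynamic": false
      }
    }
  }
}
```

Anything indexed under identifiers is still kept in _source, it just doesn't create new mapped fields, so the 1000-field limit stops growing with every new ID.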
Another thing to consider is remodeling your data in a way that better reuses the recurring fields; it seems like lineId and salesOrderLineId are in fact the same kind of object, and maybe there's a better way to model this for Elasticsearch.
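For example, purely as a sketch (the scheme / value field names and the sample IDs are made up), the recurring ID fields could be stored as one repeated structure instead of a separate field per ID type:

```
{
  "identifiers": [
    { "scheme": "lineId",           "value": "L-000123" },
    { "scheme": "salesOrderLineId", "value": "SO-998877" }
  ]
}
```

Mapped as nested (or flattened) with scheme and value as keyword fields, the mapping stays at a handful of fields no matter how many ID variants turn up in the documents.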
Can you provide some background for this initiative, please: are you planning to search on each and every field? To aggregate? To order by? Is a nested structure mandatory, etc.?
I think by default Elasticsearch creates two fields for each string, unless you pre-defined your mappings: a text field for full-text search and a keyword sub-field for exact matches and aggregations, so I would expect to see schemeAgencyName and schemeAgencyName.keyword.
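With default dynamic mapping a JSON string ends up mapped roughly like this, so each new string field actually counts twice against the field limit:

```
"schemeAgencyName": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
```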
We are in an explorative phase at the moment. We don't know which fields we would like to perform queries on, which to aggregate etc., so we thought we would index it all as a test. Unfortunately we think that a nested structure is needed.
Yes, nested sounds like a reasonable approach, and I agree with @Itamar_Syn-Hershko: use dynamic mapping and also limit it in depth.
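The depth cap is an index setting; a sketch with an arbitrary value (the default is 20, and my-index is a placeholder):

```
PUT my-index/_settings
{
  "index.mapping.depth.limit": 5
}
```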
To avoid a potential mapping explosion perhaps you need to consider what fields to exclude from the mapping. But in my mind this type of initiative should be driven by the product management team, who should prescribe exactly what the customers may require.
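One way to keep a whole sub-object out of the mapping (rawPayload is just a hypothetical name for an object you never need to query) is enabled: false; the content is kept in _source but never indexed or mapped:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "rawPayload": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
```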
There's a hint here of just blindly indexing data without first really considering what the data represents, and how, and then what you are going to do with it, and a sort of implicit assumption that the data is all "similar" enough.
The "millions of json documents in Azure blob storage" needs at least some pre-analysis first, check what's there. e.g. on the IDs, maybe at some point in the lifecycle data what was previously known as salesOrderId was renamed to salesOrderLineId, or other changes that you cant possibly know by looking at just a small subset, as you found.
The "let's get it all into ES and figure it out later" approach is just unwise IMHO. YMMV
It's fair play to ingest everything into ES with the limitations removed to explore the mapping and structure. In dev, mapping explosion is not a concern. The remodeling discussion only comes later, when everything is laid out in front of you. In many cases the schemas are not well defined or understood at that stage, and having it all in one huge mapping often actually helps.
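If you go that route, the field limit can be raised on a throwaway dev index (the number below is arbitrary, just big enough for exploration; don't carry it into production):

```
PUT my-exploration-index/_settings
{
  "index.mapping.total_fields.limit": 10000
}
```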
However, I'd try to avoid nested in the final design if possible.
In essence we're disagreeing about when to do effectively the same work - which is to better understand the source data.
I can easily be persuaded there are cases where having all the data in one place helps with that. But this isn't IMHO one of them: the data is all in one place already (millions of json documents in Azure blob storage). Dumping it into ES means you can analyze it in ES, which is good. But (though this is an Elastic forum) we must be mindful not to fall into the trap of thinking Elastic is the only tool in town. Plus just getting all the data into ES might be far from trivial, and certainly has a non-zero cost.
@RainTown I agree. If you have a lot of data with reasonably uniform mappings and want to explore the data, I would recommend importing it into Elasticsearch and then running the analysis there.
In this case it unfortunately sounds like there are very different schemas across the full data set, which could potentially result in a lot of mapping conflicts (in addition to the large field count) that would need to be resolved in order to even get the data into Elasticsearch in the first place. I suspect it would likely be more efficient and lower effort to analyse the schema of the total data set where it is currently stored and then find a way to transform and load it.