I am new to elasticsearch so forgive me if this question is stupid. We have millions of json documents in Azure blob storage that we would like to index in Elastic. After importing just a thousand we've reached the default field limit of 1000. I have noticed that we have a lot of objects that are of the same type, for example IDs. Is there a way to avoid creating new fields for every ID in a document?
You should use the dynamic mapping setting for that. During dev you can perhaps increase the total fields limit setting just so you have the index mapping written out for you, but you really want to avoid a mapping explosion.
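A minimal sketch of what that looks like (my-index and the identifiers object are placeholders; dynamic accepts true, false, strict or runtime):

```
PUT my-index
{
  "mappings": {
    "dynamic": true,
    "properties": {
      "identifiers": {
        "type": "object",
        "dynamic": false
      }
    }
  }
}
```

Anything indexed under identifiers is still kept in _source, it just doesn't create new mapped fields, so the 1000-field limit stops growing with every new ID.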
Another thing to consider is remodeling your data in a way that better reuses the recurring fields; it seems like lineId and salesOrderLineId are in fact the same kind of object, and maybe there's a better way to model this for Elasticsearch.
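For example, purely as a sketch (the scheme / value field names and the sample IDs are made up), the recurring ID fields could be stored as one repeated structure instead of a separate field per ID type:

```
{
  "identifiers": [
    { "scheme": "lineId",           "value": "L-000123" },
    { "scheme": "salesOrderLineId", "value": "SO-998877" }
  ]
}
```

Mapped as nested (or flattened) with scheme and value as keyword fields, the mapping stays at a handful of fields no matter how many ID variants turn up in the documents.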
Can you provide some background for this initiative, please: are you planning to search on each and every field? To aggregate? To order by? Is a nested structure mandatory, etc.?
I think by default Elasticsearch creates two fields for each string, unless you pre-defined your mappings: a text field for full-text search and a keyword sub-field for exact matches and aggregations, so I would expect to see schemeAgencyName and schemeAgencyName.keyword.
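With default dynamic mapping a JSON string ends up mapped roughly like this, so each new string field actually counts twice against the field limit:

```
"schemeAgencyName": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
```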
We are in an explorative phase at the moment. We don't know which fields we would like to perform queries on, which to aggregate etc., so we thought we would index it all as a test. Unfortunately we think that a nested structure is needed.
Yes, nested sounds like a reasonable approach, and I agree with @Itamar_Syn-Hershko: use dynamic mapping and also limit it in depth.
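The depth cap is an index setting; a sketch with an arbitrary value (the default is 20, and my-index is a placeholder):

```
PUT my-index/_settings
{
  "index.mapping.depth.limit": 5
}
```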
To avoid a potential mapping explosion perhaps you need to consider what fields to exclude from the mapping. But in my mind this type of initiative should be driven by the product management team, who should prescribe exactly what the customers may require.
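One way to keep a whole sub-object out of the mapping (rawPayload is just a hypothetical name for an object you never need to query) is enabled: false; the content is kept in _source but never indexed or mapped:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "rawPayload": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
```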
There's a hint here of just blindly indexing data without first really considering what the data represents, and how, and then what you are going to do with it, and a sort of implicit assumption that the data is all "similar" enough.
The "millions of json documents in Azure blob storage" needs at least some pre-analysis first, check what's there. e.g. on the IDs, maybe at some point in the lifecycle data what was previously known as salesOrderId was renamed to salesOrderLineId, or other changes that you cant possibly know by looking at just a small subset, as you found.
The "let's get it all into ES and figure it out later" approach is just unwise IMHO. YMMV
It's fair play to ingest everything into ES with the limitations removed to explore the mapping and structure. In dev, mapping explosion is not a concern. The remodeling discussion only comes later, when everything is laid out in front of you. In many cases the schemas are not well defined or understood at that stage, and having it all in one huge mapping often actually helps.
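If you go that route, the field limit can be raised on a throwaway dev index (the number below is arbitrary, just big enough for exploration; don't carry it into production):

```
PUT my-exploration-index/_settings
{
  "index.mapping.total_fields.limit": 10000
}
```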
However, I'd try to avoid nested in the final design if possible.
In essence we're disagreeing about when to do effectively the same work - which is to better understand the source data.
I can easily be persuaded there are cases where having all the data in one place helps with that. But this isn't IMHO one of them: the data is all in one place already (millions of json documents in Azure blob storage). Dumping it into ES means you can analyze it in ES, which is good. But (though this is an Elastic forum) we must be mindful not to fall into the trap of thinking Elastic is the only tool in town. Plus just getting all the data into ES might be far from trivial, and certainly has a non-zero cost.
@RainTown I agree. If you have a lot of data with reasonably uniform mappings and want to explore the data, I would recommend importing it into Elasticsearch and then running the analysis there.
In this case it unfortunately sounds like there are very different schemas across the full data set, which could potentially result in a lot of mapping conflicts (in addition to the large field count) that would need to be resolved in order to even get the data into Elasticsearch in the first place. I suspect it would likely be more efficient and lower effort to analyse the schema of the total data set where it is currently stored and then find a way to transform and load it.