I have more than 4000 different fields in one of my indexes, and that number can grow over time.
Since Elasticsearch sets a default limit of 1000 fields per index, there must be some reason behind it.
So I am thinking that I should not increase the limit set by Elasticsearch,
and should instead break my single large index into multiple smaller indexes.
Before moving to multiple indexes, I have a few questions:
The number of smaller indexes could grow to around 50. Would searching across all 50 indexes at once be slower than searching the single large index?
Is there really a need to break my single large index into multiple indexes just because of the large number of fields?
When I use multiple smaller indexes, the total number of shards increases drastically (to more than 250). Each index would have 5 shards (the default, which I don't want to change), so a search across these indexes means searching all 250 shards at once. Will this affect my search performance? Note: the shard count may also grow over time.
When I use a single large index with only 5 shards and a large number of documents, won't that overload those 5 shards?
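For reference, a rough way to count the leaf fields from an index's mapping (a sketch only; it assumes an unsecured local cluster at localhost:9200, and "my-large-index" is a placeholder name):

```python
import requests

def count_fields(properties: dict) -> int:
    """Recursively count leaf fields, including multi-fields and sub-objects."""
    total = 0
    for field_def in properties.values():
        if "properties" in field_def:                       # object or nested field
            total += count_fields(field_def["properties"])
        else:
            total += 1 + len(field_def.get("fields", {}))   # also count multi-fields like .keyword
    return total

resp = requests.get("http://localhost:9200/my-large-index/_mapping")
mappings = resp.json()["my-large-index"]["mappings"]
print("leaf fields:", count_fields(mappings.get("properties", {})))
```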
There is. 4000 fields suggests your data model needs some rethinking. Do you really need to index all those fields separately?
You will need to benchmark this in your specific use case to find out for sure. You might find that reducing the number of fields in your mapping actually improves performance.
I think it's more likely that you can rethink how you are modelling your data to reduce the field count. In any case, 4000 fields is definitely too many.
You should avoid having too many small shards - see this article for more details. The default number of shards in future versions of Elasticsearch will be 1, as this is a more reasonable value for most cases.
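If you do end up with many indexes, remember that the primary shard count is fixed per index at creation time, so you don't have to accept 5 shards each. A minimal sketch (placeholder index name, assuming an unsecured local cluster at localhost:9200):

```python
import requests

# Create the index with a single primary shard rather than the old default of 5.
# "entity-1-v1" is just a placeholder name.
body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    }
}
resp = requests.put("http://localhost:9200/entity-1-v1", json=body)
print(resp.json())
```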
Can you please tell me what the problem is with having an index with a large number of fields? I have read through the Elasticsearch Reference guide but could not find any reasoning behind this. Are there any internal issues with it?
Yes, I have tried to lower the number of fields (at most I can get it down to 2500), but I need search to work across all of them. Even if I break my index into smaller parts, I would still need to search across all of them.
All the fields are different, and my search does not target fields by name. It's a free-text search that could match any field.
Given this use case, which approach (a single index or multiple indexes) should I prefer?
I understand that. Correct me if I am wrong: each field adds a small overhead, so how does it affect us if we have more than enough RAM? Can you tell me what happens internally, and point me to the documentation where I can read about what happens internally as the number of fields keeps increasing?
The overhead is not just RAM. There will also be a cost in terms of CPU usage, disk IO, network bandwidth and so on. Even looking at RAM usage alone, you will find that saving RAM lets your caches work more efficiently and therefore improves performance. You will, however, need to benchmark your specific situation to find out how large the effect is for your usage pattern.
You would need to dig into the internals of Lucene to find out the implementation details.
Is it true that if the number of fields in an index is below the default Elasticsearch limit (i.e. 1000 fields), there will be no mapping explosion?
The phrase "mapping explosion" is just another way of saying "you have too many fields". 1000 is definitely too many. The limit built into Elasticsearch is very conservative, designed to protect the cluster from egregious abuse, but frankly if you have more than 100 fields then there's probably a more efficient way to achieve what you want.
I have an entity (let's say Entity-1) which contains different sets of fields for different use cases.
If I combine all use cases in a single index, the field count can reach up to 4000.
If I define a separate index for each use case, the field count is around 100 per index.
With this approach, I would have around 40 different indexes. We also maintain a copy of their older versions, which brings the index count to 80.
Now, I will have at least 50 such entities: Entity-1, Entity-2, Entity-3 and so on.
This means my total index count would reach 50 (entities) * 80 (indexes per entity) = 4000 indexes.
In this scenario, I am going with the approach of one index with more fields rather than more indexes with fewer fields, so in the end I have only 50 (entities) * 2 (indexes per entity) = 100 indexes.
I have read about copy_to, but if I have 1000 fields in an index, using copy_to would only add one extra field, bringing it to 1001 fields. Correct me if I am missing something here.
There's a difference between fields that are presented and stored in the source JSON and those that are copied into indexed fields stored in the Lucene search index. The mapping definitions control this translation (hence the name "mapping").
David's suggestion is that you copy the 1000 fields' values into 1 indexed field inside Lucene. As an extreme example, you could have only 2 fields in Lucene - a stored one called _source with a big blob of your JSON, and a single indexed (i.e. searchable) field called something like all_my_string_fields_indexed_as_one.
You would choose to add dedicated indexed fields for selected JSON fields if you need to aggregate or search on them independently.
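A rough sketch of that kind of mapping (index and field names are placeholders, and an unsecured local cluster at localhost:9200 is assumed): two free-text JSON fields are funnelled via copy_to into one catch-all indexed field, while a field you want to query on its own keeps a dedicated mapping.

```python
import requests

mapping = {
    "mappings": {
        "properties": {
            # the single catch-all field that receives the text of every copied field
            "all_my_string_fields_indexed_as_one": {"type": "text"},
            # placeholder free-text fields, funnelled into the catch-all field
            "title":       {"type": "text", "copy_to": "all_my_string_fields_indexed_as_one"},
            "description": {"type": "text", "copy_to": "all_my_string_fields_indexed_as_one"},
            # a field you aggregate or filter on independently keeps its own mapping
            "status": {"type": "keyword"},
        }
    }
}
resp = requests.put("http://localhost:9200/entity-1-demo", json=mapping)
print(resp.json())
```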
The hypothesis I can build from your response is:
When the 1000 fields are indexed with copy_to, their values are indexed into one single field in Lucene.
Lucene would then contain only two fields: _source, which holds my JSON in its original format, and all_my_string_fields_indexed_as_one, which is searchable.
On a free-text search, only all_my_string_fields_indexed_as_one would be consulted, so my free-text search would effectively run against that single field.
If copy_to is not used, I would have 1001 fields in Lucene: _source, which holds my original JSON, plus the 1000 indexed fields, which are searchable. A free-text search would then run against all 1000 of those fields.
That is my understanding. Am I missing something?
Your understanding is correct. The important thing to add is that copy_to will copy a JSON field's value into an indexed field of your choice, but by default you'll still index the original JSON field under its JSON name. You need to set the index property to false to prevent this default behaviour. So each JSON field needs:
copy_to set to target something like the "all_my_string_fields_indexed_as_one" field, and
the index property set to false, to prevent the JSON field from also being indexed under its own name.
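In mapping terms, those two settings look roughly like this (placeholder index and field names, unsecured local cluster assumed):

```python
import requests

mapping = {
    "mappings": {
        "properties": {
            "all_my_string_fields_indexed_as_one": {"type": "text"},
            "title": {
                "type": "text",
                "copy_to": "all_my_string_fields_indexed_as_one",
                # the value still lives in _source but is no longer indexed under its own name
                "index": False,
            },
            "description": {
                "type": "text",
                "copy_to": "all_my_string_fields_indexed_as_one",
                "index": False,
            },
        }
    }
}
resp = requests.put("http://localhost:9200/entity-1-copyto", json=mapping)
print(resp.json())
```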
With copy_to, can I still search and aggregate on individual fields? If not, will I have to set the index property to true to search on them individually? That would imply the number of fields stored in Lucene is back to 1000 if I need to provide search and aggregation on individual fields.
If you want to search on a field individually, yes, you need an individually indexed field.
If you want to aggregate on a field individually you need doc_values enabled for that individual field.
These data structures do not come for free, so it shouldn't come as a surprise that the cost scales with the number of fields for which you choose to enable them.
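As a rough illustration of that split (placeholder names, unsecured local cluster assumed, and it presumes a keyword-mapped "status" field like the one sketched earlier): the free text goes against the combined field, while a terms aggregation runs on the dedicated keyword field, whose doc_values are enabled by default.

```python
import requests

query = {
    "query": {"match": {"all_my_string_fields_indexed_as_one": "free text to find"}},
    # the terms aggregation relies on doc_values for the "status" field
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}
resp = requests.post("http://localhost:9200/entity-1-demo/_search", json=query)
print(resp.json()["hits"]["total"])
```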
I have one more doubt. Suppose I have multiple indexes, and the number of distinct fields across all of them is 3000-4000.
Now, if a single query searches all of these indexes at once, is there a chance of a mapping explosion?
As I explained to your colleague Nikesh here, I consider a "mapping explosion" to be not a specific event but the general condition of having a lot of fields.
By that definition, yes, you will have a lot of fields.
If you want to consider the impact of field numbers on something more specific (query response times, disk space, RAM utilisation, indexing speeds...) I suggest you perform some benchmarking.
Thanks @Mark_Harwood for your response. Yes, even Nikesh is confused about the situation.
I would like to rephrase the question: when I divide a single index (containing about 3000-4000 fields) into multiple indexes (containing around 100 fields each) and search over all of them at once, I assume the search still spans about 3000-4000 fields. Would there be any difference in search performance between the single large index and the multiple smaller indexes?
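For what it's worth, the way I would write the query does not change when the data is split - one request can target many indexes via a comma-separated list or a wildcard pattern; only the number of shards being searched changes. A rough sketch (placeholder index names, unsecured local cluster at localhost:9200 assumed):

```python
import requests

# free-text search across every field, sent to all matching indexes at once;
# the wildcard covers e.g. entity-1-v1, entity-1-v2, ... in a single request
query = {"query": {"query_string": {"query": "free text to find", "fields": ["*"]}}}

resp = requests.post("http://localhost:9200/entity-*/_search", json=query)
print(resp.json()["hits"]["total"])
```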