We are using Elasticsearch 2.3. This version has no limit on the number of fields in an index mapping, but ES 5 introduces a default limit of 1000, and we would like to understand the reasoning behind it. We already have over 7000 fields in one index and are doing fine so far, but we expect that number to grow significantly in the near future. So:
Could anyone point us to some documentation on why we shouldn't have too many fields in an index?
What problems could occur because of this?
What is the maximum limit on the number of fields in ES 2.3?
Cluster state overhead: the mapping for each index is stored in the cluster state, which is shared among all nodes. Any change to the cluster state (such as adding a new field) causes it to be updated across all nodes. Very large mappings mean a non-negligible amount of data has to be sent over the wire and refreshed on every node. It might seem unimportant, but having to periodically serialize a few MB of cluster state really adds up over time and can add unwanted latency to regular actions.
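One rough way to get a feel for how much your mappings contribute is to pull just the metadata portion of the cluster state for the index and measure how large the serialized mappings are. A minimal sketch (assuming a local node at localhost:9200 and a hypothetical index name my-index):

    import json
    import requests

    ES = "http://localhost:9200"   # assumed local node; adjust as needed
    INDEX = "my-index"             # hypothetical index name

    # Fetch only the metadata portion of the cluster state for this index
    # and measure how many bytes of JSON the mappings alone occupy.
    resp = requests.get(f"{ES}/_cluster/state/metadata/{INDEX}")
    resp.raise_for_status()
    mappings = resp.json()["metadata"]["indices"][INDEX]["mappings"]
    print(f"serialized mapping size: {len(json.dumps(mappings))} bytes")

That number is only a rough proxy (the actual cluster state is serialized differently and also includes settings, routing tables, etc.), but it gives a feel for how much data rides along with every mapping change.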
Sparsity: generally, people who have thousands of fields also tend to have very sparse fields, i.e. each document only uses a handful of the thousands of mapped fields. This makes the data structures stored on disk quite inefficient (less so in newer versions, but still not ideal) because the data is so sparse. You tend to see this kind of behavior when ES is used as a blind key:value store, or when multiple tenants share the same index and are allowed to create whatever fields they want.
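A quick way to check whether you are in that situation is to flatten the mapping into field paths and see what fraction of documents actually contains each field. A sketch, again assuming a local node and a hypothetical index name my-index:

    import requests

    ES = "http://localhost:9200"      # assumed local node
    INDEX = "my-index"                # hypothetical index name

    def leaf_fields(properties, prefix=""):
        # Recursively collect full field paths from a "properties" block,
        # including multi-fields declared under "fields".
        paths = []
        for name, spec in properties.items():
            path = prefix + name
            if "properties" in spec:                  # object / nested field
                paths.extend(leaf_fields(spec["properties"], path + "."))
            else:
                paths.append(path)
            for sub in spec.get("fields", {}):
                paths.append(path + "." + sub)
        return paths

    mappings = requests.get(f"{ES}/{INDEX}/_mapping").json()[INDEX]["mappings"]
    fields = []
    for type_mapping in mappings.values():            # 2.x/5.x mappings are keyed by type
        fields.extend(leaf_fields(type_mapping.get("properties", {})))

    total = requests.get(f"{ES}/{INDEX}/_count").json()["count"]
    print(f"{len(fields)} mapped fields, {total} docs")

    # How many documents actually contain each field? With thousands of
    # fields you would sample rather than hit _count once per field.
    for field in fields[:50]:
        q = {"query": {"exists": {"field": field}}}
        n = requests.get(f"{ES}/{INDEX}/_count", json=q).json()["count"]
        print(f"{field}: {n / max(total, 1):.1%} of docs")

If most fields show up in only a tiny fraction of documents, that's the sparsity pattern described above.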
Lucene overhead: in short, each field carries a fixed amount of per-field overhead in Lucene, so having thousands of fields eats up a non-trivial amount of memory and disk regardless of how much data each field actually holds.
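One way to keep an eye on that overhead is the segments section of the index stats API, which on 2.x/5.x breaks out the heap used per segment for terms, norms, and doc values; those numbers tend to grow with the number of fields. A rough sketch (assuming a local node and a hypothetical index name):

    import requests

    ES = "http://localhost:9200"   # assumed local node
    INDEX = "my-index"             # hypothetical index name

    # The segments stats report how much heap Lucene's per-segment structures
    # use; terms, norms, and doc-values memory all scale with the number of
    # fields, so watching them as the mapping grows gives a rough view of the
    # fixed per-field overhead.
    stats = requests.get(f"{ES}/{INDEX}/_stats/segments").json()
    seg = stats["_all"]["primaries"]["segments"]
    for key in ("memory_in_bytes", "terms_memory_in_bytes",
                "norms_memory_in_bytes", "doc_values_memory_in_bytes"):
        print(key, seg.get(key))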
The limit is a soft-limit, so you can change it if you want. But it's there for a reason, namely that we think >1000 fields is starting to get abusive and we'd recommend trying to pare down your fields with some kind of alternate scheme.
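On 5.x the soft limit is the dynamic index setting index.mapping.total_fields.limit, and one common alternate scheme is to store attributes as a nested array of key/value pairs so the mapping stays at a fixed handful of fields. A sketch of both, with hypothetical index names (on 2.x you'd use a not_analyzed string instead of keyword):

    import requests

    ES = "http://localhost:9200"   # assumed local node

    # Raising the 5.x soft limit is possible, but usually a hint that the
    # mapping should be restructured instead.
    requests.put(f"{ES}/my-index/_settings",
                 json={"index.mapping.total_fields.limit": 2000})

    # Alternate scheme: a fixed nested key/value mapping instead of one
    # top-level field per attribute.
    requests.put(f"{ES}/kv-index", json={
        "mappings": {
            "doc": {                                  # 2.x/5.x mappings still use a type name
                "properties": {
                    "attributes": {
                        "type": "nested",
                        "properties": {
                            "key":   {"type": "keyword"},
                            "value": {"type": "keyword"},
                        },
                    }
                }
            }
        }
    })

    # A document then carries its attributes as pairs rather than as fields:
    doc = {"attributes": [{"key": "color", "value": "red"},
                          {"key": "size",  "value": "xl"}]}
    requests.post(f"{ES}/kv-index/doc", json=doc)

The trade-off is that searches on attributes have to use nested queries that match key and value together, which is more verbose and typically somewhat more expensive per query, but the mapping and the cluster state stay small no matter how many distinct attributes show up.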
Thanks for your answer! I have a couple more questions.
Cluster state overhead: Is there a way to figure out the maximum cluster state overhead (in terms of size or number of fields) that we can tolerate for a given cluster size? We closely monitor various ES metrics and the latencies of all our regular actions, but we don't see any large latencies as of now. Will there be a sudden tipping point if the cluster state overhead crosses some limit?
Lucene overhead: Can you provide more information on this? Is there a way to track this overhead as the mapping size increases?