Our use case allows customers to create tiny indices, which means there might be too many shards in a cluster (1 shard per index, since these are tiny). Each index might have fewer than 10k records.
I was wondering if it's OK to have small shards. A cluster typically has 5-8 data nodes and an additional 3 master nodes. I understand that the answer depends on many factors like fielddata, analyzed strings, etc. Hence, let's assume an index has double and date fields only, no fielddata.
What would be considered too many shards per node? Are 10k shards per node too many?
That is likely to be very inefficient and lead to a large cluster state. Please read this blog post on the topic.
I have seen users try to scale multi-tenancy this way in the past on a few occasions, and although there is no hard limit on index and shard count in Elasticsearch, it has generally scaled and performed badly, even when multiple, small 3-node clusters have been used in order to reduce the size of the cluster state.
Thanks Christian. This post has a lot of details, thank you for writing it.
It kind of scares me because the post is heavy on search usage, while we use it for data analytics, which means we cannot combine indices as mentioned in the post, since these indices have different mappings. That gives me the impression that we shouldn't use ES.
How large are your documents? How many fields do these documents typically have? Do all the different types of documents have different fields? Are there still mapping conflicts, given that you stated that only a few different data types are used?
How large are your documents?
Depends. Let me know if there is an easy way to get this info.
How many fields do these documents typically have
Typically 100-400 with some exceptions
Do all the different types of documents have different fields?
All documents in one index have the same fields.
Are there still mapping conflicts, given that you stated that only a few different data types are used?
Not sure what a mapping conflict is. We typically use date, double, keyword (with lowercase normalizer) and rarely text with fielddata enabled (with analyzer)
All our nodes have 30 GB of total memory (15 GB for the JVM heap and 15 GB for Lucene) and 16 cores.
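On the question of an easy way to get document size: one rough approach is to divide the primary store size by the document count reported by the index stats API. The sketch below assumes the stats response has already been fetched and parsed into a dict; the function name and the sample numbers are made up for illustration. Note that this measures on-disk size including index structures, not the raw JSON size of a document.

```python
def average_doc_size_bytes(stats: dict) -> float:
    """Average primary store bytes per document, from a parsed index stats response.

    Reads the _all.primaries.docs.count and _all.primaries.store.size_in_bytes
    values. The result is an on-disk estimate, not the raw source size.
    """
    primaries = stats["_all"]["primaries"]
    doc_count = primaries["docs"]["count"]
    store_bytes = primaries["store"]["size_in_bytes"]
    if doc_count == 0:
        return 0.0
    return store_bytes / doc_count


# Hypothetical sample, shaped like the relevant part of a stats response.
sample_stats = {
    "_all": {
        "primaries": {
            "docs": {"count": 8000},
            "store": {"size_in_bytes": 16_000_000},
        }
    }
}

print(average_doc_size_bytes(sample_stats))  # 2000.0
```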
You can place different types of documents in the same index as long as they do not contain the same field with different mappings. Not every document in an index needs to contain all fields. Do these different types of documents have a lot of common fields?
Every index has a fixed set of fields (fixed schema and structured). So, the fields will be exactly the same for all documents in one index.
If we were to combine multiple indices into one by using different types, would that work better for ES (see below)? We can't combine them while keeping the same type, since the field names can conflict (this is probably what you meant by a mapping conflict).
Right now:
index1/type1/document
index2/type1/document
After the change (is this relatively better for ES?):
index1/type1/document
index1/type2/document
The ability to have multiple types in an index is going away, so if you are designing the system now I would stay away from using multiple types. You can however create a field on each document indicating the type and use this for filtering, similar to how you would only query a specific type.
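A minimal sketch of the type-field idea, building plain request bodies as Python dicts. The field name `doc_type` and both helper names are hypothetical, and the sketch assumes the field is mapped as `keyword` so that a `term` filter matches it exactly.

```python
from typing import Optional


def tag_document(doc: dict, doc_type: str, type_field: str = "doc_type") -> dict:
    """Return a copy of the document with a field recording its logical type."""
    tagged = dict(doc)
    tagged[type_field] = doc_type
    return tagged


def query_for_type(doc_type: str,
                   inner_query: Optional[dict] = None,
                   type_field: str = "doc_type") -> dict:
    """Build a search body restricted to one logical type via a bool filter."""
    bool_query = {"filter": [{"term": {type_field: doc_type}}]}
    if inner_query is not None:
        # The caller's own query runs only against documents of this type.
        bool_query["must"] = [inner_query]
    return {"query": {"bool": bool_query}}


# Hypothetical usage: documents that used to live in index2 get tagged instead.
doc = tag_document({"amount": 12.5, "created": "2017-09-01"}, "index2")
body = query_for_type("index2", {"range": {"amount": {"gte": 10}}})
print(doc)
print(body)
```

The filter clause does not affect scoring, so it behaves much like restricting a query to a single type did.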
Having indices with sparse fields is generally bad (it takes up more disk space) in Elasticsearch 5.x, but this is being improved in the upcoming Elasticsearch 6.0.
You can however create a field on each document indicating the type and use this for filtering, similar to how you would only query a specific type.
By type, do you mean data type? An example would help. I am thinking about a case where two indices (to be combined) have a field with the same name and the same data type.
BTW, disk space is not a problem for us. Seems like memory is one of the problems.
If you have the same field name in multiple indices with the same mapping/data type, this is what I refer to as common fields. What you need to look out for is indices with the same field name but different mappings/data types, as these will not be able to share an index.
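To make the distinction concrete, here is a rough sketch that checks which fields would block a merge. It assumes each index's mapping has been flattened to a simple field name to data type dict, which is a simplification: real mappings are nested and also carry analyzers, normalizers and other settings that would need comparing. The function name and the sample indices are made up.

```python
from collections import defaultdict


def find_mapping_conflicts(index_mappings: dict) -> dict:
    """Find fields that appear with more than one data type across indices.

    index_mappings maps an index name to a {field name: data type} dict.
    Returns {field name: {data type: [index names using that type]}}.
    """
    types_by_field = defaultdict(lambda: defaultdict(list))
    for index_name, fields in index_mappings.items():
        for field_name, data_type in fields.items():
            types_by_field[field_name][data_type].append(index_name)
    return {
        field: dict(by_type)
        for field, by_type in types_by_field.items()
        if len(by_type) > 1
    }


# Hypothetical tenant indices.
mappings = {
    "customer_a": {"created": "date", "amount": "double"},
    "customer_b": {"created": "date", "amount": "keyword"},
    "customer_c": {"created": "date", "score": "double"},
}

print(find_mapping_conflicts(mappings))
```

Here `created` is a common field, `score` is simply sparse, and `amount` is the conflict: `customer_a` and `customer_b` could not share an index without renaming or remapping it.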
I guess that becomes a complex solution, because now we have fields that were in the same index distributed across indices (if I understand your suggestion correctly). Aggregations and filtering become a challenge.