While trying out the new Kibana Code application, checking the indices it creates made me realise the current version (7.2.1 for me) does not handle resources efficiently. For every repository indexed, Kibana seems to create 3 indices (document, reference, and symbol), plus shared worker-queue indices:
health status index pri rep docs.count docs.deleted store.size memory.total
green open .code-document-automation_ansible_servers-vmware-a6c91e64-1 1 1 4 7 64.6kb 12.9kb
green open .code-document-configuration-elastic-6a94d4f7-1 1 1 348 345 1.7mb 21.9kb
green open .code-document-github.com-outsideit-firemotd-cb806c17-1 1 1 27 7 356.7kb 25.9kb
green open .code-reference-automation_ansible_servers-vmware-a6c91e64-1 1 1 0 0 566b 0b
green open .code-reference-configuration-elastic-6a94d4f7-1 1 1 0 0 566b 0b
green open .code-reference-github.com-outsideit-firemotd-cb806c17-1 1 1 0 0 566b 0b
green open .code-symbol-automation_ansible_servers-vmware-a6c91e64-1 1 1 0 0 566b 0b
green open .code-symbol-configuration-elastic-6a94d4f7-1 1 1 0 0 566b 0b
green open .code-symbol-github.com-outsideit-firemotd-cb806c17-1 1 1 0 0 566b 0b
green open .code_internal-worker-queue-2019-08-04 1 1 41 12 173.4kb 42.2kb
green open .code_internal-worker-queue-2019-08-11 1 1 297 15 327.7kb 70kb
green open .code_internal-worker-queue-2019-08-18 1 1 137 20 138.2kb 20.7kb
Is there some way to index all repository data into a single index? With about 200 repositories to index, we would need to reserve around 1200 shards...
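For reference, a back-of-the-envelope sketch of where that 1200 figure comes from, assuming the three per-repo indices shown above, each with one primary and one replica:

```shell
# Hypothetical shard-count estimate for Kibana Code at 200 repositories:
# 3 per-repo indices (document, reference, symbol), each 1 primary + 1 replica.
REPOS=200
INDICES_PER_REPO=3
COPIES=2   # 1 primary + 1 replica
echo $((REPOS * INDICES_PER_REPO * COPIES))   # prints 1200
```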
Thanks for reaching out. While creating shards does come with a bit of overhead, having 1200 shards on a cluster is generally not abnormal and shouldn't cause any noticeable performance change. Putting different document types into different indices helps optimize query time and keeps the implementation clean.
Could you elaborate a bit more on what your concern about having 1200 shards is?
Afaik shards definitely come with overhead... I'm running a 6-node cluster with roughly 4500 shards. I calculated that we can have about 750 shards per node with a 32 GB heap. If we used the Kibana Code application and needed an extra 1200 shards for Code indices that are only a few megabytes each, wouldn't that mean we can't use that heap for other shards which do use the available space efficiently (roughly 40-50 GB per shard)?
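The heap-based estimate above can be sketched as follows (the ~750-shards-per-32 GB-heap-node figure is the rule of thumb from the linked blog post, not a hard limit):

```shell
# Rough heap-based shard budget for the cluster described above. Illustrative
# only: real capacity depends on shard sizes, mappings, and workload.
NODES=6
SHARDS_PER_NODE=750        # rule-of-thumb ceiling for a 32 GB heap
CURRENT_SHARDS=4500
BUDGET=$((NODES * SHARDS_PER_NODE))      # 4500 shards total
HEADROOM=$((BUDGET - CURRENT_SHARDS))    # none left before adding Code indices
echo "budget=$BUDGET headroom=$HEADROOM"   # prints budget=4500 headroom=0
```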
See also https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Hey @willemdh. Sharding and overhead is a complicated topic. Much of the advice out there, including that blog article, consists of rules of thumb rather than hard-and-fast rules.
It's true that over-sharding is a problem, and that shards have a certain amount of overhead. But it's also true that mostly empty shards have considerably less overhead than "sorta-full" shards. It's unfortunately not as simple as dividing your available heap by the number of shards.
What I mean is that 200 GB of data spread over five shards (40 GB per shard) is more efficient than 200 GB spread over 50 shards (4 GB per shard). That's because smaller shards mean:
- Fixed overhead of the shards themselves (generally relatively small)
- Reduced compression, because the global set of terms is now spread over 50 independent shards, which means a lot of redundant duplication of term data in each shard's inverted index. With larger shards there is less of that duplication
- To a degree, reduced compression in doc values too, for similar reasons. Not quite as extreme as for the inverted index, because doc value compression works a little differently
- More segments, which means more work for the merge scheduler. Similarly, more segments means more segment churn, which reduces the effectiveness of the OS disk cache
- Similarly, ES only caches filters on segments larger than a certain size (to avoid caching short-lived segments). If the majority of segments fall below that threshold, you defeat those internal performance optimizations
- Cluster state size. The master needs to track all indices (and their mappings) and shards, and publish that state to every node.
So all of this feeds into the rules of thumb about shard size. But many of these items matter less when we are talking about very small shards: spreading 10 GB of total Code data over one shard or ten probably makes little difference.
Code shards should theoretically stay very small, and there is likely not a lot of "data overlap" between different repositories, so all the notes about compression are less relevant. E.g. due to size and lack of shared data, one shard vs. ten shards doesn't make much difference. Ditto for the caching effects, which probably don't matter since Code shards are unlikely to receive thousands of queries per second the way an e-commerce search would.
So the main considerations here are cluster state size and fixed overhead. But these are weighed against other logistical concerns like data management. E.g. in a "combined" index scenario, deleting a repository would become very expensive, as it requires a Delete-By-Query and all the associated merge pressure, whereas with index-per-repo a delete is fast and simple.
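As a sketch of that trade-off (the combined index name and `repo` field are hypothetical, and the cluster is assumed to be at localhost:9200; these calls need a live cluster to run):

```shell
# Index-per-repo: removing a repository is a single, cheap index deletion.
curl -X DELETE "localhost:9200/.code-document-github.com-outsideit-firemotd-cb806c17-1"

# Combined-index alternative (hypothetical "code-documents" index with a "repo"
# field): a Delete-By-Query must find and soft-delete every matching document,
# then rely on segment merges to actually reclaim the space.
curl -X POST "localhost:9200/code-documents/_delete_by_query" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"repo": "github.com/outsideit/firemotd"}}}'
```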
Hope that helps give some context about the decision to do an index-per-repo. That's not to say Code won't support other storage mechanisms in the future; it's just the decision the team went with for launch. And I think 200+ repos to index is on the extreme end of the cases considered, so it might not fit the original design considerations perfectly.
@polyfractal Thanks for the detailed explanation. Expensive Delete-By-Query operations are a good reason to keep them separate.
Np, happy to help! It's definitely a complex topic, and unfortunately one of those "it depends" situations.
Let us know how your experiments with Code go! As I said, I think the scale of your deployment (200+ repos) might be on the edge of the original design considerations, so there might be other sticking points you run into. We'd be eager to hear what works and what doesn't.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.