I'll try to answer your questions inline.
- What are indices? I've read that documents inside shards are organized in indices.
An ES index is a namespace for a collection of documents. If you are coming from the relational database world, it is roughly analogous to a database. An index supports partitioning, since it can contain one or more shards, as well as replication, with a configurable number of replicas. The benefit of partitioning (sharding) is that the data is split into multiple shards, each a self-contained Lucene index, which can be distributed across multiple nodes.
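To make the two knobs concrete, here is a minimal sketch of the settings body that goes with an index-creation request (`PUT /<index-name>`). The helper name `index_settings` and the values 3 and 1 are assumptions for illustration only; Graylog creates its indices itself, so you would normally set these values in Graylog rather than send the request by hand.

```python
import json


def index_settings(primary_shards: int, replicas: int) -> dict:
    """Build the settings body for an index-creation request.

    number_of_shards is fixed when the index is created;
    number_of_replicas can be changed later on a live index.
    """
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


# Illustrative values: 3 primary shards, each with 1 replica copy.
body = index_settings(3, 1)
print(json.dumps(body, indent=2))
```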
- By default Graylog defines 20 indices. What does that mean, and how do they affect the amount of data that will be stored on an Elasticsearch node?
It simply means that there are 20 separate logical namespaces for holding data in ES. The maximum amount of data that can be stored in ES does not depend on the number of indices as such; it depends on the number of shards. Since one index can have multiple shards, the amount of data an index can hold is proportional to the number of shards it comprises.
- In a 3-node Elasticsearch setup, is 3 shards + 1 replica a correct allocation?
Also, the need for a replica is unclear to me. Why do I need a replica if my data will be indexed on a SAN drive that consists of multiple redundant SSD disks?
You can run the 3-node ES setup with 2 nodes holding the master + data roles and one node holding the master role only. With 3 primary shards and 1 replica, each of the two data nodes ends up holding a copy of all 3 shards, because ES never places a primary and its replica on the same node. This layout still leaves you the option of provisioning new data nodes later and relocating some shards to them. From what I have read, network-based storage is generally not recommended for ES for performance reasons: network storage is slower than locally attached disks. As for replicas, you need them so that if a node fails and some primary shards go down with it, the replicas can take over and keep responding to requests. Redundant disks in the SAN protect you against a disk failure, not against the failure of the ES node itself.
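The effect of a replica on node failure can be sketched with a toy placement model. This is not how the ES allocator actually works; it only enforces the one rule that matters here, that two copies of the same shard never share a node. The function names and the node names `data-1` and `data-2` are made up for the example.

```python
from collections import defaultdict


def place_shards(primaries: int, replicas: int, data_nodes: list) -> dict:
    """Toy placement: copy c of shard s goes to node (s + c) % n.

    While copies <= n, two copies of one shard never share a node.
    """
    copies = replicas + 1
    if copies > len(data_nodes):
        # A real cluster would not fail here; the extra replicas would
        # simply stay unassigned until more data nodes join.
        raise ValueError("not enough data nodes for this replica count")
    layout = defaultdict(list)
    for shard in range(primaries):
        for copy in range(copies):
            node = data_nodes[(shard + copy) % len(data_nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append(f"shard{shard}-{role}")
    return dict(layout)


def survives_failure(layout: dict, failed_node: str, primaries: int) -> bool:
    """True if every shard still has at least one copy on a surviving node."""
    surviving = {
        entry.split("-")[0]
        for node, entries in layout.items()
        if node != failed_node
        for entry in entries
    }
    return len(surviving) == primaries


nodes = ["data-1", "data-2"]
with_replica = place_shards(3, 1, nodes)
without_replica = place_shards(3, 0, nodes)

for node, entries in with_replica.items():
    print(node, entries)
print("1 replica, data-1 fails, all shards available:",
      survives_failure(with_replica, "data-1", 3))
print("0 replicas, data-1 fails, all shards available:",
      survives_failure(without_replica, "data-1", 3))
```

With one replica, either data node can be lost and every shard is still reachable; without replicas, losing a node takes its shards offline no matter how redundant the disks underneath are.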
- Is there a specific formula for calculating the combination of shards/indices/retention strategy?
There isn't a simple formula; it depends on the use case. Some basic considerations:
i. Optimal shard size is 10-40 GB. Try to keep each shard below the 40 GB limit.
ii. Keep the number of shards on each ES data node below about 20 per 1 GB of JVM heap. This is an upper bound, not a target.
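The two considerations above can be combined into a rough back-of-the-envelope check. This is only a sketch: the function name `plan_shards`, the 600 GB of retained data, the 30 GB target shard size, the 8 GB heap and the shards-per-GB ratio passed in are all illustrative assumptions, so substitute your own numbers.

```python
import math


def plan_shards(total_data_gb: float, target_shard_gb: float,
                heap_gb_per_node: float, data_nodes: int,
                replicas: int, max_shards_per_heap_gb: float) -> dict:
    """Estimate primary shard count and compare it to a cluster shard budget."""
    # Enough primaries that each one stays at or below the target size.
    primaries = math.ceil(total_data_gb / target_shard_gb)
    # Every primary is stored once plus once per replica.
    total_copies = primaries * (1 + replicas)
    # Shard budget across all data nodes, derived from the heap ratio.
    budget = int(heap_gb_per_node * max_shards_per_heap_gb * data_nodes)
    return {
        "primaries": primaries,
        "total_shard_copies": total_copies,
        "shard_budget": budget,
        "within_budget": total_copies <= budget,
    }


# Hypothetical cluster: 600 GB retained, 30 GB per shard, 2 data nodes
# with 8 GB heap each, 1 replica, illustrative ratio of 20 shards per GB.
plan = plan_shards(600, 30, 8, 2, 1, 20)
print(plan)
```

If `within_budget` comes out false, the usual levers are fewer and larger shards, a shorter retention period, or more data nodes.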
Hope this helps you out.