I'm in the process of rebuilding our cluster, and after some reading I'm still not sure I've got it right.
Current status: a 3-node cluster used for log storage from our servers. I process the logs with logstash and save them to daily indexes, separating the different log types (httpd, postfix, ...) with a type field. We create roughly 10GB-18GB of logs per day, and my indexes have 6 shards with 1 replica.
What I want from the new cluster is better space efficiency, faster indexing speed and some retention control.
I did some reading and have a rough plan in mind, but I'm not sure I got everything right. Please point out any mistakes.
Space efficiency #1: Currently I have all logs in a single index, separated by type. As far as I've read, if an index mapping has 100 fields, all 100 fields are saved with every indexed document. So if I index a document with 10 fields, ES will still save all 100 fields, most of them empty. Is that right? If so, I was thinking of going with a separate index per type.
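To show what I mean, this is how I check how wide the combined mapping of one of our shared daily indexes has grown; the index name is just an example from my naming scheme:

```shell
# Dump the mapping of one daily index; each type's field list shows
# how many fields the shared mapping has accumulated across all types
# (index name is a hypothetical example).
curl 'localhost:9200/logstash-2016.01.01/_mapping?pretty'
```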
Space efficiency #2: If I separate logs into one index per type, some daily indexes will be really small, so I was thinking of moving to weekly or even monthly indexes. Currently some types don't reach 100MB/day, while the biggest ones are around 5GB/day. If I go with a single index per month, would a 150GB-200GB index be too big? What is a recommended index size?
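A sketch of how I imagine setting this up, assuming an index template that matches monthly, per-type index names (the template name, pattern and settings below are placeholders, not a tested config):

```shell
# Apply shard/replica settings to every index matching logs-*,
# e.g. logs-httpd-2016.01, logs-postfix-2016.01 (names are examples).
curl -XPUT 'localhost:9200/_template/logs_monthly' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```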
Retention control: If I go with one index per log type, it will be easier to maintain a retention policy. When one type of log becomes obsolete, I can simply delete its index and that's it. With the current setup, the only options I know of are Delete By Query, which is slow, or reindexing; both require a lot of work or resources.
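That is, with per-type indexes, expiring old data would become a single call instead of a Delete By Query that has to find and mark every matching document (the index name here is hypothetical):

```shell
# Retention by index deletion: drops the whole index in one operation
# instead of rewriting segments (index name is hypothetical).
curl -XDELETE 'localhost:9200/logs-postfix-2015.01'
```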
Security: I want to limit access with a reverse proxy. I read on your site that it's not a 100% safe option, but I think it will be good enough for us. Right now I'm using aliases for ACL, but it will be easier with one index per type, since I can just limit/rewrite URLs and give people access to a certain type of logs.
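As a sketch of the URL-limiting idea, assuming nginx in front of ES (the upstream address and index pattern are placeholders, and this is not a complete or hardened config):

```nginx
# Allow searches against httpd log indexes only; everything else is
# rejected. Upstream address and index pattern are assumptions.
location ~ ^/logs-httpd-[^/]+/_search {
    proxy_pass http://127.0.0.1:9200;
}
location / {
    return 403;
}
```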
Shards / replicas: I know this is the most frequently asked question :) How many shards and replicas? AFAIK you would ideally want 1 shard per index, since every additional shard brings some overhead with it (ES has to fetch data from all shards and then merge the results). But if I only have one shard, I can't scale horizontally. I think I need some more info from you guys:
- What is a recommended shard size? I think I read somewhere that a shard should be smaller than the heap size (31GB in my case).
- Do more shards mean faster indexing?
- Do more shards mean faster searching? What about replicas: are they used in searching? Can more replicas make my searches faster?
- If my index size is 5GB/month, is there any point in breaking it into more shards? I don't need the indexing performance, and I can take care of availability with replicas.
- If my index size is 200GB/month and my first point is valid, would the recommended number of shards be 6-7?
- If I have multiple data paths (path.data) on a single node, will one shard be spread across multiple data paths, or will the whole shard live in a single path?
- Let's say I have 2 paths per node and 3 nodes. If my index has 6 shards, ES will save 2 shards per node. But will it also save each shard on a node to a different data path if there is free space?
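Once the new cluster is up, I assume the actual placement can be checked with the _cat API, which lists which node each shard copy landed on (the index pattern is hypothetical):

```shell
# Show each shard's state, size and assigned node for the new indexes.
curl 'localhost:9200/_cat/shards/logs-*?v'
```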
I think that's it. I hope I'm not asking questions that are too basic.
Thank you all, Matej