Rebuilding cluster, a few questions about shards, index organization and retention

Hello,

I'm in the process of rebuilding our cluster and after some reading, I'm still not sure I get it right.

Current status: a 3-node cluster used for log storage from our servers. I process the logs with Logstash and save them to daily indexes, separating the different log types (httpd, postfix, ...) with a `type` field. We create about 10GB-18GB of logs per day, and my indexes have 6 shards with 1 replica.

What I want from the new cluster is better space efficiency, fast indexing speed, and some retention control.

I did some reading and have sort of a plan in mind, but I'm not sure if I got everything right. Please, point out any mistakes.

  • Space efficiency #1: Currently I have all logs in a single index, separated by type. As far as I've read, if an index mapping has 100 fields, all 100 fields are saved with every indexed document. So if I have a document with only 10 fields, ES will still save all 100 fields, but most of them will be empty. Is that right? If so, I was thinking of going with a single index per type.
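For the index-per-type idea, the Logstash `elasticsearch` output can build the index name from the event's `type` field. A minimal sketch (hosts and the exact index pattern are illustrative, assuming `type` is set on every event):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # one index per log type per day, e.g. "httpd-2016.02.01"
    index => "%{type}-%{+YYYY.MM.dd}"
  }
}
```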

  • Space efficiency #2: If I separate logs into one index per type, some daily indexes will be really small, so I was thinking of moving to weekly or even monthly indexes. Currently, some indexes don't reach 100MB/day, while the biggest ones are around 5GB/day. If I go with a single index per month, would a 150GB-200GB index be too big? What is a recommended index size?

  • Retention control: If I go with one index per log type, it will be easier to maintain a retention policy. When one type of log becomes obsolete, I can just delete its index and that's it. With the current setup, the only options I can think of are Delete By Query, which is slow, or reindexing. Both require lots of work or resources.
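For comparison, dropping a whole per-type index once it's obsolete is a single API call, while Delete By Query has to touch every matching document. A sketch (the index name is illustrative):

```
# remove an obsolete per-type index in one call
DELETE /postfix-2016.01
```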

  • Security: I want to limit access with a reverse proxy. I read on your site that it's not a 100% safe option, but I think it will be good enough for us. Right now I'm using aliases for ACLs, but it will be easier for us with one index per type, since I can just limit/rewrite the URL and give people access to a certain type of logs.

  • Shards / replicas: I know this is the most frequently asked question :) How many shards and replicas? AFAIK, ideally you would want 1 shard per index, since every additional shard brings some overhead with it (ES has to fetch data from all shards and then merge the results). But if I only have one shard, I can't scale horizontally. I think I need some more info from you guys:

    • What is a recommended shard size? I think I read somewhere that shard size should be less than the heap size (31GB in my case).
    • Does more shards mean faster indexing speed?
    • Does more shards mean faster searching speed? What about replicas, are they used in searching? Can more replicas make my searches faster?
    • If my index size is 5GB/month, is there any point in breaking it into more shards? I don't need the indexing performance, and I can take care of availability with replicas.
    • If my index size is 200GB/month and my first point is valid, the recommended number of shards would be 6-7?
    • If I have multiple data.path entries on a single node, will one shard be split across multiple data paths, or will the whole shard be stored in a single path?
    • Let's say I have 2 paths/node and 3 nodes. If my index has 6 shards, ES will save 2 shards per node. But will it also put each shard on a node into a different data path if there is free space?

I think that's it. I hope I'm not asking too basic questions.

Thank you all, Matej

[quote="matejzero, post:1, topic:50066"]
What is a recommended shard size? I think I read somewhere that shard size should be less than the heap size (31GB in my case).[/quote]
No, this has never been the case. We suggest keeping shards to less than 50GB purely because moving more than that around for (re)allocation/replication is cumbersome.

Does more shards mean faster indexing speed?

Yes, because you are writing small parts to many shards.

Does more shards mean faster searching speed? What about replicas, are they used in searching? Can more replicas make my searches faster?

Depends, yes, yes.

If my index size is 5GB/month, is there any point in breaking it into more shards? I don't need the indexing performance, and I can take care of availability with replicas.

No, a single shard is fine.

If my index size is 200GB/month and my first point is valid, the recommended number of shards would be 6-7?

I'd go with 4.
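That follows from the ~50GB-per-shard guideline above; a quick back-of-envelope check (sizes taken from the question):

```python
import math

monthly_index_gb = 200  # worst-case monthly index size from the question
max_shard_gb = 50       # suggested upper bound per shard

# smallest shard count that keeps each shard under the suggested bound
shards = math.ceil(monthly_index_gb / max_shard_gb)
print(shards)  # -> 4
```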

If I have multiple data.path entries on a single node, will one shard be split across multiple data paths, or will the whole shard be stored in a single path?

The latter.

Let's say I have 2 paths/node and 3 nodes. If my index has 6 shards, ES will save 2 shards per node. But will it also put each shard on a node into a different data path if there is free space?

It's not that smart.

If you use nginx, add SSL to it, and add htpasswd authentication, you should be fine.
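A minimal sketch of that nginx setup (certificate paths, port, and the htpasswd file location are all illustrative):

```
server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/ssl/es.crt;
    ssl_certificate_key /etc/nginx/ssl/es.key;

    # basic auth in front of the cluster
    auth_basic           "Elasticsearch";
    auth_basic_user_file /etc/nginx/htpasswd;

    # with one index per log type, access can be limited by URL prefix,
    # e.g. only expose the httpd-* indexes through this location
    location ~ ^/httpd-[^/]*/ {
        proxy_pass http://127.0.0.1:9200;
    }
}
```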

Great!

Thank you both for answers.

One more thing I'm having trouble with: how many indexes/shards is too much (per cluster / per node)? Is 600 indexes (each with 1 shard and 1 replica) too much? (Let's say I have 1 monthly index per log type, 10 different types, and keep the data for 5 years.) Older indexes are optimized at the end of the month. Searches are mostly done inside the same log type and rarely span different types.
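For scale, the plan above works out roughly like this (all numbers taken from the question):

```python
# back-of-envelope shard count for the retention plan described above
log_types = 10
months_retained = 5 * 12                  # 5 years of monthly indexes
indices = log_types * months_retained     # 600 indexes
copies_per_index = 1 + 1                  # 1 primary shard + 1 replica
total_shards = indices * copies_per_index
shards_per_node = total_shards / 3        # spread over the 3 nodes
print(indices, total_shards, shards_per_node)  # -> 600 1200 400.0
```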

Thanks, Matej

Another thing:

From the resource usage point of view, is there any difference if I have 1 index with 2 shards and 0 replicas, or 1 index with 1 shard and 1 replica?

If I understand correctly, these are the pros and cons:

  • 1 index with 2 shards and 0 replicas:
    • faster indexing
    • no HA
    • faster searching if data is in 2 different shards
  • 1 index with 1 shard and 1 replica:
    • slower indexing
    • HA
    • faster searching no matter where the data is

No, a shard is a shard is a shard.
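In create-index terms, the two layouts being compared are (index names illustrative):

```
# 2 primary shards, no replicas
PUT /logs-a
{ "settings": { "number_of_shards": 2, "number_of_replicas": 0 } }

# 1 primary shard, 1 replica: same total number of shard copies on disk
PUT /logs-b
{ "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
```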