I'm about to move my 4-month-old testing system into production; my only concern is whether I've made the right decisions on the initial mapping and shard allocation. My plan is to use Logstash to load logs from approximately 15 sources (applications) from CSV files into Elasticsearch. Altogether that is roughly 30 million documents per month (about 2 million per source). I would like to keep the data available for one year (after that, just close the indices and reopen them when needed).
I'm using Kibana 4 with heavy aggregations (unique counts etc.), and I don't have an especially powerful cluster (only 70 GB of RAM across 10 nodes).
- What would be the best shard and replica settings for search performance: one index per month containing all sources, replicated 10 times, or one index per source per month, split into multiple shards?
- I'm planning to use doc_values for fields that don't need to be analyzed. Can I set doc_values on the .raw sub-field, so that I benefit from the analyzed field for searching and from doc_values on the not_analyzed version for aggregations?
- I'm planning to have 2 master nodes with half of their RAM given to heap (4 + 4 GB), and the rest as data nodes with all of their memory given to heap (approx. 8-16 GB each). Is that suitable?
- Logstash will run only on one of the masters, because all the CSV files will be stored there.
- In terms of speed, is it worth automatically removing fields such as _path, _source, _message, etc. that are generated by Logstash?
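For context on the first question, the one-index-per-month option I have in mind would be set up with an index template roughly like this (the `logs-*` pattern and the shard/replica numbers are just placeholders; the right numbers are exactly what I'm asking about):

```json
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```

This would be PUT to `_template/logs_monthly` so that every new monthly index picks the settings up automatically.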
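To make the .raw question concrete, here is the kind of multi-field mapping I mean (the field name `application` is only an example; my understanding is that doc_values can only be enabled on the not_analyzed sub-field, which is what I want to confirm):

```json
{
  "mappings": {
    "logs": {
      "properties": {
        "application": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      }
    }
  }
}
```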
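For the field-removal question, I would do it with a mutate filter in the Logstash config, something like the sketch below (the field names listed are examples of what my pipeline adds; I realize _source itself is Elasticsearch metadata rather than a Logstash-generated field):

```
filter {
  mutate {
    remove_field => [ "path", "message" ]
  }
}
```

The question is whether dropping these fields actually makes a noticeable difference to indexing speed and index size.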
Thanks for any advice.