You have tested with 5 nodes, so instead of scaling up 100x at once, why
not try 2x first? You could test with 10 nodes or so and check whether you
can process double the amount of docs.
A heap of 10g per node is more than enough. Heap usage will grow steadily
and is adjusted dynamically by garbage collection. OOM crashes only happen
if the heap fills up with persistent data that garbage collection cannot
reclaim. How do you plan to use the heap? For queries with
filters/facets/caches?
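To answer that for your own workload, here is a minimal sketch of how you could watch where the heap goes, assuming the official Python "elasticsearch" client and a node reachable at localhost:9200; the exact stats field names (e.g. "filter_cache") differ between ES versions, so treat them as best effort.

```python
# Sketch: inspect heap, fielddata and filter cache usage per node via the
# nodes stats API. Assumes the Python "elasticsearch" client and ES on
# localhost:9200; stats keys vary between ES versions, hence the .get() calls.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

stats = es.nodes.stats(metric=["jvm", "indices"])
for node_id, node in stats["nodes"].items():
    jvm = node.get("jvm", {}).get("mem", {})
    indices = node.get("indices", {})
    heap_used = jvm.get("heap_used_in_bytes", 0)
    fielddata = indices.get("fielddata", {}).get("memory_size_in_bytes", 0)
    filter_cache = indices.get("filter_cache", {}).get("memory_size_in_bytes", 0)
    print("%s heap=%dMB fielddata=%dMB filter_cache=%dMB" % (
        node.get("name", node_id),
        heap_used // (1024 * 1024),
        fielddata // (1024 * 1024),
        filter_cache // (1024 * 1024),
    ))
```

Running this under your real query mix shows whether the 10g heap is eaten by caches/facets or mostly sits idle.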
You can't set up "pure" store and "pure" searcher nodes. You could use
replicas with shard allocation filtering to convince ES to place them on
certain nodes and then search on primary shards only (for example), but
this setup is tedious manual work and not worth it. A better approach when
using hundreds of machines is to use 30 daily indices with just 10 shards
each (so 300 primary shards, plus n replica copies of each, will live in
the cluster) and let ES distribute the shards automatically, so the index
workload spreads across all the nodes.
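A minimal sketch of that time-based index setup, assuming the Python "elasticsearch" client; the index name pattern, shard count and replica count are just placeholders to adjust for your case:

```python
# Sketch: one index per day with 10 shards, letting ES spread the shards
# across the cluster on its own. Assumes the Python "elasticsearch" client;
# names and settings are examples only.
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

index_name = "docs-%s" % date.today().strftime("%Y.%m.%d")  # e.g. docs-2014.01.31
if not es.indices.exists(index=index_name):
    es.indices.create(
        index=index_name,
        body={
            "settings": {
                "number_of_shards": 10,   # 30 daily indices -> 300 primary shards
                "number_of_replicas": 1,  # plus one replica copy of each
            }
        },
    )

# Index into today's index; ES balances the shards between all nodes itself.
es.index(index=index_name, doc_type="doc", body={"title": "hello", "body": "world"})
```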
For random text data, "not_analyzed" fields do not reduce memory usage that
much; they mainly reduce indexing time. Fields that are not stored reduce
disk usage.
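For illustration, a mapping sketch in the pre-2.x "string" syntax, again assuming the Python "elasticsearch" client; the index, type and field names are made up:

```python
# Sketch: mapping with a not_analyzed field and a field that is not stored.
# Pre-2.x "string" mapping syntax; index, type and field names are examples.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.indices.create(
    index="docs-test",
    body={
        "mappings": {
            "doc": {
                "properties": {
                    # not_analyzed skips analysis, so indexing is cheaper,
                    # but for random text it does not save much heap
                    "tag": {"type": "string", "index": "not_analyzed"},
                    # store: false (the default) keeps the field out of the
                    # stored-fields section on disk; it is still in _source
                    "body": {"type": "string", "store": False},
                }
            }
        }
    },
)
```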
To find out the optimal cluster size, I'm afraid you have to run
scalability tests for yourself. Nobody can answer such a question honestly
without intimate knowledge of your data and your requirements.
Jörg