Scaling an Elasticsearch cluster to >100 billion documents, 25k/sec indexing rate

williamejsalt · September 26, 2013, 11:17am

Hi all,

Im currently running an elasticsearch cluster thats storing and consuming
logs from logstash.
My architecture is:

x3 m1.small logstash nodes, consuming from a local redis queue, indexing
to elasticsearch through the http_elasticsearch output
x5 m1.xlarge elasticsearch nodes with 10gb of heap assigned to the JVM
running elasticsearch

node stats:

gist.github.com

https://gist.github.com/willejs/083835bb4b14e938a0cf

gistfile1.txt

{
    "cluster_name": "logstash",
    "nodes": {
        "sObjE_2bSHmVjVTu3BOcVA": {
            "timestamp": 1380193218419,
            "name": "hostname",
            "transport_address": "inet[/10.0.20.201:9300]",
            "hostname": "hostname",
            "attributes": {
                "aws_availability_zone": "us-east-1a",

This file has been truncated. show original

node config:

gist.github.com

https://gist.github.com/willejs/f765425a7e265deb2627

gistfile1.txt

######################### ElasticSearch Configuration  ########################

# This file is managed by Chef, do not edit manually, your changes *will* be overwritten!
#
# Please see the source file for context and more information:
#
# https://github.com/elasticsearch/elasticsearch/blob/master/config/elasticsearch.yml
#
# To set configurations not exposed by this template, set the
# `node.elasticsearch[:custom_config]` attribute in your node configuration,

This file has been truncated. show original

I am running elasticsearch 0.90.1, only indexing at around 250 documents a
second, storing roughly 1billion documents, 1 index per day, retaining
documents for 30 days.

I haven't performed any optimisation apart from setting the field data
cache size to 40%.

I am seeing around 8gb of heap usage per node, and this is slowly rising
with the amount of documents i am indexing. I have kept adding nodes when
cluster nodes run out of heap and crash, I cant see where the heap is being
used in the node stats section, im guessing its some sort of caching? I now
want to tune the cluster to reduce ram usage.

Im planning to scale this cluster up x100, so i need to now heavily tune
it, im going to give this a go myself first, before i start paying for
support/a consultant.

My questions are:

Can i realistically reduce heap usage in its current state?
Do i want to change the architecture to the cluster some some nodes are
purely stores, whilst others are meant for searching etc?
Will changing fields to not analysed reduce memory usage?
What kind of architecture and resources would a cluster of this size
consume?

Any help on this would be greatly appreciated.
cheers
Will

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · September 26, 2013, 7:19pm

You have tested with 5 nodes, so instead scaling at once up to 100x, why
not 2x first? You could test with 10 nodes or so if you can process the
double amount of docs.

A heap of 10g per node is more than enough. Heap usage will steadily grow
and adjust dynamically by garbage collection. OOM crashes only happen if
the heap is filled with persistent data and garbage collection is not
possible. How do you plan to allocate the heap? For queries with
filter/facets/caches?

You can't set up "pure" store and "pure" searcher nodes. You could use
replica with shard allocation and convince ES to place them an certain
nodes and then search on primary shards only (for example) but this setup
is tedious manual work and not worth it. Better approach when using
hundreds of machines is to use 30 day indexes with just 10 shards each (so
300 shards * n replica shards will live in the cluster) and let ES
distribute the shards automatically so the index workloads distributes
between all the nodes.

For random text data, "not_analyzed" fields do not reduce memory usage so
much. They reduce indexing time. Fields that are not stored reduce memory
usage on disk.

To find out the optimal cluster size, I'm afraid you have to run
scalability tests for yourself. Nobody can answer such a question honestly
without intimate knowledge of your data and your requirements.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
ES HEAP // ES infrastructure by 10 billion docs per year Elasticsearch	1	675	July 5, 2017
Large heap usage with each node Elasticsearch	15	3758	July 5, 2017
How to optimise heap usage on elasticsearch nodes? Elasticsearch	9	756	November 10, 2020
Elasticsearch performance tuning on elastic 1.7 Elasticsearch	3	948	July 5, 2017
Elasticsearch Memory issue Elasticsearch	10	1706	July 6, 2017

Scaling an Elasticsearch cluster to >100 billion documents, 25k/sec indexing rate

Related topics