ES for logging - what to watch out for with a high indexing rate

Hello everybody,

first post here in the community.

I set up an ES 5.4.0 cluster on AWS made of 5 data nodes and 3 dedicated master nodes to collect all the logs from our application.
The data nodes are reasonably capable servers (8 CPUs, 32 GiB RAM) with one 2 TB SSD volume each; the masters have 100 GB each, and all instances are EBS-optimized.

Heap set to 16GB, and I basically followed what the guide says about setting it up.

I have a daily index rotation, with 6 indexes per day.
The default template is set to 2 shards and 2 replicas, while 3 of the indexes have 5 shards and 3 replicas (though I am thinking of keeping only 1 replica to save space once the cluster is stable).
best_compression is enabled on all indexes.
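For reference, the default template looks roughly like this (the template name and the logs-* pattern below are placeholders, not my real ones):

# template name and index pattern are placeholders
PUT _template/logs_default
{
  "template": "logs-*",
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 2,
    "index.codec": "best_compression"
  }
}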

I am using td-agent (fluentd) to ship logs directly to ES; no parsing is needed as they are already in JSON format.
I started with dynamic mapping, but for the biggest indexes (see below) I will move to manual mapping to save space and hopefully gain a bit of speed.
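Something like this is what I have in mind for the manual mappings (just a sketch; the type and field names below are placeholders, not my real schema):

# template name, pattern, type and field names are placeholders
PUT _template/big_index
{
  "template": "bigindex-*",
  "mappings": {
    "fluentd": {
      "_all": { "enabled": false },
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}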

Problem is, those 3 indexes are quite big, and the indexing rate is quite high.

One index is already at production level; these are the numbers:
8 docker instances writing at ~2000 docs/s (primary shards) and 30 GB of data so far (looking at today's index via Kibana monitoring); refresh_interval is 15s, dynamic mapping is off and the _all field is disabled.
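(To cross-check those numbers outside Kibana I assume something like this works; the index pattern is a placeholder:)

GET _cat/indices/myindex-*?v&h=index,pri,rep,docs.count,store.size,pri.store.size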

The other 2 indexes are as follows:

With 1 server out of 8 shipping logs:

  • first index at a 1200/s rate (primary shards) with 60M documents and 35 GB of space;
  • second index at 500/s with 20M docs but 450 GB of data (unfortunately those are quite big logs)

refresh_interval for the above 2 indexes is set to 30s, and dynamic mapping is still on.
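As far as I understand, refresh_interval is a dynamic setting, so it can also be changed on the live indexes like this (index pattern is a placeholder):

PUT /bigindex-*/_settings
{
  "index.refresh_interval": "30s"
}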

Problem is, if I enable fluentd on all 8 servers, the data nodes can't cope with the amount of incoming logs.

A few things I can (and should) optimize:

  • disable dynamic mapping on the latter indexes, and disable the _all field to save some space;
  • disable best_compression on the live indexes (I read it consumes processing power), and instead enable it on yesterday's indexes before running force_merge (sketch below)
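If I go down that route, I think the daily flow for yesterday's index would look roughly like this (the index name is a placeholder, and my understanding is that index.codec is a static setting, so the index has to be closed to change it; please correct me if I got that wrong):

# index name is a placeholder for yesterday's daily index
POST /bigindex-2017.06.07/_close

PUT /bigindex-2017.06.07/_settings
{
  "index.codec": "best_compression"
}

POST /bigindex-2017.06.07/_open

# merge down to fewer segments once the index is no longer being written to
POST /bigindex-2017.06.07/_forcemerge?max_num_segments=1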

Current configuration in elasticsearch.yml:

# Get pool
thread_pool.get.queue_size: 200
# Index pool
thread_pool.index.queue_size: 600
# Search pool
thread_pool.search.queue_size: 200
# Bulk pool
thread_pool.bulk.queue_size: 1000

Currently, for the JVM I see the following values (read from one of the data nodes):

GC duration: old 0, young around 2500 ms on average
GC count: old 0, young around 125
Indexing time: around 200000 ms on average (a bit over 3 minutes)
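(I read these on one of the data nodes; I assume the same numbers can also be pulled straight from the nodes stats API, e.g.:)

GET _nodes/stats/jvm,indices?human&filter_path=nodes.*.jvm.gc,nodes.*.indices.indexing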

I know the above conf does not really help, as it hides the true root cause, but reading here and there I found I was getting rejected documents, so I increased bulk.queue_size and lowered the search queue (we don't do many searches on it).
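For the rejections, I assume the _cat thread_pool API is the right place to watch them, e.g.:

GET _cat/thread_pool/bulk,index,search?v&h=node_name,name,active,queue,rejected,completed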

I read the part about tuning for indexing speed, and one thing I didn't understand was the "indices.memory.index_buffer_size" value.

Where should I set this and most importantly how can I see the current value?
Most likely it needs to be set on a per-index basis, meaning I would have to update my templates to include that parameter.
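From my reading of the docs it might instead be a node-level setting that goes into elasticsearch.yml, something like the line below (the 20% value is just an example); please correct me if I got that wrong:

# assumption: node-level static setting, so it would need a restart of the data nodes to apply
indices.memory.index_buffer_size: 20%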

Are there any other tips/settings you can suggest I look at and adjust based on the incoming flow?

The cluster can keep up with 2 of the 8 servers sending data, but when I tried enabling 4 of them I saw the load on the data nodes going above an 8.00 average (which should be their max given they are 8-core servers).
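(In case it helps, this is what I assume is the quickest way to watch CPU, load and heap across all the nodes at once:)

GET _cat/nodes?v&h=name,cpu,load_1m,load_5m,heap.percent,ram.percent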

I would like to understand where the bottleneck of my cluster is: whether it's some setting I forgot to configure, or simply too much data to index, in which case I should scale out the data nodes.
For the sake of completeness: at first I set up a similar cluster using AWS ES with Kinesis Firehose to send the data, but I had problems with space, as those instances could only have 1 disk per server with a maximum of 1.5 TB. It was also too expensive to keep running, so I decided to go with EC2 instances and a load balancer on top that connects only to the master nodes (as I later understood, there is no need to have all the nodes behind the load balancer; the masters are enough).

Thank you in advance,
Lorenzo

For logging use cases I rarely see anyone running with anything other than 1 replica. The reason is that having lots of replicas increases storage requirements and reduces indexing throughput, as primary and replica shards basically do the same amount of work. Having lots of replicas also increases network traffic, as more data needs to be replicated between the nodes.

First try changing to 1 replica for all indices and see how that affects your indexing rate. I would also recommend that you install X-Pack Monitoring, so that you can give us a better idea of what is going on if reducing the number of replicas is not sufficient.

Thank you, Christian, for the quick reply.
I just changed the templates to 1 replica on all indexes; I will see tomorrow morning how it goes.
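Since (as far as I understand) templates only apply to newly created indexes, I guess today's already-existing ones would need a live settings update like this (the index pattern is a placeholder):

PUT /logs-*/_settings
{
  "index.number_of_replicas": 1
}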

I do already have X-Pack Monitoring installed; what key graphs should I look at to understand how the cluster is behaving?

Thanks again,
Lorenzo

[monitoring screenshots]

Today's indexes numbers and details about one of them.
What do you guys think?

Lorenzo

Are you able to keep up with the data being generated? What do GC and CPU usage look like?

With 80% traffic I have the following graphs on one of the nodes:

[monitoring graphs for one node at 80% traffic]

Yesterday I tried with all sources enabled for about 30 minutes to reach 100% traffic, but I saw the CPU load hitting 8.0 and JVM usage constantly around 70-80%:

[monitoring graphs for one node at 100% traffic]

I think with the current config I cannot cope with 100% traffic.
What do you think?

As far as I can see that looks pretty good.

Hm, alright.
If I set the traffic back to 100%, what graphs should I focus on to understand whether the cluster is able to keep up or is at its limit?

Is there any way to aggregate them for all nodes instead of having to look at each one separately?

Thanks Christian

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.