[NEWBIE] Increase indexing speed for huge logs

Hi everybody,

I'm French and I'm a total newbie with Elasticsearch.

Elasticsearch version imposed by the security team: 7.10.2

I created a cluster like this, with dedicated nodes:

  • 2 master nodes
  • 1 master-eligible only node
  • 1 coordinating-only node
  • 6 data nodes
  • 2 WARM data nodes
  • 2 COLD data nodes
  • 6 ingest nodes

I have to ingest 130 GB of logs in one batch per day, for one application (the application is load balanced across 32 servers), in production.

The first day was April 14th, and the indexing was very, very slow.

So I need to tune things correctly :wink:

Cluster side (sketched as index settings just below):

  • Shards: 1 primary, 1 replica
  • refresh_interval: 30s
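
A simplified sketch of those settings (the index name is a placeholder; the shard count itself is set at index creation, via the template):

  PUT my-logs-2023.04.15/_settings
  {
    "index": {
      "number_of_replicas": 1,
      "refresh_interval": "30s"
    }
  }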

Filebeat side (the relevant section is sketched after this list):

  • 9 input log types
  • output elasticsearch hosts:
    --> I have put all the hosts of the cluster, with the coordinating node in first position. Is that the right way, or should I put the ingest and data nodes first?
  • loadbalance: true
  • bulk_max_size: 8192
    --> can I increase bulk_max_size?
  • worker: 8
    --> can I increase it?
  • queue.mem.events:
    --> how do I calculate the right number?

Number of servers:

  • should I increase the number of data nodes and ingest nodes?

Could you please help me tune this to increase the indexing speed?

One more question, just to be sure: which node is in charge of indexing?

  • the ingest nodes, with the pipelines that parse the logs and extract the events?
  • the data nodes? yes or no?

Thanks a lot, from the biggest newbie of the community :slight_smile:

This document might help:

I would consider a performance test of the cluster to evaluate its maximum performance.
At first glance I would look at the number of primary + replica shards; this number determines how many nodes work for you, and it is one way to scale performance.
Index mapping and its complexity also play a role and are worth checking.

Bonjour pepite.

130 GB of logs per day? For how many days? That becomes something enormous to store and index after just 10 days.

First, can you check whether ALL the content of your logs really needs to be inserted into Elasticsearch, and if so, which "fields" do not need to be indexed?

You mentioned 6 data nodes. Does that include the 2 warm and 2 cold nodes or is it 6 hot data nodes that you are actively indexing into?

What is the specification of the nodes in terms of CPU, RAM and type and size of storage used?

Does each type of log go into its own index, or do multiple log types go into the same index?

Does all the data go through an ingest pipeline?

What is your required retention period for these logs?

Hello,

Thanks a lot for your answer.
I know this document, I used it to try some tuning before going to prod :wink:
JVM, swap, swappiness, refresh interval :wink:

For my shards: one primary and one replica.

Hi, thanks.
Yes, 130 GB per day from 32 servers, for many months. My boss wants all the logs, yes :wink:
In reality it's the logs of the day before; I retrieve an archive of the logs the day after.

I ingest the log xxxxxx.log-2023-04-16 into ELK on the 17th of April 2023.
For example:

  • the logs are created on the 16th of April
    An archive is made during the night of the 16th of April.
    I have to ingest the logs of the 16th of April on the 17th of April, the day after

Hi, thanks.
I have 6 data nodes + 2 WARM data nodes + 2 COLD data nodes.
So it is 6 data nodes that I am actively indexing into; that seems not to be enough, no?

The ILM is: HOT for 2 days, WARM for 7 days, then COLD, with delete in the COLD phase. But this volume will explode :wink: my cold data, so I think I will have to increase their number.

8 CPUs, 24 GB RAM, OS: Rocky 8

Each type of log goes into its own daily index; I use index templates (sketched below) to create the daily indices :slight_smile:
e.g.:
toto.log --> index toto-2023.04.15
boom.log --> index boom-2023.04.15

The day after:
toto.log --> index toto-2023-04-16
boom.log --> index boom-2023-04-16
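
The template is roughly like this (a simplified sketch; the real one also contains the mappings):

  PUT _index_template/toto
  {
    "index_patterns": ["toto-*"],
    "template": {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        "refresh_interval": "30s"
      }
    }
  }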

Each log goes through its own ingest pipeline :slight_smile:
--> baby.log is parsed with its pipeline: baby-pipeline
--> tv.log is parsed with its pipeline: tv-pipeline
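
As a sketch, such a pipeline looks something like this (the grok pattern here is just an example, not my real one):

  PUT _ingest/pipeline/baby-pipeline
  {
    "description": "parse baby.log events",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"]
        }
      }
    ]
  }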

My boss wants to keep the indices for 6 months.
I use snapshots with a filesystem share (NFS) on one server (I will have to increase /var :slight_smile:)

  • hourly snapshots: retention 24 h
  • daily snapshots: retention 6 months (sketched below)
  • monthly snapshots: retention 12 months
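
As an SLM policy, the daily one would be roughly this (a sketch only; the repository name and schedule are placeholders):

  PUT _slm/policy/daily-snapshots
  {
    "schedule": "0 30 1 * * ?",
    "name": "<daily-snap-{now/d}>",
    "repository": "nfs_backup",
    "config": {
      "indices": ["*"]
    },
    "retention": {
      "expire_after": "180d"
    }
  }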

I think I will have to stop the daily snapshots, no?
I don't know how to calculate the number of servers needed in the cluster.

For the moment:

I decreased the compression level to 1 in Filebeat.
I increased the number of workers (12) and bulk_max_size.
I also increased queue.mem.events and, a little, flush.min.events.

But I don't know yet what happens, because I am in an Ansible training course :wink:

After my training, I think I will increase:

  • the number of ingest nodes
  • the number of data nodes (+ warm nodes + cold nodes)
  • and test with workers, bulk_max_size and queue.mem.events

What do you think about this?
But before that, I would like to tune correctly :wink: so I understand what happens :slight_smile:

What type of storage are the different nodes using? As you can see in the docs Petr linked to, storage performance can often be the bottleneck during heavy indexing.

It sounds like you are loading data once per day, which will cause uneven load. I would recommend installing Filebeat on the servers generating the logs so that they can ingest data in near realtime and avoid huge spikes.

If all data goes through ingest pipelines and you have dedicated ingest nodes, you should send all indexing requests to the dedicated ingest nodes.

As you are using Filebeat to periodically load large amounts of data, which is not how it is generally used, you may need to tweak its configuration. It is possible that it is indeed the bottleneck.

I would not change the number of nodes in the cluster until you have determined that is indeed the bottleneck. Once you send all data directly to the dedicated ingest nodes, check CPU utilisation on these nodes. If this is not maxed out it is not generally a bottleneck. For the HOT data nodes doing all indexing, check CPU usage and disk await to see if they are likely a bottleneck. Also on all nodes look out for long or frequent GC, as that can indicate the heap is too small.

Hi,

The disks are SATA disks on the hypervisor.

I can't this year; it's only for next year.

Are you using SSDs? If not, I would recommend looking at disk I/O and await stats on the HOT data nodes as using fast storage is important for high ingest throughput.

I don't understand this.
I thought the Elasticsearch cluster did this by itself, no?

Elasticsearch will reroute traffic, so you can send data to any node and it will still work. It is however more efficient to just list the dedicated ingest nodes in Filebeat and send all data directly to them as that is where it is going anyway.
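
In filebeat.yml that simply means listing only the dedicated ingest nodes, something like this (replace the hostnames with your real ingest nodes):

  output.elasticsearch:
    hosts: ["ingest-node-1:9200", "ingest-node-2:9200", "ingest-node-3:9200"]
    loadbalance: true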

Thanks, I will try this.

3 more morning questions before my training:

1 - What is the correct syntax to set the refresh interval persistently in the cluster settings?

PUT /_cluster/settings?

2 - Which API endpoint can I use to check whether the current indexing task is still running?
Is it possible?

3 - I think I have found the log responsible.
One log, business.log, has a size between 3 and 5 GB per day per server.
Its index business-2023-04-* grows between 75 and 100 GB per day.

What kind of setting can I set on the index pattern business-* to increase the indexing speed of this index?

Thanks a lot everyone, I learn every day from your answers, and sorry if my questions are stupid :wink:
Newbie newbie mode :slight_smile:

If you are using a recent version of Elasticsearch I do not think you need to do this.
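
Note also that refresh_interval is an index-level setting rather than a persistent cluster setting, so if you do want to set it explicitly it would be something along these lines (the index pattern is just an example):

  PUT toto-*/_settings
  {
    "index": {
      "refresh_interval": "30s"
    }
  }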

What is your definition of current state?

I am not sure Elasticsearch is the bottleneck, so tuning it may make no difference whatsoever. It could just as well be Filebeat that is slow reading a single huge file. Until you have evidence Elasticsearch is the bottleneck I would concentrate on optimising the Filebeat config (not my domain), especially as you are using it in a non-traditional way and not tailing the large files in near realtime.

It may be worthwhile creating a new topic in the Filebeat section asking for advice on how to best configure Filebeat to efficiently handle large files this way (where you are not constantly tailing them).

Before you can optimise you need to identify where the bottleneck is. I would recommend going through the following steps and post the results/findings here.

  1. Open a new thread under the Filebeat section to get advice on ideal config for your use case. That may or may not be what you already have in place.
  2. Direct all Filebeat traffic to the dedicated ingest nodes and monitor their CPU usage. If CPU usage is close to 100% on one or all of these nodes, it could be limiting throughput. If not, this is likely not the limiting factor. Also check for long or frequent GC in the logs.
  3. Look at the HOT data nodes doing all the indexing. Indexing is most often limited by CPU or disk performance, so look at CPU usage on these nodes as well as disk I/O and await, which you can get using iostat -x on the nodes. Also check for long or frequent GC in the logs. If this all looks OK the data nodes are likely not the bottleneck.

Hi @Christian_Dahlqvist

I use version 7.10.2, the only version authorized by the security service.

I wonder how to know whether the HOT data nodes are doing the indexing, i.e. whether the state of the indexing task is "running", "stopped", "finished"...
/_cat/nodes? pending tasks? tasks?
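
Maybe something like these calls (not sure they are the right ones):

  GET _tasks?actions=*bulk*&detailed=true
  GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected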

Yes, I will follow your advice and do this now :wink:

Done :wink: the traffic is now directed to the ingest nodes. I will monitor tomorrow morning.

I monitored r_await: it is between 3 and 340 on certain nodes. Do you think I have to increase the JVM heap size?
Currently it is set to 10 GB in a file in /etc/elasticsearch.

During the indexing task, when I look at RAM & CPU with the /_cat/nodes API, the HOT data nodes don't seem to work all together but two by two. Is that the normal behaviour? Is it possible to make the HOT data nodes work together?
I think it's due to the loadbalance parameter in Filebeat being set to true.

I will finish the monitoring tomorrow morning, when I receive the huge logs :wink:

A big thanks :wink: for taking the time to explain this to me. I have read the docs many, many, many times to understand better :wink: and to try to install the cluster following best practices, but I failed :frowning:

Hi everybody :+)

How is everyone?

I tried setting an alias to roll over the index based on size conditions in my ILM policy, but it did not work.
I tried to change my index template, which is attached to all my indices via index patterns.

Do I have to create one alias per index pattern to have the rollover capability?
I certainly made a syntax error with multiple aliases.
Do you know the correct syntax, please?
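
For reference, my understanding of the usual 7.x rollover pattern is something like this (the index name and the policy name are placeholders, so I may well have the syntax wrong):

  PUT _index_template/toto
  {
    "index_patterns": ["toto-*"],
    "template": {
      "settings": {
        "index.lifecycle.name": "my-ilm-policy",
        "index.lifecycle.rollover_alias": "toto"
      }
    }
  }

  PUT toto-000001
  {
    "aliases": {
      "toto": { "is_write_index": true }
    }
  }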

Thanks

This is odd as I believe that version has security vulnerabilities and is no longer receiving any patches. It would make more sense from a security perspective to upgrade to the latest version of Elasticsearch. If the license is an issue you may instead consider going with Opensearch, although that is a different ecosystem.

What is the configuration of your 6 data nodes? What is the full output of the cat nodes API?

Can you please share the full output of iostat -x from these nodes?

What does JVM heap size have to do with this? If you look in the logs and see evidence of frequent and/or long GC, the heap size may be an issue. If not, there are likely other factors limiting throughput.

What is the full output of the cluster stats API?
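
That is, the full output of:

  GET _cat/nodes?v
  GET _cluster/stats?human&pretty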