New logsdb format

Hello everybody.

A few weeks ago I set up a new Elasticsearch cluster to hold the logs coming from our Kubernetes cluster.

Since we were installing a brand-new 8.17, we decided to give the new logsdb format a go.

We started with a basic setup: one week of data retention, standard rollover at 50 GB, 1 primary shard and 1 replica, pushing data to a data stream.

Everything seemed to go pretty smoothly at the beginning, but after a couple of days we started to see a kind of task storm whenever an index was being rolled over. The node holding the primary shard of the index being rolled over would go to almost 100% CPU (around 80% in practice), the disks would go to 100% utilization, and everything would completely stop for many hours, usually 8 to 10, during which the cluster stopped ingesting and stopped answering any query.

We tried to tweak various settings (disabling dynamic mapping, reducing the number of indexed fields, removing the replica...) without any noticeable improvement.

After about a week we gave up: we switched back from logsdb to the standard index format and the cluster started working as it should, with CPUs never going over 15% and the disks behaving... well... normally.

This is a production cluster, so we can't really change settings to reproduce the issue, but I can share some of the things we saw while the disaster was happening :slight_smile:

The cluster was overwhelmed with pending tasks, with each node reporting 8-10k of them, mostly related to index commits (sorry, I can't remember the exact task description).
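For reference, this kind of backlog can be inspected with calls along these lines (I can't recall exactly which of the two views we were looking at, so take these as a pointer rather than a transcript):

GET /_tasks?detailed=true
GET /_cluster/pending_tasks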

The very root cause may be that the cluster doesn't have very high-speed disks but commodity ones. Still, the impact we saw (8 hours to close an index) doesn't seem proportionate, given that with a standard index the disks never work at more than 50% load.

The biggest problem, I think, is that the whole cluster stopped working while this activity was going on (even for other indices): no ingestion, no search.

Does anyone have any similar experience on this?

Can we expect some kind of improvement on this front in upcoming releases?

Thanks.


Welcome and thanks a lot for sharing your story.

Let me share it with the team who will be able to give more feedback.

Everything seemed to go pretty smoothly at the beginning, but after a couple of days we started to see a kind of task storm whenever an index was being rolled over.

Could you share some more details about your ILM configuration? Are you maybe using a force merge action? Merges in LogsDB are more expensive as they involve index sorting. This sorting is essential for reducing the storage footprint but does add some overhead, especially when force-merging down to one segment.
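You can check your policy with GET _ilm/policy. If a force merge action were configured, it would look roughly like this (the policy name and thresholds here are just an example, not your actual config):

PUT _ilm/policy/logs-example-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb"
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}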


Hi Felix.

This is the whole template definition. The only thing we changed is logsdb to standard, and everything started working smoothly:

{
  "template": {
    "settings": {
      "index": {
        "mode": "standard",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "refresh_interval": "30s",
        "number_of_replicas": "0"
      }
    },
    "mappings": {
      "dynamic": "false",
      "_data_stream_timestamp": {
        "enabled": true
      },
      "dynamic_templates": [],
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "container": {
          "properties": {
            "image": {
              "properties": {
                "name": {
                  "type": "text"
                }
              }
            }
          }
        },
        "host": {
          "properties": {
            "hostname": {
              "type": "keyword"
            }
          }
        },
        "kubernetes": {
          "properties": {
            "container": {
              "properties": {
                "name": {
                  "type": "keyword"
                }
              }
            },
            "namespace": {
              "type": "keyword"
            },
            "node": {
              "properties": {
                "name": {
                  "type": "keyword"
                }
              }
            },
            "pod": {
              "properties": {
                "name": {
                  "type": "keyword"
                }
              }
            }
          }
        },
        "log": {
          "properties": {
            "offset": {
              "type": "unsigned_long"
            }
          }
        },
        "message": {
          "type": "text"
        }
      }
    },
    "aliases": {},
    "lifecycle": {
      "enabled": true,
      "data_retention": "7d"
    }
  }
}

We don't have any specific Index Lifecycle Policy; it's just applying a standard 7-day retention through a data stream.

As I said, it was a new installation from scratch, and we just next-next-next-ed through the proposed defaults.

Just a note: in the initial, problematic setup, the number of replicas was 1 (as per the default) and refresh_interval wasn't set.
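So, reconstructing from memory, the settings block of the problematic template was essentially this (everything else was identical to the template above):

"settings": {
  "index": {
    "mode": "logsdb",
    "routing": {
      "allocation": {
        "include": {
          "_tier_preference": "data_hot"
        }
      }
    },
    "number_of_replicas": "1"
  }
}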

Hi @skeyby Welcome to the community and thanks for sharing.

Couple more questions.

What license level are you working with?

Can you share a little about your hardware as well?

CPU / RAM and Storage Type?

How many logs are being rolled over a day?

the disks would go to 100% utilization, and everything would

100% IO / Storage Capacity?

I am a bit with Felix on this: I have seen clusters fall behind on the force merge and then not be able to catch up... either because of disk capacity or not enough CPU... or both...

We're using the Community version of ES.

By disk at 100% I mean 100% load.

The cluster is based on three physical nodes, each with 20 cores, 64 GB RAM and 2 TB of storage on a RAID 5 ext4 filesystem. The disks are not SSDs, but fast HDDs.

We're rolling over more or less 25 indices per day, with about 50 million log lines each and approximately 55 GB per index.

I don't exactly know how we could have enabled the force merge action, but I'm pretty sure we didn't enable it intentionally; as I said, I left all the parameters at their defaults.

Let me share some graphs for one of the nodes of the cluster over the past 15 days. In the first 7 days we were trying to get logsdb to work; in the last 7 we had just changed logsdb to standard index mode in the template.

I'm also very confused by the enormous bandwidth the nodes were using.

Hope this helps you to understand the situation.

Here you can find the disk load in the past 15 days

And finally, here's the bandwidth graph for the same time frame. As you can see, the nodes were exchanging a huge amount of data during the high CPU/disk load.

You can use GET /index-name/_segments on both a logsdb and a standard index to see the number of segments per index now.
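If it's easier to eyeball, the _cat API gives the same information as one row per segment (the index name below is just a placeholder for one of your backing indices):

GET /_cat/segments/my-backing-index?v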

The network usage is the most interesting part of an already interesting puzzle.

@skeyby Can you share your ILM policy please?

Rolling 25 indices a day on three nodes is not trivial, depending on the distribution... though it seems fine without logsdb. Are they all a single primary shard?

Smart index sorting improves storage efficiency by up to 30% and reduces query latency on some logging data sets by locating similar data close together. By default, it sorts indices by host.name and @timestamp. If your data has more suitable fields, you can specify them instead.
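If you did want to override the defaults, it would be something like this in the template settings (the field choice here is purely an illustration, not a recommendation for your data):

"settings": {
  "index.mode": "logsdb",
  "index.sort.field": ["kubernetes.pod.name", "@timestamp"],
  "index.sort.order": ["asc", "desc"]
}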

Do you have a host.name field, and if so, do you have a sense of its cardinality?

Something hints to me that it's some combination of the sorting and the HDDs not keeping up... the sorting happens at index time.

If I'm not wrong, the OP is not using ILM but the new data stream lifecycle, as the lifecycle block at the end of the template shows.


Now, with standard indexing, we have 20-30 segments per index once the index is rolled over, and about 150 on the current "append" index.

When we were trying logsdb, the number of segments once the index was finally rolled over was about 50, but on the currently appending index it was far, far higher, something like 400/500 segments.

Yes, we have a host.name field, which is the Kubernetes node. The cardinality is pretty low: we have something like 10 nodes.

At first with logsdb we were using the standard 1 primary shard + 1 replica setting; then we dropped the replica, with basically no change in the problem we were facing.

I forgot to mention we were on 8.17.3 when the problem happened.

Thanks for the info.

It does seem your case highlights that the logsdb format works quite differently, with different IO patterns, and is also more sensitive to any IO constraints. That would be a "good to know", if confirmed.

When would such (over?)sensitivity reach the level of a bug? Not sure; maybe only if enough customers complain about it :slight_smile:

@skeyby would you be able to capture hot threads and share the output here while CPU consumption is high? (GET /_nodes/hot_threads)
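Something like this captures a bit more detail if you manage to catch it in the act (the parameters are optional and just widen the snapshot):

GET /_nodes/hot_threads?threads=9999&ignore_idle_threads=false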

^^^ This I believe is key

If I am interpreting this correctly, the data stream backing index that is the current "append to / write to" index has 400-500 segments... but after rolling over and normal/automatic merging it is back down to ~50 segments.

Merging can be slower / more expensive on HDDs.

So I suspect what is happening is that many segments are being created as part of the logsdb writing strategy on the HDDs (in order to keep up, sort, etc.), and then a lot of work needs to be done to reorganize and merge them as part of the normal merging process, which is causing the CPU and IO pressure.

Why so many segments on the write side? I'm not sure; that is deeper than my understanding. I'm also not sure whether there are settings you could use to help mitigate it.
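One setting that has historically been suggested for spinning disks is capping the merge scheduler threads per shard. I have no idea whether it actually helps with the logsdb write pattern, so treat this purely as a guess (the index name is a placeholder):

PUT /my-backing-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}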

We would probably need someone like @mvg, @felixbarny, or @DavidTurner to explain that.

Of course I could be wrong :slight_smile: Let's see what the experts say.


Hi Martijn,

Unfortunately this is a production cluster; I cannot switch back to a non-working setup... sorry :frowning:

@skeyby

Do you have a dev/test cluster with a similar HW profile that you could try ingesting on?

Pretty sure the issue is related to what I "theorized" above...

I have had issues with "merging" falling behind on other clusters... but I did not realize that could potentially be caused by the logsdb "strategy".