Create two versions of the same index (one with fewer fields)

Hello,

I have a requirement to ingest Metricbeat/Filebeat data twice: once with all the default fields, and a second time with only the fields the dashboards will use.

These two versions will have different retention policies too. The goal is to save space by keeping the detailed logs for less time than the "dashboard logs", and also to make the dashboards faster.

Migrating to Fleet is not an option at the moment.

The options I can think of:

A) Adding Logstash in the middle and duplicating events
B) Running Metricbeat/Filebeat twice with different configuration files (for Filebeat I would have to use separate config and data paths to avoid registry conflicts).

This is a big infrastructure, so A would be difficult, and I'm not sure whether B would affect host performance in some way.
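For option B, here is a minimal sketch of what the second, trimmed instance's config could look like. The paths, index name, and field list are hypothetical placeholders to illustrate the idea, not a tested setup:

```yaml
# filebeat-short.yml -- hypothetical second Filebeat instance (sketch only)

# Separate data path so this instance's registry does not clash
# with the default instance's registry.
path.data: /var/lib/filebeat-short

filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log

processors:
  # Keep only the fields the dashboards actually use
  # (hypothetical list; @timestamp is always kept by include_fields).
  - include_fields:
      fields: ["message", "host.name", "log.level"]

output.elasticsearch:
  hosts: ["https://localhost:9200"]
  # A custom index also needs setup.template.name/pattern in a real config.
  index: "filebeat-short-%{+yyyy.MM.dd}"
```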

Is there an alternative way to achieve this?

Thanks

I am not aware of other options. Keep in mind when you run two Filebeat/Metricbeat instances that you have to configure a different data folder for each. Otherwise, IDs and file states can interfere with each other.

Also, you could open an enhancement request on GitHub for a processor similar to the existing clone filter in Logstash.


Thank you @kvch

This is what I found related to work in progress: [Agent] Add support for multiple outputs for the Beat agent · Issue #14445 · elastic/beats (github.com)

That issue is about Elastic Agent, not Beats. You can have multiple outputs in Elastic Agent, but it basically starts multiple Beat instances, so you are better off running multiple instances of Beats without Agent (unless you need the features provided by Agent).

I suggest a different approach.

First, put all the default fields into Index-X.

Then run another pipeline at a certain frequency that reads data from Index-X, grabs only the required fields, and puts them in Index-Y.

This way you won't have to run multiple Beats, Logstash instances, or pipelines.


I also considered that.
But is there a way to do this within Elastic? Otherwise it means configuring an external tool, and we are back in roughly the same place as with Logstash. The short version of the index needs to be available at the same time as the extended version (live).

@kvch I was expecting to find the data folder as a flag under the run section, but didn't find it:

EDIT: Just found it

./filebeat -E "path.config=shortconfig" -E "path.data=shortdata"

I wonder if you could use the reindex API for this: Reindex API | Elasticsearch Guide [8.2] | Elastic. By default you would also query both indices at the same time, and the reduced index would be created daily.
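A sketch of what such a daily reindex could look like, using source filtering to copy only the needed fields (the index names and field list are hypothetical):

```
POST _reindex
{
  "source": {
    "index": "filebeat-full-2022.05.01",
    "_source": ["@timestamp", "message", "host.name", "log.level"]
  },
  "dest": {
    "index": "filebeat-short-2022.05.01"
  }
}
```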

I assume you looked into rollups? Rollup overview | Elasticsearch Guide [8.2] | Elastic. And it seems what you are looking for is not only for metrics but also for log data? For the log data, what kind of fields would you get rid of?
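For the metrics side, a rollup job would look roughly like the sketch below; the job name, grouping, and metric fields are hypothetical examples:

```
PUT _rollup/job/metricbeat-hourly
{
  "index_pattern": "metricbeat-*",
  "rollup_index": "metricbeat-rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "60m" },
    "terms": { "fields": ["host.name"] }
  },
  "metrics": [
    { "field": "system.cpu.total.pct", "metrics": ["avg", "max"] }
  ]
}
```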

On the query performance side, I wonder whether ingesting only parts of the documents will really make a difference. Have you tested this? If not, we are left with the storage savings. What is the driver here, cost?

Hello @ruflin appreciate your feedback.

I thought about running a cron script that reindexes with fewer fields, but introducing custom tools is not something I can do.

I understand rollups are meant to reduce the data after some time, while the whole idea here is to have both versions of the data available at the same time, and then keep the shorter version afterwards.

About reducing fields for performance, I agree with you; I'm not sure that would improve performance.

The requirement started when, after finishing the Kibana dashboards, the loading time was 15-20 s, with <200 ms ES queries and the Kibana nodes not stressed at all. Elastic support discovered it was a bug related to the number of mapping fields, and that upgrading would fix it.

ES was updated and the loading time decreased, but it is still high. Then we asked ourselves:

  • what if we remove the unused fields from the mappings anyway?
  • what if we also remove the unused fields from the documents, so the dashboards query lighter data and we save space?

Every query runs against millions of documents, so maybe this impacts Kibana performance.

Thanks for the additional background info. My general concern is that we potentially optimise the wrong thing and end up with a complicated solution that works but is still limited.

There are currently quite a few ongoing efforts, like TSDB (Support for TSDB in package spec · Issue #311 · elastic/package-spec · GitHub) or setting index: false for many fields (Set `index: false` on fields that are rarely used for filtering · Issue #3419 · elastic/integrations · GitHub), that will help save space and should speed up queries.
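To illustrate the index: false idea, here is a mapping sketch (the index and field names are hypothetical); the field stays in _source, it is just no longer indexed for fast search:

```
PUT filebeat-short
{
  "mappings": {
    "properties": {
      "event.original": {
        "type": "keyword",
        "index": false
      }
    }
  }
}
```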

Elasticsearch has become much better at dealing with sparse events over the years, so I would be surprised if just updating the mapping had a large effect on storage savings and query speed.

My recommendation at the moment would be to start optimising which fields are indexed and how, use rollups if they are an option, and see if this solves the problem instead of trying to ingest the data twice.


Thanks @ruflin, I will mark this as the solution, as there is a lot to be analyzed. It makes sense to step back before ingesting twice, and I will communicate that.

I will also let you know if I see a big difference between dashboards using the "thin" or "thick" field indices.
