I have a requirement to ingest metricbeat/filebeat data twice: once with all the default fields, and a second time with only the fields the dashboards will use.
These two versions will also have different retention policies. The goal is to save space by keeping the detailed logs for a shorter time than the "dashboard logs", and also to make the dashboards faster.
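For the retention part, one way to express the two periods is an ILM policy per index version. A minimal sketch, assuming ILM is in use; the policy name and ages are placeholders:

```
PUT _ilm/policy/filebeat-full-retention
{
  "policy": {
    "phases": {
      "hot": { "actions": {} },
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

A second policy with a longer `min_age` would then be attached to the reduced "dashboard" indices.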
Migrating to Fleet is not an option at the moment.
The options I can think of:
A) Adding Logstash in the middle and duplicating events
B) Running metricbeat/filebeat twice with different configuration files (for Filebeat I would have to use different config and data paths so the two registries don't conflict).
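For option B, the second ("thin") instance could ship only the dashboard fields with the `include_fields` processor. A minimal sketch; the field list, index name, and paths are placeholders, and a custom index name also needs matching `setup.template` settings:

```yaml
# filebeat-thin.yml -- hypothetical config for the second instance
path.data: /var/lib/filebeat-thin      # separate registry/data dir

processors:
  - include_fields:
      # placeholder dashboard fields; @timestamp is always kept
      fields: ["message", "host.name", "log.file.path"]

output.elasticsearch:
  hosts: ["https://es:9200"]
  index: "filebeat-thin-%{+yyyy.MM.dd}"
```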
This is a big infrastructure, so A would be difficult, and I'm not sure whether B would affect host performance in some way.
I am not aware of other options. Keep in mind that when you run two Filebeat/Metricbeat instances, you have to configure a different data folder for each; otherwise, IDs and file states can interfere with each other.
Also, you could open an enhancement request on GitHub for a processor similar to Logstash's existing clone plugin.
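For reference, the Logstash clone filter duplicates each event and tags the copy via its `type`, so option A could be sketched roughly like this (the clone name, field whitelist, and index names are placeholders):

```
filter {
  clone {
    clones => ["thin"]            # emits a copy with type "thin"
  }
  if [type] == "thin" {
    prune {
      # keep only the dashboard fields on the cloned copy
      whitelist_names => ["^@timestamp$", "^message$", "^host$"]
    }
  }
}

output {
  if [type] == "thin" {
    elasticsearch { index => "filebeat-thin-%{+YYYY.MM.dd}" }
  } else {
    elasticsearch { index => "filebeat-full-%{+YYYY.MM.dd}" }
  }
}
```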
That issue is about Elastic Agent, not Beats. You can have multiple outputs in Elastic Agent, but it basically starts multiple Beat instances, so you are better off running multiple instances of Beats without Agent (unless you need the features Agent provides).
Also considered that.
But is there a way to do this within Elastic? Otherwise this means configuring an external tool, and we are back in roughly the same place as with Logstash. The reduced version of the index needs to be available at the same time as the extended version (live).
@kvch I was expecting to find the data folder as a flag under the run section, but didn't find it:
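If I'm not mistaken, it isn't under `run` but among the global flags: `--path.data` (together with `-c` for the config file and `--path.logs`). A hypothetical invocation of the second instance; paths are placeholders:

```
filebeat -c /etc/filebeat/filebeat-thin.yml \
  --path.data /var/lib/filebeat-thin \
  --path.logs /var/log/filebeat-thin
```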
I wonder if you could use the Reindex API for this: Reindex API | Elasticsearch Guide [8.2] | Elastic. By default you would also query both indices at the same time, and the reduced index would be created daily.
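If it helps, `_reindex` supports `_source` filtering, so the reduced daily index could be built with something like this (index names and the field list are placeholders):

```
POST _reindex
{
  "source": {
    "index": "filebeat-full-2022.06.01",
    "_source": ["@timestamp", "message", "host.name"]
  },
  "dest": {
    "index": "filebeat-thin-2022.06.01"
  }
}
```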
I assume you looked into rollups? Rollup overview | Elasticsearch Guide [8.2] | Elastic. And it seems what you are looking for covers not only metrics but also log data? For the log data, which fields would you get rid of?
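For reference, a rollup job in that version looks roughly like this (job name, index patterns, interval, and metrics are placeholders):

```
PUT _rollup/job/metricbeat-rollup
{
  "index_pattern": "metricbeat-*",
  "rollup_index": "metricbeat-rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
  },
  "metrics": [
    { "field": "system.cpu.total.pct", "metrics": ["avg", "max"] }
  ]
}
```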
On the query performance side, I wonder whether ingesting only parts of the documents will really make a difference. Have you tested this? If not, we are left with the storage savings. What is the driver here, cost?
I thought about running a cron script that reindexes with fewer fields, but introducing custom tools is not something I can do.
I understand rollups reduce the number of fields after some time, but the whole idea is to have both versions of the data available at the same time, and then keep only the shorter version afterwards.
About reducing fields for performance, I agree with you; I'm not sure that would improve performance.
The requirement started when, after finishing the Kibana dashboards, the loading time was 15-20 s, with <200 ms ES queries and the Kibana nodes not stressed at all. Elastic Support discovered this was a bug related to the number of mapping fields, and that upgrading would fix it.
ES was updated and the loading time decreased, but it was still high. Then we asked ourselves:
what if we remove the unused fields from the mappings anyway?
what if we also remove the unused fields from the documents themselves, so we query dashboards against lighter data and save space?
Every query runs against millions of documents, so maybe this impacts Kibana performance.
Thanks for the additional background info. My general concern is that we might be trying to optimise the wrong thing and end up with a complicated solution that works but is still limited.
Elasticsearch has become much better at dealing with sparse events over the years, so I would be surprised if just updating the mapping had a large effect on storage savings and query speed.
My recommendation at the moment would be to start by optimising which fields are indexed and how, use rollups if they are an option, and see if this solves the problem instead of ingesting the data twice.
Thanks @ruflin, I will mark this as the Solution, as there is a lot to be analyzed. It makes sense to step back before ingesting twice, and I will communicate that.
I will also let you know if I see a big difference between dashboards backed by the "thin" vs. "thick" field indices.