Feedback on tuning Data Streams for my use case

I'm working on a use case where we need to copy messages from two Kafka topics to Elasticsearch in order to search and visualize the data in a web app.

I have set up Elastic Cloud on Kubernetes (ECK) with Elasticsearch and Kibana, and performed a quick PoC using an ordinary index without aliases or ILM.

Now I want to take the PoC further and build a production-ready Elastic Stack. Since I'm working with time series data, I think Data Streams would be a good fit for my use case.

I have created an overview of my current cluster setup and the characteristics of Documents A and B in the diagram below. Document B contains a GPS position plus some metadata for a vehicle at a given point in time; Document A contains trip data for a vehicle and will be used in the search field. When a trip is selected, the map visualizes all GPS positions for that vehicle by retrieving all related Document B entries (based on a ref value).
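To make the retrieval pattern concrete, here is a rough sketch of the query I have in mind for the map view. The data stream name `vehicle-positions`, the field names `ref` and `@timestamp`, and the trip id are placeholders, not my actual mapping:

```json
GET vehicle-positions/_search
{
  "query": {
    "term": { "ref": "trip-12345" }
  },
  "sort": [
    { "@timestamp": "asc" }
  ],
  "size": 10000
}
```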

I would like feedback on following questions:

  • Since the characteristics and payloads of Documents A and B are different, I'm thinking of separating them into two different Data Streams/indices. I guess you agree?
  • How many shards should I have for the Data Streams for Documents A and B?
  • Given a data retention of 30 days, should I have a rollover pattern per day (one backing index per day) for both Data Streams, or use a different approach for each?
  • Any input on other cluster-related settings is also welcome!
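For context, here is roughly what I have in mind for the ILM policy and index template behind one of the Data Streams. The policy/template names, the daily rollover, and the single-shard setting are just my current assumptions, not a finished design:

```json
PUT _ilm/policy/positions-30d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

PUT _index_template/vehicle-positions
{
  "index_patterns": ["vehicle-positions*"],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "positions-30d"
    }
  }
}
```

Document A would get its own policy/template along the same lines, which is partly why I'm asking whether the same rollover approach makes sense for both.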

The web app will be used by a limited number of users (2-5). Replication is set to 0 since we can afford to lose data if one node breaks.
