Large log data indexing best practices (datastreams?)

Hi there!
I've been working to refine my Elastic instance for months now, and have gotten a bit tangled in the details. :face_with_spiral_eyes:
Before promoting the PoC into a live production environment, I could use a more expert opinion on whether I'm doing things right.

This use-case involves ingesting a TON of data, multiple TBs/day, from a containerized application via Filebeat->Logstash.
At first we were using dynamic mapping, and split up the data by Container name, to keep like-data together.¹
However, some containers auto-generate field names and we quickly hit the too-many-fields limit, so we switched to a purely explicitly-mapped approach.²

These are all log files, so I decided to use Data Streams to manage ILM and rollover. The source dynamically creates ~100 data streams (by container type). ³
We're currently using the recommended defaults (50 GB shards or 20 days), but my guess is they're not the best fit for this use-case... there would be dozens of rollovers per day.⁴
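For reference, this is roughly what the rollover conditions in our ILM policy's hot phase look like (the policy name is illustrative, not our real one):

ILM policy
      PUT _ilm/policy/app-logs
      {
        "policy": {
          "phases": {
            "hot": {
              "actions": {
                "rollover": {
                  "max_primary_shard_size": "50gb",
                  "max_age": "20d"
                }
              }
            }
          }
        }
      }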

This project has had a lot of iterations and growing pains :sweat_smile:

  • Should we just go back to a single stream for all data?
    From a mapping perspective² it seems like there isn't much benefit to splitting them up any more, but given the size of the ingest⁴ it may not matter?
  • Are Data Streams still a reasonable way to manage ILM in this use-case?
  • We don't know the exact scale of end result.
    Should we try to limit the size of indexes/shards, or just do a time-based rollover? Any recommendations on the sizing?
  • For ² - is there any way to regain search capabilities on unmapped fields?

I'd love to know your thoughts, and really appreciate any guidance.

Thanks!

Can you share some examples of your data? It's not clear what you mean by container name or what your data looks like; is this related to custom applications?

Creating data streams per container is, in my opinion, bad practice, as it can lead to a large number of indices; you should group the logs per dataset. But it is not clear what the sources of your logs are, nor their format. For example, if you have the same application running on multiple containers, then you should use one data stream for that application, as the log format is expected to be the same.

Also, you should try to use ECS and normalize field names and types. For example, one application may use source_ip for a field holding the source IP of an event while another application may use src_ip for the same thing; you should transform both into source.ip, which is the ECS field for this kind of information.
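As a sketch, in Logstash that normalization could be a couple of mutate renames (the field names here are just the examples above, not your actual data):

ECS rename filter
      filter {
        # Map vendor-specific field names onto the ECS equivalent.
        if [source_ip] {
          mutate { rename => { "source_ip" => "[source][ip]" } }
        }
        if [src_ip] {
          mutate { rename => { "src_ip" => "[source][ip]" } }
        }
      }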

I would say that data streams are the way to go; you would use normal indices only if your use case requires it (for example, if you need to update documents). For any time-based data, data streams make the management easier.
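For context, a data stream only needs a matching index template with the data_stream object enabled; a minimal sketch (template name, pattern, and policy name are placeholders):

Index template
      PUT _index_template/app-logs
      {
        "index_patterns": ["logs-app-*"],
        "data_stream": {},
        "priority": 200,
        "template": {
          "settings": {
            "index.lifecycle.name": "app-logs"
          }
        }
      }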


Ah, sorry for the ambiguity!
It is one data stream for each type of application container, each different image. So one DS for postgresql, one for redis, one for calico-node etc. Just counted - there's 84 types.
I'm not heavily involved in the source/end use of the logs, so I'm not sure what other datasets they might be broken into.
It's a vendor-provided solution, so unlikely we can change much in the source...

I'm leaning towards putting everything into one big Data Stream and then tweaking the rollover. Thoughts?


For the data structure, we do control the Logstash receiver; could I just enable ecs_compatibility on the output plugin?
Everything is from k8s, and we're currently only mapping the metadata.

Current Logstash output config:

Logstash Output
      if [container_name] {
        elasticsearch {
          hosts  => ["<HOST>:9200"]
          index  => "app_%{container_name}"
          action => "create"
        }
      }

Although after reviewing the docs, these are probably better:

With data_stream
      if [container_name] {
        elasticsearch {
          hosts               => ["<HOST>:9200"]
          data_stream         => "true"
          data_stream_dataset => "%{container_name}"
   ???->  ecs_compatibility   => "v8"
        }
      }

Here are some examples of the data.
Let me know if a different format/more logs would be helpful!

Examples
{
   "@timestamp": ["2025-02-19T18:49:42.525Z"],
   "@version": ["1"],
   "cluster_name": ["apc05se1shcc"],
   "container_name": ["depi"],
   "event.original": [
     "Connection didn't change since previous snapshot. CONTINUE..."
   ],
   "File": ["src/cosdepid/CosDepiApplication.cpp"],
   "Function": ["HeartbeatStandby"],
   "Line": ["441"],
   "message": ["Connection didn't change since previous snapshot. CONTINUE..."],
   "node_name": ["98.120.32.42"],
   "orchestrator.cluster.name": ["kubernetes"],
   "orchestrator.cluster.url": ["https://98.120.32.42:443"],
   "Package": ["cos-depi"],
   "pod_name": ["vcmts-cd-18-2"],
   "pod_namespace": ["default"],
   "pod_uid": ["995d157d-8c22-4a0c-8162-208fc2fee8c1"],
   "Severity": ["TRACE"],
   "site_name": ["charter"],
   "tags": ["beats_input_codec_plain_applied"],
   "type": ["vcmts"],
   "version": ["3.21.11.500-5"],
   "Version": ["1.1-292.5"],
   "_id": "b3CLH5UB8MS8B30fYMHx",
   "_index": ".ds-vcmts1_depi-2025.02.13-000002",
   "_score": null
 }
{
  "@timestamp": ["2025-02-19T18:49:42.525Z"],
  "@version": ["1"],
  "cluster_name": ["apc04se1shcc"],
  "CmMacAddress": ["84:0b:7c:7c:88:64"],
  "container_name": ["mulpi"],
  "Event": ["CM_CTRL_OUDP_MODEM_NOT_FOUND"],
  "event.original": ["Modem not online for mac-domain"],
  "File": ["../../src/cm/CmController.cpp"],
  "Function": ["FillCmsOudpDetailsPerMd"],
  "InternalStatusCode": ["3"],
  "Line": ["4773"],
  "Logger": ["ulcmulpid.CmController.0x17001000"],
  "MdId": ["0x17001000"],
  "message": ["Modem not online for mac-domain"],
  "node_name": ["24.28.220.76"],
  "orchestrator.cluster.name": ["kubernetes"],
  "orchestrator.cluster.url": ["https://24.28.220.76:443"],
  "Package": ["ulc-mulpi"],
  "pod_name": ["vcmts-cd-2-1"],
  "pod_namespace": ["default"],
  "pod_uid": ["47b00dce-2fe7-4a9a-9fa0-8f4dcd631df9"],
  "Role": ["active"],
  "Severity": ["ERROR"],
  "site_name": ["charter"],
  "tags": ["beats_input_codec_plain_applied"],
  "type": ["vcmts"],
  "UcId": ["0x2a"],
  "version": ["3.21.11.500-5"],
  "Version": ["1.48-569.43"],
  "_id": "enGLH5UB8MS8B30fZQbl",
  "_index": ".ds-vcmts1_mulpi-2025.02.19-000007",
  "_score": null
}
{
  "@timestamp": ["2025-02-19T18:49:42.524Z"],
  "@version": ["1"],
  "cluster_name": ["apc05se1shcc"],
  "container_name": ["sched"],
  "endMinislot": ["79"],
  "event.original": ["BcmUpstreamMapOfdma::StartLeakageTestHs"],
  "File": ["../../src/bcm/Scheduler/src/UpstreamMapOfdma.cpp"],
  "fMslotsPerFrame": ["237"],
  "fNumFramesInMap": ["0"],
  "Function": ["StartLeakageTestHs"],
  "grantsPerSid": ["8"],
  "Line": ["1889"],
  "message": ["BcmUpstreamMapOfdma::StartLeakageTestHs"],
  "node_name": ["98.120.32.44"],
  "orchestrator.cluster.name": ["kubernetes"],
  "orchestrator.cluster.url": ["https://98.120.32.44:443"],
  "Package": ["ulc-scheduler"],
  "perSidDurationUsec": ["2880"],
  "pod_name": ["vcmts-cd-4-0"],
  "pod_namespace": ["default"],
  "pod_uid": ["4c4e130a-e084-4f9a-97d9-db023d5b4ef0"],
  "requestId": ["56847829"],
  "Severity": ["DEBUG"],
  "site_name": ["charter"],
  "startMinislot": ["74"],
  "tags": ["beats_input_codec_plain_applied"],
  "testEndTgc": ["17817508040250828"],
  "testStartTgc": ["17817508040132864"],
  "type": ["vcmts"],
  "version": ["3.21.11.500-5"],
  "Version": ["1.53-80.6"],
  "_id": "j3GLH5UB8MS8B30fa1U7",
  "_index": ".ds-vcmts1_sched-2025.02.19-000040",
  "_score": null
}
{
  "@timestamp": ["2025-02-19T18:49:42.533Z"],
  "@version": ["1"],
  "cluster_name": ["APC06K1SACC"],
  "CmMacAddress": ["3c:2d:9e:d6:c7:d4"],
  "container_name": ["mulpi"],
  "Event": ["CM_CTRL_OUDP_MODEM_NOT_FOUND"],
  "event.original": ["Modem not found for mac domain"],
  "File": ["../../src/cm/CmController.cpp"],
  "Function": ["FillCmsOudpDetailsPerMd"],
  "InternalStatusCode": ["2"],
  "Line": ["4739"],
  "Logger": ["ulcmulpid.CmController.0x1000000"],
  "MdId": ["0x1000000"],
  "message": ["Modem not found for mac domain"],
  "node_name": ["71.85.84.186"],
  "orchestrator.cluster.name": ["kubernetes"],
  "orchestrator.cluster.url": ["https://71.85.84.186:443"],
  "Package": ["ulc-mulpi"],
  "pod_name": ["vcmts-cd-0-0"],
  "pod_namespace": ["default"],
  "pod_uid": ["ab179469-28aa-4bd2-b0cc-b17f3e527d5e"],
  "Role": ["active"],
  "Severity": ["ERROR"],
  "site_name": ["charter"],
  "tags": ["beats_input_codec_plain_applied"],
  "type": ["vcmts"],
  "UcId": ["0x2a"],
  "version": ["3.21.7.0-1-auto32"],
  "Version": ["1.48-569.35"],
  "_id": "anGLH5UB8MS8B30fZgys",
  "_index": ".ds-vcmts1_mulpi-2025.02.19-000007",
  "_score": null
}