Best structure for easy/efficient searching of complex data

I'm working on exporting result data from our configuration management system into Elasticsearch, so we can more easily understand failures across our fleet. Each run is composed of many "states" that are applied to a machine.

Currently, I send the data to ES as one document per state, structured like this:

{
  <host metadata>,
  "jid": <job id>,
  "state": {
    "id": <state id>,
    "success": true,
    ...
  }
}

However, this structure makes it tricky to answer questions like "which states failed on the most recent run for each host?", since that requires grouping first by job id and then by host. I'm not 100% sure how to do that, and it is definitely harder when I don't have access to the Elasticsearch query DSL.
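To make that grouping concrete, here is a minimal client-side sketch in Python of the question being asked, run over the flat one-document-per-state layout. It is not an Elasticsearch query. The `host` field name, the sample documents, and the function name are hypothetical, and it assumes that a larger `jid` means a more recent run, which may not hold for every job id format.

```python
def failed_states_on_latest_run(docs):
    """Return {host: sorted failed state ids} for each host's most recent run.

    Assumes each doc looks like
    {"host": ..., "jid": ..., "state": {"id": ..., "success": bool}}
    and that jids compare in chronological order (an assumption).
    """
    # Pass 1: find the most recent job id seen for each host.
    latest_jid = {}
    for doc in docs:
        host = doc["host"]
        if host not in latest_jid or doc["jid"] > latest_jid[host]:
            latest_jid[host] = doc["jid"]

    # Pass 2: keep only failed states that belong to that latest run.
    failures = {host: [] for host in latest_jid}
    for doc in docs:
        host = doc["host"]
        if doc["jid"] == latest_jid[host] and not doc["state"]["success"]:
            failures[host].append(doc["state"]["id"])
    return {host: sorted(ids) for host, ids in failures.items()}


# Hypothetical sample data: two runs on web1, one run on db1.
SAMPLE_DOCS = [
    {"host": "web1", "jid": "20240101", "state": {"id": "pkg", "success": False}},
    {"host": "web1", "jid": "20240102", "state": {"id": "pkg", "success": True}},
    {"host": "web1", "jid": "20240102", "state": {"id": "svc", "success": False}},
    {"host": "db1", "jid": "20240102", "state": {"id": "pkg", "success": True}},
]
```

The two passes correspond to the two levels of grouping: first pick the latest run per host, then filter that run's states by `success`. Any schema that answers the question server-side has to express both steps.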

I've also considered using the nested datatype, so I could have something more like this:

{
  <host metadata>,
  "jid": <job id>,
  "states": [
    {
      "id": <state id>,
      "success": true,
      ...
    },
    ...
  ]
}

However, I've read that nested fields have a (soft?) limit of 10,000 items in a single list, which we could plausibly hit since we currently have ~3,000 states in a single run.

How should I structure this? Is there a schema that would make queries significantly more performant, or that would be easier to query?
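For reference, the nested variant could be sketched in Python roughly as follows. This is an illustration under assumptions, not a recommendation: the `host` field and the `build_run_document` helper are hypothetical, the mapping uses the typeless format of recent Elasticsearch versions, and the limit is assumed to be the `index.mapping.nested_objects.limit` index setting with a default of 10,000 (worth checking against the documentation for your version).

```python
# Assumed default of the index.mapping.nested_objects.limit setting.
NESTED_OBJECTS_LIMIT = 10_000

# Index creation body: one document per run, with states as nested objects.
INDEX_BODY = {
    "settings": {"index.mapping.nested_objects.limit": NESTED_OBJECTS_LIMIT},
    "mappings": {
        "properties": {
            "host": {"type": "keyword"},  # hypothetical stand-in for host metadata
            "jid": {"type": "keyword"},
            "states": {
                "type": "nested",
                "properties": {
                    "id": {"type": "keyword"},
                    "success": {"type": "boolean"},
                },
            },
        }
    },
}


def build_run_document(host, jid, states, limit=NESTED_OBJECTS_LIMIT):
    """Build one per-run document, refusing runs that exceed the nested limit."""
    if len(states) > limit:
        raise ValueError(
            f"run {jid} has {len(states)} states, above the nested limit of {limit}"
        )
    return {
        "host": host,
        "jid": jid,
        "states": [{"id": s["id"], "success": s["success"]} for s in states],
    }
```

With ~3,000 states per run, a document built this way stays under the assumed default, and the explicit check makes it obvious if a run ever grows past it.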
