Best practice to denormalize array objects

We're working on a new data structure to display in Kibana and we've got the "Objects in arrays are not well supported" warning. We also found that writing queries against this data structure is not straightforward.

We're looking at "project usage" over a given period and we are recording the number of times a particular artifact has been used. To be able to create a trend, we also denormalize the data to know the total usage, the usage per major generation and the usage per minor generation.

We'd like to have a pie chart for the major generations and a graph with multiple lines for the minor generations (where each line would show that generation's share of the total, see totalCount).

What is the recommendation to structure this information in the index? The versions are not known in advance so an array sounds like an obvious choice for this.

{
  "_index": "projects",
  "_type": "download",
  "_id": "AWeOZZcxe8fT5tlzG2HU",
  "_version": 1,
  "_score": null,
  "_source": {
    "from": 1541030400000,
    "to": 1543622399000,
    "projectId": "test",
    "groupId": "com.example",
    "artifactId": "acme",
    "totalCount": 170,
    "majorGenerations": [
      {
        "name": "0",
        "count": 10
      },
      {
        "name": "1",
        "count": 160
      }
    ],
    "minorGenerations": [
      {
        "name": "0.8",
        "count": 10
      },
      {
        "name": "1.0",
        "count": 60
      },
      {
        "name": "1.2",
        "count": 100
      }
    ],
    "stats": [
      {
        "version": "0.8.0",
        "count": 10
      },
      {
        "version": "1.0.0",
        "count": 20
      },
      {
        "version": "1.0.1",
        "count": 40
      },
      {
        "version": "1.2.0",
        "count": 100
      }
    ]
  }
}

It looks like you've already aggregated your data before indexing it into Elasticsearch.

Could you instead index a document for each "usage"? It would contain info like projectId, groupId, and also contain the majorGenerationName, minorGenerationName, statsVersion, etc. Then you could do terms aggregations over projectId/groupId/artifactId and do a count for each of them.

Hi Lukas, thanks a lot for your reply.
Unfortunately, we're gathering data from Maven Central statistics, and the stats array is basically the raw data we're getting (it's already aggregated).

We were thinking, as a middle ground solution, to duplicate things a bit like:

{ "projectId" : "test", "count" : 25, "generation": "version", "name" : "1.0.1.RELEASE"}
{ "projectId" : "test", "count" : 25, "generation": "version", "name" : "1.0.1.RELEASE"}
{ "projectId" : "test", "count" : 50, "generation": "minor", "name" : "1.0"}
{ "projectId" : "test", "count" : 50, "generation": "minor", "name" : "1.1"}
{ "projectId" : "test", "count" : 100, "generation": "major", "name" : "1"}

This avoids the nested documents issue, but it effectively duplicates "usage" data because "major" sums up all related "minor", which sums up all related "version". So unless we filter on a specific generation, any metric aggregation would be wrong.
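To make the double counting concrete, here is a small sketch in plain Python (not an Elasticsearch query) over the five sample documents above. The helper name total_usage is made up for this illustration; it only mimics what a sum on count would do with and without a filter on generation.

```python
# Sample documents copied from the post above.
docs = [
    {"projectId": "test", "count": 25, "generation": "version", "name": "1.0.1.RELEASE"},
    {"projectId": "test", "count": 25, "generation": "version", "name": "1.0.1.RELEASE"},
    {"projectId": "test", "count": 50, "generation": "minor", "name": "1.0"},
    {"projectId": "test", "count": 50, "generation": "minor", "name": "1.1"},
    {"projectId": "test", "count": 100, "generation": "major", "name": "1"},
]


def total_usage(documents, generation=None):
    """Sum `count`, optionally restricted to one generation level.

    Hypothetical helper: it stands in for a sum metric with or
    without a filter on the `generation` field.
    """
    return sum(
        d["count"]
        for d in documents
        if generation is None or d["generation"] == generation
    )


# Without a filter, every level is added on top of the others.
unfiltered_total = total_usage(docs)

# With a filter, each level is counted once.
major_total = total_usage(docs, "major")
minor_total = total_usage(docs, "minor")

print(unfiltered_total, major_total, minor_total)
```

The unfiltered sum mixes the three levels together, so every visualization would need a filter on generation to stay correct.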

Another solution would be to really apply what you said and have one document per "usage", but this feels weird since we would get millions of identical documents each month, all sharing the same date (Maven Central provides monthly stats only).

Sorry for turning this question into an Elasticsearch mapping question, but we're trying to do right by both Elasticsearch and Kibana here.

Thanks!

Answering our own question here.

Indeed, avoiding nested types and un-aggregating data where possible is the key. I guess that's what @lukas meant in the first place and I didn't get it!

With something like the following, we avoid duplicates: each version is stored once, and we can sum the count field while aggregating per "id", "major" or "minor" for a given project.

{ "projectId" : "test", "count" : 25, "version": { "id": "1.0.1.RELEASE", "major": "1", "minor": "1.0"}}
{ "projectId" : "test", "count" : 32, "version": { "id": "1.0.2.RELEASE", "major": "1", "minor": "1.0"}}
{ "projectId" : "test", "count" : 12, "version": { "id": "1.2.5.RELEASE", "major": "1", "minor": "1.2"}}
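For anyone landing here later, a rough Python sketch of how the raw stats array from the first post could be flattened into documents of this shape. The function names (flatten_stats, sum_by, usage_request) are invented for this example, and it assumes every version string starts with "major.minor". The request body is a standard terms aggregation with a sum sub-aggregation; it assumes version.major and version.minor are mapped as keyword fields (with default dynamic mapping you would likely need the .keyword sub-field instead).

```python
from collections import defaultdict


def flatten_stats(source):
    """Turn the raw `stats` array into one flat document per version.

    Assumes each version string starts with "major.minor".
    """
    docs = []
    for entry in source["stats"]:
        parts = entry["version"].split(".")
        docs.append({
            "projectId": source["projectId"],
            "from": source["from"],
            "to": source["to"],
            "count": entry["count"],
            "version": {
                "id": entry["version"],
                "major": parts[0],
                "minor": ".".join(parts[:2]),
            },
        })
    return docs


def sum_by(docs, level):
    """Local stand-in for a terms aggregation with a sum on `count`."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["version"][level]] += doc["count"]
    return dict(totals)


def usage_request(project_id, level):
    """Search request body: usage per `level` for one project.

    Assumes `version.<level>` is a keyword field.
    """
    return {
        "size": 0,
        "query": {"term": {"projectId": project_id}},
        "aggs": {
            "by_generation": {
                "terms": {"field": "version." + level},
                "aggs": {"usage": {"sum": {"field": "count"}}},
            }
        },
    }


# Values taken from the document in the first post.
source = {
    "from": 1541030400000,
    "to": 1543622399000,
    "projectId": "test",
    "stats": [
        {"version": "0.8.0", "count": 10},
        {"version": "1.0.0", "count": 20},
        {"version": "1.0.1", "count": 40},
        {"version": "1.2.0", "count": 100},
    ],
}

docs = flatten_stats(source)
by_major = sum_by(docs, "major")
by_minor = sum_by(docs, "minor")
print(by_major, by_minor)
```

Summing the flattened documents per major or minor gives back the same numbers as the majorGenerations and minorGenerations arrays in the original document, so those arrays (and totalCount) no longer need to be stored.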
