Best practice to denormalize array objects

snicoll · December 20, 2018, 9:13am

We're working on a new data structure to display in Kibana and we've got the "Objects in arrays are not well supported" warning. We also found that creating queries on this data structure does not seem straightforward.

We're looking at "project usage" on a given period and we are recording the number of times a particular artifact has been used. To be able to create a trend, we also denormalize the data to know the total usage, the usage per major generation and the usage per minor generation.

We'd like to have a pie chart for the major generations and a graph with multiple lines for the minor generations (where each line would display the share of that generation relative to the total, see totalCount).

What is the recommendation to structure this information in the index? The versions are not known in advance so an array sounds like an obvious choice for this.

{
  "_index": "projects",
  "_type": "download",
  "_id": "AWeOZZcxe8fT5tlzG2HU",
  "_version": 1,
  "_score": null,
  "_source": {
    "from": 1541030400000,
    "to": 1543622399000,
    "projectId": "test",
    "groupId": "com.example",
    "artifactId": "acme",
    "totalCount": 170,
    "majorGenerations": [
      {
        "name": "0",
        "count": 10
      },
      {
        "name": "1",
        "count": 160
      }
    ],
    "minorGenerations": [
      {
        "name": "0.8",
        "count": 10
      },
      {
        "name": "1.0",
        "count": 60
      },
      {
        "name": "1.2",
        "count": 100
      }
    ],
    "stats": [
      {
        "version": "0.8.0",
        "count": 10
      },
      {
        "version": "1.0.0",
        "count": 20
      },
      {
        "version": "1.0.1",
        "count": 40
      },
      {
        "version": "1.2.0",
        "count": 100
      }
    ]
  }
}

lukas · December 20, 2018, 6:29pm

It looks like you've already aggregated your data before indexing it into Elasticsearch.

Could you instead index a document for each "usage"? It would contain info like projectId, groupId, and also contain the majorGenerationName, minorGenerationName, statsVersion, etc. Then you could do terms aggregations over projectId/groupId/artifactId and do a count for each of them.

Brian_Clozel · December 20, 2018, 8:15pm

Hi Lukas, thanks a lot for your reply.
Unfortunately, we're gathering data from Maven Central statistics, and the stats array is basically the raw data we're getting (it's already aggregated).

We were thinking, as a middle ground solution, to duplicate things a bit like:

{ "projectId" : "test", "count" : 25, "generation": "version", "name" : "1.0.1.RELEASE"}
{ "projectId" : "test", "count" : 25, "generation": "version", "name" : "1.0.1.RELEASE"}
{ "projectId" : "test", "count" : 50, "generation": "minor", "name" : "1.0"}
{ "projectId" : "test", "count" : 50, "generation": "minor", "name" : "1.1"}
{ "projectId" : "test", "count" : 100, "generation": "major", "name" : "1"}

This avoids the nested documents issue, but it effectively duplicates "usage" data because "major" sums up all related "minor", which sums up all related "version". So unless we filter on a specific generation, any metric aggregation would be wrong.
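To make the double counting concrete, here is a minimal Python sketch that sums the count field over the five sample documents above, in memory. The helper name `sum_counts` is hypothetical; it only mimics what a sum metric aggregation would do and is not Elasticsearch code.

```python
from typing import Optional

# The five sample documents from the post above, copied as-is.
DOCS = [
    {"projectId": "test", "count": 25, "generation": "version", "name": "1.0.1.RELEASE"},
    {"projectId": "test", "count": 25, "generation": "version", "name": "1.0.1.RELEASE"},
    {"projectId": "test", "count": 50, "generation": "minor", "name": "1.0"},
    {"projectId": "test", "count": 50, "generation": "minor", "name": "1.1"},
    {"projectId": "test", "count": 100, "generation": "major", "name": "1"},
]


def sum_counts(docs, generation: Optional[str] = None) -> int:
    """Sum `count`, optionally keeping only one generation level."""
    return sum(
        d["count"]
        for d in docs
        if generation is None or d["generation"] == generation
    )


if __name__ == "__main__":
    print(sum_counts(DOCS))           # every level mixed together
    print(sum_counts(DOCS, "major"))  # a single level only
```

With these sample documents the unfiltered sum is 250, while the "major" level alone gives 100, which is why every visualization would need a filter on generation.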

Another solution would be to really apply what you said and have one document per "usage", but this feels weird since we would get millions of identical documents each month: they would all share the same date (Maven Central provides monthly stats only).

Sorry for turning this question into an Elasticsearch mapping question, but we're trying to do right by both Elasticsearch and Kibana here.

Thanks!

Brian_Clozel · December 21, 2018, 2:39pm

Answering our own question here.

Indeed, avoiding nested types and un-aggregating data where possible is the key. I guess that's what @lukas meant in the first place and I didn't get it!

With something like the following, we avoid duplicates: we can sum the count field and run aggregations per "id", "major" or "minor" for a given project.

{ "projectId" : "test", "count" : 25, "version": { "id": "1.0.1.RELEASE", "major": "1", "minor": "1.0"}}
{ "projectId" : "test", "count" : 32, "version": { "id": "1.0.2.RELEASE", "major": "1", "minor": "1.0"}}
{ "projectId" : "test", "count" : 12, "version": { "id": "1.2.5.RELEASE", "major": "1", "minor": "1.2"}}
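A rough Python sketch of how the monthly stats array from the first post could be turned into documents of this shape. The `flatten_stats` and `sum_by` helpers are hypothetical names, and deriving "major" and "minor" by splitting the version string on dots is an assumption that only holds for versions such as 1.0.1 or 1.0.1.RELEASE. `sum_by` mimics in memory what a terms aggregation with a sum sub-aggregation would return. The query body at the end is also an assumption: it presumes `version.major` and `projectId` are mapped as keyword fields.

```python
from collections import defaultdict

# The "stats" array from the first post (raw per-version counts).
STATS = [
    {"version": "0.8.0", "count": 10},
    {"version": "1.0.0", "count": 20},
    {"version": "1.0.1", "count": 40},
    {"version": "1.2.0", "count": 100},
]


def flatten_stats(project_id, stats):
    """Build one flat document per version, with major/minor derived from it."""
    docs = []
    for entry in stats:
        parts = entry["version"].split(".")
        docs.append({
            "projectId": project_id,
            "count": entry["count"],
            "version": {
                "id": entry["version"],
                "major": parts[0],             # "1.0.1" -> "1"
                "minor": ".".join(parts[:2]),  # "1.0.1" -> "1.0"
            },
        })
    return docs


def sum_by(docs, level):
    """Sum `count` per value of version[level] ("id", "major" or "minor")."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["version"][level]] += doc["count"]
    return dict(totals)


# Assumed search body: usage per major generation for one project.
# Requires keyword mappings; adjust field names to the real mapping.
MAJOR_USAGE_QUERY = {
    "size": 0,
    "query": {"term": {"projectId": "test"}},
    "aggs": {
        "by_major": {
            "terms": {"field": "version.major"},
            "aggs": {"usage": {"sum": {"field": "count"}}},
        }
    },
}

if __name__ == "__main__":
    docs = flatten_stats("test", STATS)
    print(sum_by(docs, "major"))
    print(sum_by(docs, "minor"))
```

Run on the sample stats, the per-major sums are 10 and 160 and the per-minor sums are 10, 60 and 100, matching the majorGenerations and minorGenerations arrays of the original document, and the overall sum is 170, its totalCount. Each count is stored once, so no filter on a generation level is needed.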
