Transform: Sum doesn't add up

I have an existing index datalog.gl_sep20 and I want to transform it into a new index, grouping by some fields and doing a sum aggregation on the "count" field. This is how I do it:

curl -X PUT "localhost:9200/_transform/transform_job?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "datalog.gl_sep20"
  },
  "dest" : { 
    "index" : "datalog.transformed_sep20"
  },
  "pivot": {
    "group_by": { 
      "domain": { "terms": { "field": "domain" } },
      "os": { "terms": { "field": "os" }},
      "time": { "date_histogram": { "field": "time", "fixed_interval": "24h"}}
    },
    "aggregations": {
      "count": { "sum": { "field": "count" }}
    }
  }
}
'

Then I start the transform job:

curl -X POST "localhost:9200/_transform/transform_job/_start?pretty"

The transform does not return any error and appears to work fine. However, when I do a sanity check on the total value of "count" in the original index and in the transformed index, they are not equal.
E.g. the original index returns a total_count of 3731:

curl -X POST "localhost:9200/datalog.gl_sep20/_search?size=0&pretty"  -d'
{
  "aggs": {
    "total_count": { "sum": { "field": "count" } }
  }
}
'

while the transformed index returns a total_count of 2724:

curl -X POST "localhost:9200/datalog.transformed_sep20/_search?size=0&pretty" -d'
{
  "aggs": {
    "total_count": { "sum": { "field": "count" } }
  }
}
'

What might be the cause of this discrepancy in the total counts? Shouldn't the transformed index have the same total as the original index, since it does a sum aggregation on the "count" field?


You configured your transform as a so-called batch transform, which means it only runs once. What you want is continuous mode; for that you need to configure the sync object in your configuration.
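For example, you would add something like this to the transform config (the 60s is just a placeholder; see below for how to choose it):

"sync": {
  "time": {
    "field": "time",
    "delay": "60s"
  }
}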

Note the delay parameter, which configures the ingest delay. Continuous mode runs in real time, but data does not arrive in real time: it takes time until a data point with a certain value for time is searchable. This might be just a few seconds, but it can also be minutes. The right value for delay depends on how you ingest your time series data.

If you do the comparison live index vs. transformed index again, you need to take delay into account, e.g. by using now-delay as the upper bound of a range query on time.
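With the placeholder delay of 60s from above, the sanity check against the source index would look something like this:

curl -X POST "localhost:9200/datalog.gl_sep20/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "query": { "range": { "time": { "lte": "now-60s" } } },
  "aggs": {
    "total_count": { "sum": { "field": "count" } }
  }
}
'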

A continuous transform by definition runs behind the source. If you want to know how far behind the transform currently is, take a look at the _stats API.
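For your transform that would be:

curl -X GET "localhost:9200/_transform/transform_job/_stats?pretty"

The checkpointing section of the response shows how far behind the source the transform currently is (e.g. operations_behind).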

Hi Hendrik, thanks for your prompt response! I use a batch transform because my source index is static, meaning there will not be any more changes to it, so I just need to transform it once into a new index. I am not sure what the motivation for using a continuous transform would be in my case?

If it is a batch transform, shouldn't the count values in the transformed index add up to the total count in the original index?

Null domains or OS?

OK, I misunderstood; I thought you were still feeding data into it.

In this case, I guess you might have some sparse data for domain and os. I suggest trying your test again, but with a transform that lacks the two terms group_by entries. If that yields equal counts, it means domain and os do not always have a value.
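A quick way to check this directly, without rerunning the whole transform, is to sum count over the documents that are missing one of the fields, e.g. for domain (and likewise for os):

curl -X POST "localhost:9200/datalog.gl_sep20/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "domain" } }
    }
  },
  "aggs": {
    "missing_domain_total": { "sum": { "field": "count" } }
  }
}
'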

By default, missing values are ignored in group_by. If you use 7.10 you can add "missing_bucket": true to your terms group_by to get groupings for them (with key null).

So if some documents have null values in those fields, they wouldn't be transformed?
For example, if I have a few documents in the original index:

{"domain":"", "os":"linux", "count":1} 
{"domain":"www.example.com", "os":"linux", "count":1}
{"domain":"www.example.com", "os":"linux", "count":1}

So the first document, which contains a null, wouldn't be transformed? I'm expecting the transformed index to have 2 documents like this:

{"domain":"", "os":"linux", "count":1} 
{"domain":"www.example.com", "os":"linux", "count":2}

I tried out what you mentioned, and indeed there is some sparse data, thanks a lot! Could you elaborate on how to add "missing_bucket": true? I couldn't see this attribute in the create transform documentation.

Is it something like this?

curl -X PUT "localhost:9200/_transform/transform_job?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "datalog.gl_sep20"
  },
  "dest" : { 
    "index" : "datalog.transformed_sep20"
  },
  "pivot": {
    "group_by": { 
      "domain": { "terms": { "field": "domain" , "missing_bucket":true} },
      "os": { "terms": { "field": "os", "missing_bucket":true }},
      "time": { "date_histogram": { "field": "time", "fixed_interval": "24h"}}
    },
    "aggregations": {
      "count": { "sum": { "field": "count" }}
    }
  }
}
'

Yes, that looks good.

Transform uses composite aggregations and the corresponding syntax.
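I.e. your group_by corresponds, roughly, to composite sources like the following (this is a plain search, just to illustrate where the syntax comes from; the composite aggregation itself has had missing_bucket for a while, it just wasn't exposed in the transform config before 7.10):

curl -X POST "localhost:9200/datalog.gl_sep20/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "groups": {
      "composite": {
        "sources": [
          { "domain": { "terms": { "field": "domain", "missing_bucket": true } } },
          { "os": { "terms": { "field": "os", "missing_bucket": true } } },
          { "time": { "date_histogram": { "field": "time", "fixed_interval": "24h" } } }
        ]
      },
      "aggs": {
        "count": { "sum": { "field": "count" } }
      }
    }
  }
}
'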

FWIW, the alternative to using missing_bucket is using a script (available >= 7.7.0). With a script you can give the missing values a custom output key.

Our documentation is unfortunately not fully correct when it comes to the options you have in transform. I will follow up internally to improve the situation.

Is missing_bucket only available from version 7.10.0? I'm on version 7.7.0 and it returns this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[1:38] [data_frame_terms_group] unknown field [missing_bucket]"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[14:5] [data_frame_transform_config] failed to parse field [pivot]",
    "caused_by" : {
      "type" : "x_content_parse_exception",
      "reason" : "[14:5] [data_frame_transform_pivot] failed to parse field [group_by]",
      "caused_by" : {
        "type" : "x_content_parse_exception",
        "reason" : "[1:38] [data_frame_terms_group] unknown field [missing_bucket]"
      }
    }
  },
  "status" : 400
}

I saw one of your pull requests at https://github.com/elastic/elasticsearch/pull/60390, so I assumed it works for all versions of 7.x?

Yes, missing_bucket requires >= 7.10.0; however, you can use a script (>= 7.7.0), e.g.

"script": { 
            "source": "if (doc['domain'].value == null) {return "unknown_domain";}return doc['domain'].value;",
            "lang": "painless"
          }

7.x means the latest dev branch in the 7 series at the time that PR was created (it's the name of a branch, not a version). I can't go into detail about how we work (you might find more info on GitHub). What's relevant: that PR is a "backport"; it references the master PR: [Transform] add support for missing bucket by hendrikmuhs · Pull Request #59591 (elastic/elasticsearch on GitHub).

This PR contains the relevant version label: 7.10.0.

Thanks for your help so far! I'm trying to use a script, but I keep getting some JSON parsing errors:

curl -X PUT "localhost:9200/_transform/transform_job?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "datalog.gl_sep20"
  },
  "dest" : { 
    "index" : "datalog.transformed_sep20"
  },
  "pivot": {
    "group_by": { 
      "domain": { "terms": { "script": { 
                    "source": """
                      if (doc['domain.keyword'].value == null) {
                        return "unknown_domain";}
                      return doc['domain.keyword'].value;
                    """,
                    "lang": "painless"
          } } },
      "os": { "terms": { "field": "os"}},
      "time": { "date_histogram": { "field": "time", "fixed_interval": "24h"}}
    },
    "aggregations": {
      "count": { "value_count": { "field": "count" }}
    }
  }
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[12:23] [data_frame_transform_pivot] failed to parse field [group_by]"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[12:23] [data_frame_transform_config] failed to parse field [pivot]",
    "caused_by" : {
      "type" : "x_content_parse_exception",
      "reason" : "[12:23] [data_frame_transform_pivot] failed to parse field [group_by]",
      "caused_by" : {
        "type" : "json_parse_exception",
        "reason" : "Unexpected character ('\"' (code 34)): was expecting comma to separate Object entries\n at [Source: (org.elasticsearch.transport.netty4.ByteBufStreamInput); line: 12, column: 26]"
      }
    }
  },
  "status" : 400
}

What might be the reason?

Triple quotes are a convenience in the Kibana Dev Tools console and are translated into legal JSON when submitted. Raw JSON doesn't support the triple-quote syntax, so you have to mess around with escaping any quotes that might be inside your string.
curl (as far as I know) doesn't do the same thing with triple quotes. In Kibana you can use the "Copy as cURL" function to do that escaping translation.
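Another way to sidestep the shell-escaping pain is to keep the request body in a file and let curl read it (transform.json here is just a file name chosen for the example):

curl -X PUT "localhost:9200/_transform/transform_job?pretty" -H 'Content-Type: application/json' --data-binary "@transform.json"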

I see, sorry to keep bothering you guys, but I am still facing some errors. I've escaped the string literals:

curl -X PUT "localhost:9200/_transform/transform_job?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "datalog.gl_sep20"
  },
  "dest" : { 
    "index" : "datalog.transformed_sep20"
  },
  "pivot": {
    "group_by": { 
      "domain": { "terms": { "script": { "source": "if (doc[\u0027domain\u0027].value == null) {return \"unknown_domain\";}return doc[\u0027domain\u0027].value;","lang":"painless"} } },
      "os": { "terms": { "field": "os"}},
      "time": { "date_histogram": { "field": "time", "fixed_interval": "24h"}}
    },
    "aggregations": {
      "count": { "value_count": { "field": "count" }}
    }
  }
}
'

However, I am getting an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "org.elasticsearch.index.fielddata.ScriptDocValues$Strings.get(ScriptDocValues.java:496)",
          "org.elasticsearch.index.fielddata.ScriptDocValues$Strings.getValue(ScriptDocValues.java:503)",
          "if (doc['domain'].value == null) {",
          "                 ^---- HERE"
        ],
        "script" : "if (doc['domain'].value == null) {return \"unknown_domain\";}return doc['domain'].value;",
        "lang" : "painless",
        "position" : {
          "offset" : 17,
          "start" : 0,
          "end" : 34
        }
      }
    ],
    "type" : "status_exception",
    "reason" : "Failed to validate configuration",
    "caused_by" : {
      "type" : "status_exception",
      "reason" : "Failed to test query",
      "caused_by" : {
        "type" : "search_phase_execution_exception",
        "reason" : "all shards failed",
        "phase" : "query",
        "grouped" : true,
        "failed_shards" : [
          {
            "shard" : 0,
            "index" : "datalog.gl_sep20",
            "node" : "DzWa5139RgOmiWqLoGUhng",
            "reason" : {
              "type" : "script_exception",
              "reason" : "runtime error",
              "script_stack" : [
                "org.elasticsearch.index.fielddata.ScriptDocValues$Strings.get(ScriptDocValues.java:496)",
                "org.elasticsearch.index.fielddata.ScriptDocValues$Strings.getValue(ScriptDocValues.java:503)",
                "if (doc['domain'].value == null) {",
                "                 ^---- HERE"
              ],
              "script" : "if (doc['domain'].value == null) {return \"unknown_domain\";}return doc['domain'].value;",
              "lang" : "painless",
              "position" : {
                "offset" : 17,
                "start" : 0,
                "end" : 34
              },
              "caused_by" : {
                "type" : "illegal_state_exception",
                "reason" : "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!"
              }
            }
          }
        ]
      }
    }
  },
  "status" : 503
}

I'm pretty sure my source index has the field 'domain'. I've double-checked and that's the case.

PS: Also, what's the difference between a single quote ' ' and a double quote " " in Elasticsearch?

It stumbles upon a data point that has no doc value. Instead of checking for null, I suggest if (doc['domain'].size() == 0).
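A sketch of your domain group_by with that change (keeping the \u0027 escaping from your last attempt, so it can go straight into the curl body):

"domain": { "terms": { "script": {
    "source": "if (doc[\u0027domain\u0027].size() == 0) {return \"unknown_domain\";} return doc[\u0027domain\u0027].value;",
    "lang": "painless"
} } }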

Hi @Hendrik_Muhs, it works, now my sum adds up! Thanks so much for your detailed explanation and help, really appreciate it! I've been stuck on this for quite a while and you just helped me solve it 🙂 Thanks @Mark_Harwood too! You guys rock!

Once I'm done with my work on the transform API, I think I will write a brief tutorial on Medium about transforms; hopefully it will help others the way you guys helped me 🙂


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.