Transform doesn't recognize all the documents

Hi,
I created a new transform and it doesn't recognize all the documents in the source index.

The source index name is user_data.
The destination index name is session_data.

The transform is quite simple:
runs on the last 10m,
group by: session_id, domain
aggregations: minDate (min of date_time), maxDate (max of date_time), page (value_count of page)

sync.time.field: date_time
sync.time.delay: 120s

In other words, the user_data index holds the user's data and must contain session_id;
the transform should give me the min date, max date, and the number of pages grouped by session_id.

Now, if we get the cardinality of session_id with precision_threshold = 40000 (on data from two hours ago to one hour ago) on the source index, and run the same query on the destination index, we should get almost the same number of session_ids. But I get 55,183 in the source index and only 30,002 in the destination index, which is a big difference.
I thought maybe it's because cardinality doesn't return an exact count, but I found session_ids that exist in the source index and not in the destination index, which means the transform didn't process some of the documents in the source index.
And the transform details tab shows index_failures = 0.

Any ideas? :slight_smile:

I suggest you run an experiment with a batch transform: clone the job, but don't use continuous mode, and use a static range query instead of the last 10m. Are you still getting count mismatches? If not, the problem is somehow related to continuous mode.
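
For illustration, a minimal sketch of what such a batch clone could look like (the transform id, destination index name, and date range below are placeholders; the group_by and aggregations mirror the ones you described). Omitting the sync section is what makes it a batch transform:

PUT _transform/user_sessions_batch
{
  "source": {
    "index": "user_data",
    "query": {
      "range": {
        "date_time": {
          "gte": "2020-06-01T00:00:00",
          "lte": "2020-06-01T01:00:00"
        }
      }
    }
  },
  "dest": {
    "index": "session_data_batch"
  },
  "pivot": {
    "group_by": {
      "session_id": { "terms": { "field": "session_id" } },
      "domain": { "terms": { "field": "domain" } }
    },
    "aggregations": {
      "minDate": { "min": { "field": "date_time" } },
      "maxDate": { "max": { "field": "date_time" } },
      "page": { "value_count": { "field": "page" } }
    }
  }
}

After creating it, start it with POST _transform/user_sessions_batch/_start; without a sync block it runs once over the static range and then stops, so the counts are easy to compare.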

Note that in continuous mode there is no need to limit to the last 10m, because the query limits the search according to the last checkpoint. (The only reason to still use the last 10m would be if your session ids are not truly unique.)

Another reason the counts could differ is your grouping: does domain always have a value, or can it be empty/missing? group_by only creates a bucket if both fields are non-empty (support for missing_bucket is on the list of future enhancements).

Sorry for the late response.

I tried a batch transform as you suggested and here are the results:

I suggest you run an experiment with a batch transform: clone the job, but don't use continuous mode, and use a static range query instead of the last 10m. Are you still getting count mismatches? If not, the problem is somehow related to continuous mode.

This was one reason I got count mismatches:
I had documents with an empty/missing domain field.

But apparently it's not the only reason: I ran a batch transform and didn't get count mismatches.
Then I ran the same transform in continuous mode and got a count mismatch, so, as you said, the problem is somehow related to continuous mode.

And I think I misunderstood something about limiting to the last 10m. I ran the continuous transform without that limit and the cluster went down. I think the reason is that the transform started to process all the data in the index, more than 300 million documents (the source index is actually an alias pointing to multiple indices). That is why I limited it to the last 10m; after the first checkpoint the 10m limit becomes irrelevant, since it is in continuous mode.

My suggestion was, instead of using now-10m, to use something like

"gte": "2020-06-01T00:00:00"

(depending on the mapping of your timestamp field, you might have to specify a format)

Later, after the transform has started, you can remove the query with the _update API. This is a known usability problem; we plan to make it easier at some point in the future. If you are interested in transforming all your historic data: 7.8 brings throttling to reduce the load on the cluster. This will slow down the transform but prevent performance issues on the cluster.
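
A sketch of such an update call (the transform id is a placeholder; I believe supplying source replaces the whole source object, so the index should be included again alongside the new query):

POST _transform/my_transform/_update
{
  "source": {
    "index": "user_data",
    "query": {
      "match_all": {}
    }
  }
}

Swapping the range query for match_all this way removes the now-10m restriction without recreating the transform.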

Have you tried increasing the delay? Note that the problem might also be sporadic, e.g. runtime spikes. An alternative would therefore be to move the timestamp setter closer to the end of your Logstash or ingest pipeline.

Regarding your null values: you can use a set processor with override = false to avoid missing buckets; Logstash has similar functionality.
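
A minimal sketch of such an ingest pipeline (the pipeline name and the default value "none" are placeholders). With override set to false, the set processor only writes the value when the field is missing or null, so documents that already have a domain are left untouched:

PUT _ingest/pipeline/default_domain
{
  "description": "Fill in a default domain for documents that lack one",
  "processors": [
    {
      "set": {
        "field": "domain",
        "value": "none",
        "override": false
      }
    }
  ]
}

Documents indexed through this pipeline will then always produce a group_by bucket for domain.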

Can you also let me know which version you are using?

Regarding the null values, I set 'none' as the default value, so now I don't really have empty/missing values.

I did what you suggested with a 15m delay; unfortunately, I'm still getting the count mismatches.

It's definitely something related to continuous mode, since the same transform as a batch gives accurate results, without count mismatches.

We're on 7.7 and should upgrade to 7.8 today.

I am a bit out of ideas. In order to debug this issue, it might help to turn on trace logging for the transform:

PUT /_cluster/settings
{
   "transient": {
      "logger.org.elasticsearch.xpack.transform.transforms": "trace"
   }
}
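
Once you are done debugging, the logger can be reset to its default by assigning it null:

PUT /_cluster/settings
{
   "transient": {
      "logger.org.elasticsearch.xpack.transform.transforms": null
   }
}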

You configured the timestamp field date_time. What is this field based on? Is it the ingest timestamp?

Everything looks good in the logs.
The date_time field is not the ingest timestamp; it specifies when an action occurred. (There are more fields in the original transform, but I'm investigating the problem with fewer fields, as specified above, just to keep things simpler.)
I suspected the problem was the 3-minute delay, but I ran other transforms with 15-minute and 20-minute delays and got the same results.

However, I noticed strange behavior with this transform (same source and destination as above, but with different names):

{
  "id": "test_3",
  "source": {
    "index": [
      "alias_all_hits*"
    ],
    "query": {
      "range": {
        "date_time": {
          "gte": "2020-06-25T08:30:53"
        }
      }
    }
  },
  "dest": {
    "index": "test_3_session"
  },
  "sync": {
    "time": {
      "field": "date_time",
      "delay": "15m"
    }
  },
  "pivot": {
    "group_by": {
      "wz_session": {
        "terms": {
          "field": "wz_session"
        }
      },
      "domain": {
        "terms": {
          "field": "domain"
        }
      }
    },
    "aggregations": {
      "sessionPageViewsCount": {
        "value_count": {
          "field": "page"
        }
      },
      "minDate": {
        "min": {
          "field": "date_time"
        }
      }
    }
  },
  "version": "7.7.0",
  "create_time": 1593092653859
}

I ran it, and after an hour I checked whether I still get count mismatches using these two queries.

The first query gives the number of sessions in the source index:

GET alias_all_hits*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "date_time": {
              "gte": "2020-06-25T08:30:53",
              "lte": "2020-06-25T08:40:53"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "sessions": {
      "cardinality": {
        "field": "wz_session",
        "precision_threshold": 40000
      }
    }
  }
}

The second query gives the number of sessions in the destination index:

GET test_3_session/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "minDate": {
              "gte": "2020-06-25T08:30:53",
              "lte": "2020-06-25T08:40:53"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "sessions": {
      "cardinality": {
        "field": "wz_session",
        "precision_threshold": 40000
      }
    }
  }
}

Now notice that in the second query the gte is exactly the same as the transform's gte, and in this case I do NOT get count mismatches, no matter what the lte is.
But if I add even one second to the gte in the second query (and of course update the gte in the first query too), I get count mismatches.

I tried running another identical transform in continuous mode, replaced the range query with match_all by calling the transform _update API, and got count mismatches.

I will continue to do more tests and will update you.
By the way, in the logs I only see regular info and debug entries, and most of them are very large queries (partial, so I can't run them), but if any of these logs can help you understand what is happening, I can certainly paste them here.

Thanks for the help @Hendrik_Muhs,
I solved the problem.
Two things caused this issue:

  1. As you mentioned in your first reply, one of the group_by fields was missing in some cases.
  2. Something internal to my backend, not related to the transform.