Advanced job cannot run a 'real-time search' with X-Pack 5.4 machine learning

Hey guys,

When I create an advanced job and choose 'No end time (real-time search)', the process goes to sleep once the initial data load finishes. After about 5 minutes, data is still being pushed to Elasticsearch, but the process does no further work.

Please help me. Thanks!

Hello,

Could you please paste the job/datafeed configuration? It is a necessary first step to figure out what is going on.

Thank you,
Dimitris

Hi Dimitris,


The datafeed configuration is the default. When I set 'Query delay' and 'Frequency' to larger values, like 300s and 900s, the job keeps updating on every cycle. But the default configuration does not work, and it also shows me the warning 'Datafeed has been retrieving no data for a while'. The Elasticsearch index receives data in real time; I checked, and the index does have new records.

Thank you,
Ray

Hi Ray,

Thank you for this. In order to get the full picture, could you please paste the JSON from the JSON tab of the job?

Thank you,
Dimitris

Hi Dimitris,

{
    "job_id": "test4",
    "job_type": "anomaly_detector",
    "description": "test4",
    "create_time": 1496515790240,
    "analysis_config": {
        "bucket_span": "5m",
        "detectors": [
            {
                "detector_description": "high_sum(session_down) (test4)",
                "function": "high_sum",
                "field_name": "session_down",
                "detector_rules": []
            },
            {
                "detector_description": "high_sum(session_up) (test4)",
                "function": "high_sum",
                "field_name": "session_up",
                "detector_rules": []
            }
        ],
        "influencers": []
    },
    "data_description": {
        "time_field": "session_start_time",
        "time_format": "epoch_ms"
    },
    "model_snapshot_retention_days": 1,
    "model_snapshot_id": "1497965270",
    "results_index_name": "shared",
    "data_counts": {
        "job_id": "test4",
        "processed_record_count": 48131071,
        "processed_field_count": 96262142,
        "input_bytes": 3446702239,
        "input_field_count": 96262142,
        "invalid_date_count": 0,
        "missing_field_count": 0,
        "out_of_order_timestamp_count": 0,
        "empty_bucket_count": 0,
        "sparse_bucket_count": 305,
        "bucket_count": 71350,
        "earliest_record_timestamp": 1494049387000,
        "latest_record_timestamp": 1497964540000,
        "last_data_time": 1497965275678,
        "latest_empty_bucket_timestamp": 1497168300000,
        "latest_sparse_bucket_timestamp": 1497963600000,
        "input_record_count": 48131071
    },
    "model_size_stats": {
        "job_id": "test4",
        "result_type": "model_size_stats",
        "model_bytes": 85744,
        "total_by_field_count": 4,
        "total_over_field_count": 0,
        "total_partition_field_count": 3,
        "bucket_allocation_failures_count": 0,
        "memory_status": "ok",
        "log_time": 1497950997000,
        "timestamp": 1497651000000
    },
    "datafeed_config": {
        "datafeed_id": "datafeed-test4",
        "job_id": "test4",
        "query_delay": "60s",
        "frequency": "150s",
        "indexes": [
            "session*"
        ],
        "types": [
            "sessionmin_log"
        ],
        "query": {
            "match_all": {
                "boost": 1
            }
        },
        "scroll_size": 1000,
        "chunking_config": {
            "mode": "auto"
        },
        "state": "started"
    },
    "state": "opened",
    "node": {
        "id": "oBFK-X6pRd-TYm-WNa_BrA",
        "name": "xjtu-bigdata01",
        "ephemeral_id": "L5i9lVVnTEKnLE3KeHg73w",
        "transport_address": "172.16.0.11:9300",
        "attributes": {
            "ml.enabled": "true"
        }
    },
    "open_time": "1449747s"
}

I searched for the last record in the index: "session_start_time" => "2017-06-20T21:27:37.000+08:00", "total" => 48170388.
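For reference, the search I used was roughly the following (a sketch; sorting on the time field in descending order returns the newest record along with the total hit count):

POST session*/_search
{
    "size": 1,
    "sort": [
        { "session_start_time": "desc" }
    ]
}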

Thank you,
Ray

Hi Ray,

Thank you for that.

As a next step, we need to take a look at the logs. However, ML datafeeds do not log much on the info level. Thus, we need to enable trace logging for datafeeds. You can achieve that by editing the logging configuration file which can usually be found at /path-to-elasticsearch/config/log4j2.properties.

In there, could you please add the following 2 lines:

logger.datafeed.name = org.elasticsearch.xpack.ml.datafeed
logger.datafeed.level = trace

You will need to restart your cluster in order for this change to apply.
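Alternatively, if you would rather avoid a restart, the same logger can be set dynamically through the cluster settings API - a sketch, using the same package name as above:

PUT _cluster/settings
{
    "transient": {
        "logger.org.elasticsearch.xpack.ml.datafeed": "trace"
    }
}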

Once that is done, could you please repeat your test and paste the output from the elasticsearch.log file?

Thank you,
Dimitris

Another thing that would be very useful is to double-check that the time field you use in your data has the correct timezone information. Note that if you index a date field without timezone information, Elasticsearch assumes it is UTC. Thus, it is possible that the datafeed skips over that data, or that it will be running a few hours behind the data's timestamp.
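For example, both of the documents below (hypothetical values) would index successfully, but only the first carries an explicit offset; the second would be interpreted as UTC, which is an instant 8 hours later than the first:

{ "session_start_time": "2017-06-20T21:27:37.000+08:00" }
{ "session_start_time": "2017-06-20T21:27:37.000" }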

You could check that by running a search against your data where you add:

{
  ...
  "docvalue_fields": ["{your_time_field}"],
  ...
}
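For instance, with the index pattern and time field from your job configuration, a complete request could look like this (a sketch; the size and sort are just there to fetch the newest document):

POST session*/_search
{
    "size": 1,
    "sort": [
        { "session_start_time": "desc" }
    ],
    "docvalue_fields": ["session_start_time"]
}

The docvalue_fields value is rendered from the stored UTC milliseconds, so comparing it with the _source value of the same document makes a timezone mismatch easy to spot.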

If you could post an example document, it would be great to further understand what the issue is here.

Ray,

I see from your configuration that you are using session_start_time as your time field for the job. I'm not familiar with your data, but I can imagine a situation in which, when the actual document is inserted into Elasticsearch, this session_start_time field could reference a time a few minutes in the past (especially if the "sessions" that you're tracking last for more than a few minutes). If that's the case, then the settings currently being used for your ML job's datafeed may NEVER see your data. Here's an explanation as to why:

Let's say at 1:00:30 PM local time, a document is ingested into Elasticsearch, and the field session_start_time has a value that's equivalent to 12:49:00 PM. And let's pretend there's another field session_end_time that has a value of 1:00:00 PM (thus the session's duration was 11 minutes). I'm presuming that the document is written and ingested once the session is over. By this logic, it took 30 seconds from the document's creation on the device (at 1:00:00 PM) until it was ingested by Elasticsearch (at 1:00:30 PM). This is the ingest pipeline delay.

Let's now imagine a separate situation where, at 1:00:00 PM, I query Elasticsearch for documents where session_start_time is between 12:55:00 PM and 1:00:00 PM - a 5-minute bucket_span. If I did this, would the document described above even exist in Elasticsearch? The answer is no, as it will only be indexed 30 seconds from now. But if I delayed my search by 60s (to 1:01:00 PM), would I find it? Well, the document would indeed exist in Elasticsearch by that time (it has existed for 30 seconds at that point), but the query itself won't find it - because the query is asking for matches on values of session_start_time between 12:55:00 PM and 1:00:00 PM, and our document has a session_start_time of 12:49:00 PM. So it's already/instantly outside of the range of time we're looking for!
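To make the window concrete, the filter the datafeed effectively applies for that bucket is a range query on the job's time field, roughly like this (a sketch with illustrative timestamps; the real datafeed works in epoch milliseconds):

{
    "query": {
        "range": {
            "session_start_time": {
                "gte": "2017-06-20T12:55:00+08:00",
                "lt": "2017-06-20T13:00:00+08:00"
            }
        }
    }
}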

So, if this is the possible scenario, the way around this is to increase the query_delay parameter. The default value of 60s is not enough. In the scenario I describe above, where sessions can run on the order of 10-15 minutes, I may need to bump the query_delay parameter up to perhaps 20m. Then, at 1:10:00 PM, the query will run asking for matches of session_start_time between 12:45:00 PM and 12:50:00 PM - and it'll find my document, whose session_start_time is 12:49:00 PM.
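If that turns out to be the cause, the query_delay can be changed without recreating the job. A sketch using the X-Pack 5.4 ML APIs (the datafeed has to be stopped while it is updated; double-check the exact endpoints against the docs for your version):

POST _xpack/ml/datafeeds/datafeed-test4/_stop

POST _xpack/ml/datafeeds/datafeed-test4/_update
{
    "query_delay": "20m"
}

POST _xpack/ml/datafeeds/datafeed-test4/_start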

Hope that helps!

Hi Dimitris,
Thank you for the help. I guess the problem is the delayed time field.

I will keep checking this. ^ ^

Hi richcollier,
Thank you for the help - nice explanation. Now I understand how the job works. Awesome!
