Understanding skipped documents in Data Frame Analytics

davemoore · June 25, 2020, 9:44pm

When running a Data Frame Analytics job in 7.8.0, documents with unsupported values are skipped (source).

Suppose I run a job that fails after the loading_data phase because every document was skipped, and the data_counts of the job is reported as:

{"training_docs_count":0,"test_docs_count":0,"skipped_docs_count":1000}

Is there a way to troubleshoot why a document was skipped during the job?

dmitri · June 27, 2020, 8:17am

Hi Dave,

Docs are skipped either because included fields have missing values and the analysis doesn't support missing values (i.e. outlier_detection), or because they have a value the job cannot analyze (e.g. an array with more than one element).

You can also use the Explain API to see exactly which fields are included in the analysis, and then check your source index for included fields that contain missing values (only relevant if you're using outlier_detection) or arrays.
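
As a rough illustration of that check, here is a minimal Python sketch. It assumes the Explain API response contains a field_selection array whose entries carry name and is_included keys (please verify against the API docs for your version). The sample response, the field names, and the helpers included_fields, get_value and skip_reasons are all hypothetical. Treating an empty array as a missing value is also an assumption.

```python
from __future__ import annotations

from typing import Any

# Illustrative stand-in for an Explain API response; the field names are made up.
SAMPLE_EXPLAIN = {
    "field_selection": [
        {"name": "count_ab", "is_included": True},
        {"name": "count_ba", "is_included": True},
        {"name": "label", "is_included": False},
    ]
}


def included_fields(explain_response: dict) -> list[str]:
    """Return the names of the fields marked as included in the analysis."""
    return [
        f["name"]
        for f in explain_response.get("field_selection", [])
        if f.get("is_included")
    ]


def get_value(doc: dict, field: str) -> Any:
    """Look up a field by flat key first, then by walking dotted sub-objects."""
    if field in doc:
        return doc[field]
    current: Any = doc
    for part in field.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def skip_reasons(doc: dict, fields: list[str], supports_missing: bool = False) -> dict[str, str]:
    """Map each problematic included field to the reason it may cause a skip."""
    reasons: dict[str, str] = {}
    for field in fields:
        value = get_value(doc, field)
        if isinstance(value, list) and len(value) > 1:
            reasons[field] = "array with more than one element"
        elif value is None or value == []:
            # Assumption: an empty array is treated like a missing value.
            if not supports_missing:
                reasons[field] = "missing value"
    return reasons


fields = included_fields(SAMPLE_EXPLAIN)
doc = {"count_ab": [1, 2], "label": "x"}
print(skip_reasons(doc, fields))
```

You would feed it documents pulled from the source index (for example via a search), and pass supports_missing=True for analyses that accept missing values.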

davemoore · June 30, 2020, 4:28pm

Thanks @dmitri. I resolved the issue, and I believe it was caused by the number of analyzed fields in the documents.

My documents had 1,260 analyzed fields: one field for each 2-character permutation of the string 0123456789abcdefghijklmnopqrstuvwxyz. Each field was a numerical count of the bigram (specifically a long), which was a valid data type for the job. All of those documents were skipped. When I reduced the number of analyzed fields to 36 (one field for each single character in the string), none of the documents were skipped and the job proceeded.
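
For anyone wanting to reproduce the two feature sets, the field counts above can be checked with a short Python sketch. Using the bigram or character itself as the field name is an assumption made for illustration, and count_features is a hypothetical helper, not the code used for the original job.

```python
from itertools import permutations

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

# Ordered pairs of distinct characters: 36 * 35 = 1260 fields.
bigram_fields = ["".join(pair) for pair in permutations(ALPHABET, 2)]

# One field per single character: 36 fields.
unigram_fields = list(ALPHABET)


def count_features(text: str, fields: list[str]) -> dict[str, int]:
    """Count (possibly overlapping) occurrences of each field's n-gram in text."""
    counts = {}
    for ngram in fields:
        n = len(ngram)
        counts[ngram] = sum(
            1 for i in range(len(text) - n + 1) if text[i:i + n] == ngram
        )
    return counts


print(len(bigram_fields), len(unigram_fields))
print(count_features("abab", ["ab", "ba"]))
```

Note that permutations excludes doubled characters such as "aa", which is why the total is 1,260 rather than 1,296.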
