Understanding skipped documents in Data Frame Analytics

When running a Data Frame Analytics job in 7.8.0, documents with unsupported values are skipped (source).

Suppose I run a job that fails after the loading_data phase because every document was skipped, and the data_counts of the job is reported as:

{"training_docs_count":0,"test_docs_count":0,"skipped_docs_count":1000}

Is there a way to troubleshoot why a document was skipped during the job?

Hi Dave,

Docs are skipped either because included fields have missing values and the analysis doesn't support missing values (e.g. outlier_detection), or because they have a value the job cannot analyze (e.g. an array with more than one element).

You can use the Explain API to see exactly which fields are included in the analysis, and then check your source index for which of those fields contain missing values (only relevant if you're using outlier_detection) or arrays.
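As a rough illustration of that check, here is a minimal Python sketch that applies the two skip conditions described above to documents already pulled out of the source index. It is not the logic Elasticsearch itself runs. The function names (`get_path`, `skip_reasons`, `summarize`) and the `allow_missing` flag are made up for this example, and the list of included field names is assumed to have been copied by hand from the Explain API output.

```python
from collections import Counter


def get_path(doc, dotted):
    """Return the value at a dotted field name, or None if it is absent.

    Handles both a literal dotted key ("a.b") and nested objects ({"a": {"b": ...}}).
    """
    if dotted in doc:
        return doc[dotted]
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def skip_reasons(doc, included_fields, allow_missing=False):
    """List (field, reason) pairs that would plausibly cause this doc to be skipped.

    allow_missing is a hypothetical switch: set it to True for analyses that
    tolerate missing values, False for ones that do not (e.g. outlier_detection).
    """
    reasons = []
    for field in included_fields:
        value = get_path(doc, field)
        if value is None:
            if not allow_missing:
                reasons.append((field, "missing"))
        elif isinstance(value, list) and len(value) > 1:
            reasons.append((field, "array with more than 1 element"))
    return reasons


def summarize(docs, included_fields, allow_missing=False):
    """Count how often each (field, reason) pair occurs across all docs."""
    counts = Counter()
    for doc in docs:
        counts.update(skip_reasons(doc, included_fields, allow_missing))
    return counts


if __name__ == "__main__":
    # Made-up sample documents, not data from the original job.
    sample = [{"a": 1, "b": 2}, {"a": [1, 2], "b": 3}, {"a": 1}]
    for (field, reason), n in summarize(sample, ["a", "b"]).items():
        print(f"{field}: {reason} in {n} doc(s)")
```

Running `summarize` over a sample of the source documents gives a per-field tally, which should make it quicker to see whether one field accounts for most of the skips.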

Thanks @dmitri. I resolved the issue, and I believe it was caused by the number of analyzed fields in the documents.

My documents had 1,260 analyzed fields: one field for each 2-character permutation of the string 0123456789abcdefghijklmnopqrstuvwxyz. Each field was a numerical count of the bigram (specifically a long), which is a valid data type for the job. All of those documents were skipped. When I reduced the number of analyzed fields to 36 (one field for each single character in the string), none of the documents were skipped and the job proceeded.
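For anyone checking the arithmetic, the field counts can be reproduced with a short Python sketch. The helper names are made up, and using the bigram itself as the field name is an assumption, since the actual field names in the index are not shown here.

```python
from itertools import permutations

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # 36 characters


def bigram_fields(alphabet=ALPHABET):
    """One hypothetical field name per ordered pair of two distinct characters."""
    return ["".join(pair) for pair in permutations(alphabet, 2)]


def unigram_fields(alphabet=ALPHABET):
    """One hypothetical field name per single character."""
    return list(alphabet)


if __name__ == "__main__":
    print(len(bigram_fields()))   # 36 * 35 ordered pairs
    print(len(unigram_fields()))  # 36 single characters
```

Permutations exclude doubled characters such as "aa", which is why the count is 36 × 35 = 1,260 rather than 36 × 36 = 1,296.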
