Docs are skipped either because an included field has a missing value and the analysis doesn't support missing values (i.e. outlier_detection), or because a field has a value the job cannot analyze (e.g. an array with more than one element).
You can also use the Explain data frame analytics API to see exactly which fields are included in the analysis, and then check your source index for included fields that contain missing values (this only matters if you're using outlier_detection) or arrays.
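As a rough illustration of that check, here is a minimal Python sketch. It is not Elasticsearch code: `skip_reasons` is a hypothetical helper, and it assumes you have already pulled the list of included field names from the Explain response and fetched some source documents as flat dicts (no nested objects).

```python
from typing import Any, Dict, List


def skip_reasons(
    doc: Dict[str, Any],
    included_fields: List[str],
    supports_missing: bool,
) -> List[str]:
    """Return the reasons a document might be skipped (hypothetical helper).

    Assumes `doc` is a flat dict of field name -> value, and that a field
    which is absent or None counts as a missing value.
    """
    reasons = []
    for field in included_fields:
        value = doc.get(field)
        if value is None:
            # Missing values only matter when the analysis cannot handle
            # them (outlier_detection, per the explanation above).
            if not supports_missing:
                reasons.append(f"{field}: missing value")
        elif isinstance(value, (list, tuple)) and len(value) > 1:
            # Arrays with more than one element cannot be analyzed.
            reasons.append(f"{field}: array with {len(value)} elements")
    return reasons
```

For an outlier_detection job you would call it with `supports_missing=False`; any document that comes back with a non-empty list is a candidate for having been skipped.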
Thanks @dmitri. I resolved the issue, and I believe it was caused by the number of analyzed fields in the documents.
My documents had 1,260 analyzed fields: one field for each 2-character permutation of the string 0123456789abcdefghijklmnopqrstuvwxyz. Each field was a numerical count of the bigram (specifically a long), which is a valid data type for the job. All of those documents were skipped. When I reduced the number of analyzed fields to 36 (one field for each single character in the string), none of the documents were skipped and the job proceeded.
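For anyone trying to reproduce this, here is a minimal sketch of how the two field sets can be generated. The names `BIGRAM_FIELDS`, `UNIGRAM_FIELDS`, and `bigram_counts` are illustrative only, not my actual ingest code. Note that 2-character permutations exclude doubled characters such as "aa", which is why the count is 36 × 35 = 1,260 rather than 36² = 1,296.

```python
from collections import Counter
from itertools import permutations
from typing import Dict

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

# One field per ordered pair of distinct characters: 36 * 35 = 1,260 fields.
BIGRAM_FIELDS = ["".join(pair) for pair in permutations(ALPHABET, 2)]

# One field per single character: 36 fields.
UNIGRAM_FIELDS = list(ALPHABET)


def bigram_counts(text: str) -> Dict[str, int]:
    """Count each bigram field's occurrences in `text` (illustrative helper)."""
    # Slide a 2-character window over the text and tally every pair seen.
    seen = Counter(text[i:i + 2] for i in range(len(text) - 1))
    # Emit a value for every field so no document has a missing value.
    return {field: seen.get(field, 0) for field in BIGRAM_FIELDS}
```

Every document built this way has an integer value for all 1,260 fields, so missing values and multi-element arrays were not the cause in my case.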