We have a fairly simple Hive pipeline set up for basic reporting. We recently updated both the ES adapter and the ES cluster.
We've been trying to run an import job from ES (5.3.0, 3 data nodes) to Hive (Hive 1.1.0-cdh5.7.0) using the elasticsearch-hadoop-5.4.0 adapter, but have noticed that multiple documents are missing. Previously, we were using version 2.3.2 of the adapter with version 1.7 of ES (also 3 data nodes).
I've been hitting our ES instance directly with the query for a small subset of the import, and I get all of the expected documents. When I run a simple select query against the Hive table, between 1 and 10 of the 32 expected documents are missing.
ADD JAR hdfs:///path/to/jars/elasticsearch-hadoop-5.4.0.jar;
select * from myDB.myTable where ;
Additionally, the missing documents are not consistent: if I run the Hive query multiple times, different documents are missing each time.
Could you enable TRACE level logging for the org.elasticsearch.hadoop.rest.commonshttp package? This should bring up each request to and response from Elasticsearch that the connector sends. After doing so, could you check to see if all the expected documents are making it back across the wire in the logs? This will highlight if it is a problem with the connector or with Elasticsearch.
We've gone ahead and set the logging level to TRACE for that package. Unfortunately, the import itself pulls approximately 3.8 million documents into a secondary Parquet table, which I then query, so it is difficult to verify what was dropped.
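One way to narrow down which documents were dropped at this volume is an anti-join against a list of IDs exported directly from ES. This is only a rough sketch: `myDB.expected_ids`, `myDB.myTable_parquet`, and the `doc_id` column are hypothetical names, and it assumes both tables carry the document ID as a regular column.

```sql
-- Sketch only: myDB.expected_ids, myDB.myTable_parquet and doc_id are
-- hypothetical names, not objects from the actual pipeline.
-- Returns every expected ID that has no matching row in the imported table.
SELECT e.doc_id
FROM myDB.expected_ids e
LEFT OUTER JOIN myDB.myTable_parquet p
  ON e.doc_id = p.doc_id
WHERE p.doc_id IS NULL;
```

Running this after each import and comparing the results would show whether the missing set changes between runs.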
One thing I have noticed, and which seems to have corrected my issue, is adjusting es.input.max.docs.per.partition. I changed it from the default of 100000 to 500000 and have not seen any dropped documents so far. I'm relatively new to this adapter and to ES, and I'm not entirely sure why this seems to have resolved the issue. Additionally, if this is the correct fix, I'm curious about what might happen in the future as the number of records to import grows.
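For reference, here is a minimal sketch of how the setting can be applied per table through `TBLPROPERTIES`. The table name `myDB.myTable_es`, the columns, the index/type in `es.resource`, and the host in `es.nodes` are placeholders for illustration, not the actual definition used in this pipeline.

```sql
-- Sketch only: table name, columns, es.resource and es.nodes are placeholders.
ADD JAR hdfs:///path/to/jars/elasticsearch-hadoop-5.4.0.jar;

CREATE EXTERNAL TABLE myDB.myTable_es (
  id STRING,
  created_at TIMESTAMP
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'my_index/my_type',
  'es.nodes' = 'es-host:9200',
  -- Raise the per-partition document cap from the default of 100000.
  'es.input.max.docs.per.partition' = '500000'
);
```

As far as I understand the connector documentation, this value advises how many input slices each shard is divided into when reading from ES 5.x, so a larger value means fewer, larger partitions. That would explain why changing it alters the read behavior, though not why documents were dropped at the default.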