Dropping Documents on Hive Import with elasticsearch-hadoop-5.4.0

kpaAZ · May 18, 2017, 7:01pm

We have a fairly simple Hive pipeline set up for basic reporting. We recently updated both the ES adapter and the ES cluster.

We've been trying to run an import job from ES (5.3.0, 3 data nodes) to Hive (Hive 1.1.0-cdh5.7.0) using the elasticsearch-hadoop-5.4.0 adapter but have noticed that multiple documents are missing. Previously, we were using version 2.3.2 of the adapter with version 1.7 of ES (also 3 data nodes).

STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '${esIndex}', 
   'es.nodes' = '${esHost}',
   'es.scroll.size' = '2000',
   'es.mapping.names' = '${esMappings}',
   'es.nodes.wan.only' = 'false',
   'es.query' = '?q=${esQuery}');

Running the query for a small subset of the import directly against our ES instance returns all expected documents. When I run a simple select query against the Hive table, between 1 and 10 of the expected 32 documents are missing.

ADD JAR hdfs:///path/to/jars/elasticsearch-hadoop-5.4.0.jar;

select * from myDB.myTable where ;

Additionally, the missing documents are not consistent. If I run the Hive query multiple times, different documents are missing each time.

james.baiera · May 20, 2017, 6:46pm

Could you enable TRACE level logging for the org.elasticsearch.hadoop.rest.commonshttp package? This should log each request the connector sends to Elasticsearch and each response it receives. After doing so, could you check the logs to see whether all the expected documents are making it back across the wire? That will show whether the problem lies with the connector or with Elasticsearch.

kpaAZ · May 31, 2017, 8:37pm

We've gone ahead and set the logging level to TRACE for that package. Unfortunately, the import itself pulls approximately 3.8 million documents into a secondary Parquet table, which I then query, so it is difficult to verify what was dropped.

Something I have noticed, and which seems to have corrected my issue, is adjusting es.input.max.docs.per.partition. I changed this from the default of 100000 to 500000 and so far no longer see any dropped documents. I'm relatively new to this adapter and to ES, and I'm not sure why this seems to have resolved the issue. Additionally, if this is the correct fix, I'm curious about what might happen in the future as the number of records to import grows.
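
For readers who want to try the same workaround, here is a minimal sketch of where the setting could go, assuming it is supplied as a table property alongside the other es.* options from the first post. The table name and the two columns are hypothetical placeholders; only the STORED BY clause, the ${...} variables, and the property keys and values come from this thread. As far as the connector documentation describes it, this property advises how each shard is divided into input partitions when scroll slicing is available (Elasticsearch 5.0 and above), so a larger value should result in fewer slices per shard. It is described as a suggestion rather than a guarantee, and nothing in this thread confirms why the change made the dropped documents disappear.

```sql
-- Sketch only: myDB.myEsTable and its columns (id, message) are hypothetical.
-- The es.* keys and ${...} variables are copied from the table definition above.
-- The last property raises es.input.max.docs.per.partition from 100000 to 500000.
CREATE EXTERNAL TABLE myDB.myEsTable (
  id STRING,
  message STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '${esIndex}',
   'es.nodes' = '${esHost}',
   'es.scroll.size' = '2000',
   'es.mapping.names' = '${esMappings}',
   'es.nodes.wan.only' = 'false',
   'es.query' = '?q=${esQuery}',
   'es.input.max.docs.per.partition' = '500000');
```

Running the same select several times after the change and comparing the row count with the count Elasticsearch returns for the same query would be one way to check whether the result has become stable.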

system · June 28, 2017, 8:38pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
