Elasticsearch-Hadoop Integration Issue: cannot find class definition

Hi helpful community!

I am trying to load data from Hive into Elasticsearch through a JDBC driver connection. I was able to execute my create-index query and load-index query without errors. However, I then got Java errors saying a class definition could not be found, even though I confirmed the classes are in the jar files by browsing them in Windows Explorer. I've tried several different versions of the elasticsearch-hadoop jar, but none of them seem to work (please see the attached picture for all versions tried).

My setup is as follows: the Hive data is stored in JSON format in a Google bucket and accessed through a cluster on Google Cloud Platform. The Hive version is 2.3.7.

Create Elasticsearch index table query:

use temp;
add jar gs://.../elasticsearch-hadoop-7.16.2.jar;
set hive.aux.jars.path=gs://.../elasticsearch-hadoop-7.16.2.jar;
set hive.cli.print.header=true;

DROP TABLE IF EXISTS temp.salina_test_es;

CREATE EXTERNAL TABLE temp.salina_test_es(
    creation_ts         date               
    ,order_number        double                           
    ,customer_id         string               
    ,order_date          date               
) 
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
TBLPROPERTIES('es.resource' = 'test_index/transactions',
		'es.nodes'='xxx', 'es.port'='9200',
		'es.nodes.wan.only' = 'true', 'es.batch.write.retry.count'='-1',
		'es.batch.write.retry.wait'='2','es.bulk.size.bytes'='50',
		'es.bulk.size.entries'='200','es.index.auto.create' = 'true',
		'es.write.rest.error.handler.log.logger.level'='ERROR','es.write.rest.error.handlers'='log',
		'es.write.rest.error.handler.log.logger.name'='BulkErrors'); 

Load Elasticsearch index query:

add jar gs://.../elasticsearch-hadoop-7.16.2.jar;
set hive.aux.jars.path=gs://.../elasticsearch-hadoop-7.16.2.jar;
set hive.cli.print.header=true;

INSERT OVERWRITE TABLE temp.salina_test_es
SELECT date_format(creation_ts, 'yyyy-MM-dd') as creation_ts       
    ,cast(order_number as double) as order_number                        
    ,customer_id         string               
    ,date_format(order_date, 'yyyy-MM-dd') as order_date                 
FROM source_tbl;

[Screenshot: elasticsearch-hadoop jar versions tried; image not included]

Error

Error: java.lang.NoClassDefFoundError: org/elasticsearch/hadoop/thirdparty/apache/commons/httpclient/Wire (state=,code=0) 

Is there more to the log file that you could post? Or is there a full stack trace available? Sometimes a NoClassDefFoundError is misleading: the class it names may be present, and the real problem is that something org.elasticsearch.hadoop.thirdparty.apache.commons.httpclient.Wire is trying to pull in is missing.

For what it's worth, the only non-JVM classes pulled in by org.elasticsearch.hadoop.thirdparty.apache.commons.httpclient.Wire are in commons-logging, so it might be worth checking whether a commons-logging jar is on your classpath.
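One way to run that check is to split the classpath into one entry per line and search it for the jar name. The sketch below is only illustrative: `find_on_classpath` is a hypothetical helper name, and the usage line assumes the `hadoop` command is available on the node where Hive runs.

```shell
# Hypothetical helper: print every classpath entry whose path contains the pattern.
# $1 = fixed string to look for, $2 = colon-separated classpath
find_on_classpath() {
  printf '%s\n' "$2" | tr ':' '\n' | grep -F -- "$1"
}

# Example usage (assumes `hadoop` is on the PATH):
#   find_on_classpath commons-logging "$(hadoop classpath)"
```

If nothing is printed, no entry mentions commons-logging by name. Wildcard entries such as a `lib/*` directory are not expanded by this helper, so those directories would still need to be listed separately.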

Hi Keith,

Thank you for looking into my issue!
I reproduced the error, but the log is exactly the same as the one in my original post. However, I added commons-logging-1.1.3.jar, which wasn't in elasticsearch-hadoop-7.16.2.jar. The NoClassDefFoundError for Wire went away, but a new error appeared, shown below.

Error: java.lang.NoClassDefFoundError: org/elasticsearch/hadoop/thirdparty/apache/commons/httpclient/protocol/DefaultProtocolSocketFactory (state=,code=0)

Hmm, org.elasticsearch.hadoop.thirdparty.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory appears to be in the elasticsearch-hadoop jar, and there's nothing new pulled in by it. The only other thing that seems odd is that you're pulling the jar in with both "add jar" and "hive.aux.jars.path"; I don't think both of those are necessary. I just tried your code on Hive 2.3.2 and Elasticsearch 7.16.2 with both elasticsearch-hadoop-7.16.0.jar and elasticsearch-hadoop-8.1.0-SNAPSHOT.jar. I modified your steps a little bit, but not in ways that I would expect to matter:

add jar /opt/elasticsearch-hadoop-8.1.0-SNAPSHOT.jar;
set hive.cli.print.header=true;

CREATE EXTERNAL TABLE salina_test_es(
    creation_ts         date               
    ,order_number        double                           
    ,customer_id         string               
    ,order_date          date               
)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_index/transactions',
'es.nodes'='172.19.0.3', 'es.port'='9200',
'es.nodes.wan.only' = 'true', 'es.batch.write.retry.count'='-1',
'es.batch.write.retry.wait'='2','es.bulk.size.bytes'='50',
'es.bulk.size.entries'='200','es.index.auto.create' = 'true',
'es.write.rest.error.handler.log.logger.level'='ERROR','es.write.rest.error.handlers'='log',
'es.write.rest.error.handler.log.logger.name'='BulkErrors');

CREATE EXTERNAL TABLE source_tbl(
    creation_ts         date               
    ,order_number        double                           
    ,customer_id         string               
    ,order_date          date               
);

insert into source_tbl values ('2021-01-01', 5.3, '1234', '2021-02-02');

INSERT OVERWRITE TABLE salina_test_es
SELECT date_format(creation_ts, 'yyyy-MM-dd') as creation_ts       
,cast(order_number as double) as order_number                        
,customer_id string               
,date_format(order_date, 'yyyy-MM-dd') as order_date
FROM source_tbl;

That all succeeded, and then I could query the data out of Elasticsearch.

I forgot to add that I didn't have to add a commons-logging jar. It is getting picked up correctly from my Hive installation without any work on my part. I'm wondering if something is broken in your Hive or Hadoop installation. To get a lot more information about where classes are getting loaded from, you could try running the Hive shell like this:

HADOOP_CLIENT_OPTS=-verbose /opt/hive/bin/hive

That will print out where every single class is getting loaded from, which is incredibly verbose. But it could be useful.
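Because the output is so large, it may help to filter it down to the one class you care about. The sketch below assumes the Java 8 HotSpot line format, `[Loaded <class> from <location>]`; newer JVMs print class loading differently, so the pattern would need adjusting there. `loaded_from` is a hypothetical helper name, and `hive-verbose.log` in the usage line is a hypothetical file holding the captured output.

```shell
# Hypothetical helper: keep only the load line(s) for one fully qualified class.
# Reads -verbose output on stdin; $1 = class name in dotted form.
# Assumes the Java 8 format "[Loaded <class> from <location>]".
loaded_from() {
  grep -F -- "[Loaded $1 from "
}

# Example usage, with the verbose output saved to a file first:
#   loaded_from org.elasticsearch.hadoop.thirdparty.apache.commons.httpclient.Wire < hive-verbose.log
```

A matching line shows which jar the class actually came from. No match means the JVM that produced the output never loaded that class.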

Thanks for testing out my queries, Keith!

I tested with elasticsearch-hadoop-7.16.0.jar but am still getting the same error. I couldn't find elasticsearch-hadoop-8.1.0-SNAPSHOT.jar. Where can I download it?
Today I was able to create the index in Elasticsearch (I checked using Dev Tools), but I cannot pull data from Elasticsearch by running a select on salina_test_es. I just found out that I need permission from the security team before I can use specific functions in the jar files. To make matters worse, error logging has been restricted because there is not enough memory. I tried the --verbose=true option in Hive, but it didn't seem to help, and I suspect that is because of the restricted error logging.

Are there any functions in the DefaultProtocolSocketFactory class that might cause security issues? I wonder if those are the ones getting blocked.

You don't need the 8.1.0 jar -- I built it from source code but it won't behave any differently for this problem than the other versions.
I don't know of any security problems in DefaultProtocolSocketFactory, but I have no idea what your security team has in place.
