Reading from ES with hive: How can I get the _score value?

limudonline · April 3, 2018, 11:33am

Hi,
I'm trying to read the _score value from Elasticsearch through Hive (I need to store it in a Hive table for further data investigation).

I tried the following, but the "_score" field comes back as null instead of its actual value.

I use Elasticsearch 6.1.1 and CDH 5.13 and here is what I did:

I loaded two simple docs into Elasticsearch as follows:

  PUT hdptst/txt/1
  {
    "txt": "this is a test"
  }

  PUT hdptst/txt/2
  {
    "txt": "this is also a  test"
  }

Then I created a Hive table as follows:

  CREATE EXTERNAL TABLE es_to_hive_tbl(
      txt string,
      metadata map<string,string>)
  STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
  TBLPROPERTIES('es.nodes' = <my node>,
                'es.resource' = 'hdptst/txt',
                'es.read.metadata' = 'true',
                'es.read.metadata.field' = 'metadata'
  );

When I run -

  select * from es_to_hive_tbl

I get the following result:

         txt                     | metadata
   1     this is a test          | {_index:hdptst,_type:txt,_id:1,_score:null}
   2     this is also a test     | {_index:hdptst,_type:txt,_id:2,_score:null}

Note that _score is null, while it should be equal to 1.

I also tried defining metadata as a struct, as follows:
struct<_index:string,_type:string,_id:string,_score:float>, but got the same results.

Is this a bug? Am I missing something? How can I get the _score value?

Thank you

james.baiera · April 16, 2018, 8:11pm

Sorry for the late reply on this. This is indeed a bug. I've committed a fix for it on master.

This is a problem with the underlying scroll requests to Elasticsearch. We explicitly set the document sort for our scroll requests to use _doc for performance reasons, and with that sort the _score is not calculated. I've made a change to the request format to calculate and send back the _score of each document when reading metadata is enabled. This may cause a small performance hit for complex pushdown queries, but reading metadata is generally uncommon enough that the hit is a reasonable price for correct results. If the performance hit is ever too high, users can work around it by encoding the metadata into the document source and skipping the read metadata feature.
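
As a rough sketch of how the score could then be copied into a regular Hive table, assuming a connector build that includes this fix: the table names es_scored_tbl and hive_scores, the 'es.query' value, and the cast to FLOAT are illustrative assumptions, not something from the original post, and <my node> is the same placeholder used in the question. A pushdown query is included because a plain match-all read gives every document the same score.

```sql
-- Hypothetical external table: same mapping as in the question,
-- plus a pushdown query so that scores differ per document.
CREATE EXTERNAL TABLE es_scored_tbl(
    txt string,
    metadata map<string,string>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.nodes' = '<my node>',          -- placeholder, replace with a real host
              'es.resource' = 'hdptst/txt',
              'es.query' = '?q=txt:also',        -- assumed example query
              'es.read.metadata' = 'true',
              'es.read.metadata.field' = 'metadata');

-- Materialize the text, document id and score into a native Hive table.
-- On a build without the fix, score would simply be NULL.
CREATE TABLE hive_scores AS
SELECT txt,
       metadata['_id'] AS doc_id,
       CAST(metadata['_score'] AS FLOAT) AS score
FROM es_scored_tbl;
```

If the scoring cost ever became a problem, the workaround mentioned above would amount to dropping the two 'es.read.metadata' properties and selecting only fields that are stored in the document source.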

Thanks for the post here!

