Reading from ES with Hive: How can I get the _score value?

Hi,
I'm trying to read the _score value from Elasticsearch through Hive (I need to store it in a Hive table for further data investigation).

I tried the following, but the "_score" field comes back as null instead of its actual value.

I use Elasticsearch 6.1.1 and CDH 5.13. Here is what I did:

I loaded 2 simple docs into Elasticsearch as follows:

  PUT hdptst/txt/1
  {
    "txt": "this is a test"
  }

  PUT hdptst/txt/2
  {
    "txt": "this is also a test"
  }

Then I created a Hive table as follows:

  CREATE EXTERNAL TABLE es_to_hive_tbl(
      txt string,
      metadata map<string,string>)
  STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
  TBLPROPERTIES(
      'es.nodes' = <my node>,
      'es.resource' = 'hdptst/txt',
      'es.read.metadata' = 'true',
      'es.read.metadata.field' = 'metadata'
  );

When I run:

  select * from es_to_hive_tbl 

I get the following result:

       txt                    | metadata
  1    this is a test         | {_index:hdptst,_type:txt,_id:1,_score:null}
  2    this is also a test    | {_index:hdptst,_type:txt,_id:2,_score:null}

Note that _score is null, while it should be equal to 1.

I also tried defining metadata as a struct, as follows,
struct<_index:string,_type:string,_id:string,_score:float> but got the same result.

Is this a bug? Am I missing something? How can I get the _score value?

Thank you

Sorry for the late reply on this. This is indeed a bug. I've committed a fix for it on master.

The problem is in the underlying scroll requests to Elasticsearch. For performance reasons, we explicitly sort our scroll requests by _doc, and Elasticsearch does not compute scores when results are sorted on anything other than _score, so _score comes back as null. I've changed the request format so that the _score of each document is calculated and sent back when reading metadata is enabled. This may cause a small performance hit for complex pushdown queries, but reading document metadata is generally uncommon enough that the cost is reasonable in exchange for correct results. If the performance hit is ever too high, users can work around it by encoding the metadata into the document source and skipping the read metadata feature.
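
To illustrate the idea, here is a rough sketch in Python of how a scroll request body could ask for scores while still sorting by _doc. This is not the connector's actual code: the helper name `build_scroll_body` and its parameters are made up for the example, and using the `track_scores` search option is an assumption about one way to get this behavior, not a description of the committed fix.

```python
import json


def build_scroll_body(query, read_metadata):
    """Build a search body for a scroll request (hypothetical helper)."""
    body = {
        "query": query,
        # Sorting by _doc is the cheapest order for scrolling, but
        # Elasticsearch skips scoring when the sort is not on _score.
        "sort": ["_doc"],
    }
    if read_metadata:
        # Assumption for this sketch: track_scores asks Elasticsearch to
        # compute _score even though results are sorted on another field.
        body["track_scores"] = True
    return body


if __name__ == "__main__":
    # Show the body that would be sent when metadata reading is enabled.
    print(json.dumps(build_scroll_body({"match_all": {}}, True), indent=2))
```

With `read_metadata` set to false the body keeps only the query and the _doc sort, so scans that do not need metadata avoid the scoring cost.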

Thanks for the post here!
