Hive to Elastic Data Load


I have loaded data containing 536,965,213 rows from Hive into Elasticsearch. But when I check the loaded data via Kibana, it displays 548,889,213 hits, which is more than the number of records present in my Hive table.

I have read about "precision_threshold" in the link below.

But I still have a few questions:

  • If the data cardinality is high, why does Elasticsearch report some extra records? Could this become a memory issue in a use case like this, i.e., dealing with huge data volumes?

  • Will this produce any bad/duplicate data when I read it using the Java REST API?

Also, the link provided above suggests adding hash properties to my mapping.

  • Does this mean I need to add the hash property to all the fields I use in my search use case, or is it only for making searches faster?

  • Will the count in Elasticsearch match my Hive count after I use this "hash" property in the mapping?

Just for reference, below are the table properties used. My table has 40 columns as of now, which might increase to 150 columns going forward.



@dadoonet can you please help me on this?

Please be patient in waiting for responses to your question, and refrain from pinging multiple times asking for a response or opening multiple topics for the same question. This is a community forum; it may take time for someone to reply to your question. For more information, please refer to the Community Code of Conduct, specifically the section "Be patient". Also, please refrain from pinging folks directly; this is a forum, and anyone who participates might be able to assist you.

If you are in need of a service with an SLA that covers response times for questions then you may want to consider talking to us about a subscription.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.

I don't know Hive.

What is the index name?

Index name is "eshive".

My apologies for tagging you.

I really don't know whether this was addressed previously in this forum.

Please share a link if it was addressed previously.

How do you check the number of documents in Kibana? Are you using a cardinality aggregation (given that you linked to that), or looking at, e.g., the output of _cat/indices?
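For context, the cardinality aggregation is approximate by design; "precision_threshold" (default 3000, maximum 40000) only controls the count below which results are expected to be close to exact, so its value can differ from the true document count. A sketch of such a query against the index mentioned in this thread (the field name row_id is a placeholder):

```json
POST eshive/_search
{
  "size": 0,
  "aggs": {
    "distinct_rows": {
      "cardinality": {
        "field": "row_id",
        "precision_threshold": 40000
      }
    }
  }
}
```

By contrast, the doc count shown by _cat/indices (or the _count API) is an exact count of indexed documents, which is why comparing the two can be misleading.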

I am simply navigating to the Discover tab and checking the number of documents present under that particular index.

This is the output I got when I execute _cat/indices:

green open eshive_xxxx PPwGbKyKSpujbZDbYM5jgw 1 1 548889213 0 321.5gb 145.7gb

It does seem like you have had some duplicates inserted, as the document count is higher than expected, which will have an impact on query results. It also looks like you have indexed all your data into a single primary shard, which is quite large.

Clients indexing into Elasticsearch generally provide an at-least-once delivery guarantee, so if you push the cluster too hard or requests end up timing out, e.g. due to long GC, it is possible that clients will resend the requests, which can cause duplicates unless you specify the document ID client-side.
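One common way to make such retries idempotent is to derive the document ID deterministically from the record itself, so a resent request overwrites the existing document instead of creating a duplicate. A minimal sketch in Python (the field names are hypothetical):

```python
import hashlib

def doc_id(record, key_fields):
    """Build a stable document _id by hashing the record's natural key.

    Re-sending the same record yields the same _id, so an index request
    with this _id overwrites the existing document rather than creating
    a duplicate.
    """
    raw = "|".join(str(record[f]) for f in key_fields)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# Two identical records map to the same _id; a different record does not.
r1 = {"order_id": 42, "line_no": 1, "amount": 9.99}
r2 = {"order_id": 42, "line_no": 1, "amount": 9.99}
assert doc_id(r1, ["order_id", "line_no"]) == doc_id(r2, ["order_id", "line_no"])
```

The same idea applies regardless of the client: as long as the ID is a pure function of the record's key fields, retries cannot inflate the document count.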

I do not know the ES-Hadoop plugin, so I am not sure how you control concurrency. I would recommend looking at the Elasticsearch logs around the period when you performed the indexing to see if there is any evidence of long or slow GC or any other errors/issues.

Also worth investigating is if any of the write tasks failed and restarted on the Hadoop side. If any tasks fail when writing data to Elasticsearch they will be restarted, and all previously accepted documents from that split will be rewritten. This can be avoided by telling ES-Hadoop which field should be considered a unique identifier for a document. In this case, ES-Hadoop will make sure the value from that field is extracted and set as the document's id for the purposes of avoiding duplicates.
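As an illustration, in ES-Hadoop's Hive integration the unique-identifier field is configured via the es.mapping.id table property. A sketch, assuming a Hive external table backed by the eshive index (the table name, column names, and host are placeholders):

```sql
CREATE EXTERNAL TABLE eshive_ext (
  row_id BIGINT,
  col_a  STRING
  -- remaining columns omitted
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'eshive/doc',
  'es.nodes'      = 'es-host:9200',
  -- use row_id as the Elasticsearch document _id so retried
  -- task splits overwrite rather than duplicate documents
  'es.mapping.id' = 'row_id'
);
```

With this in place, a restarted task that rewrites its split simply re-indexes the same documents under the same IDs.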

Sure, James, I will investigate this and get back to you with the details.

Yes, this count is because of the mappers that were killed.

Thank you for bringing up this point.

Is there any way to prevent duplicate records in this scenario?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.