Hive to Elastic Data Load


I have loaded data containing 536,965,213 rows from Hive into Elasticsearch. But when I check the loaded data via Kibana, it displays 548,889,213 hits, which is more than the number of records present in my Hive table.

I have read about "precision_threshold" in the link below.

But I still have a few questions:

  • If the data cardinality is high, why does Elasticsearch report some extra records? Could this become a memory issue in a use case like this, i.e., dealing with huge data volumes?

  • Will this produce any bad/duplicate data when I read it using the Java REST API?

Also, the link provided above suggests adding hash properties to my mapping.

  • Does this mean I need to add the hash property to all the fields I use in my search use case, or is it only for making searches faster?

  • Will the count in Elasticsearch match my Hive count after I use this "hash" property in the mapping?

Just for reference, below are the table properties used. My table has 40 columns as of now, which might increase to 150 columns going forward.



@dadoonet can you please help me on this?

Please be patient in waiting for responses to your question, and refrain from pinging multiple times asking for a response or opening multiple topics for the same question. This is a community forum; it may take time for someone to reply to your question. For more information, please refer to the Community Code of Conduct, specifically the section "Be patient". Also, please refrain from pinging folks directly; this is a forum, and anyone who participates might be able to assist you.

If you are in need of a service with an SLA that covers response times for questions then you may want to consider talking to us about a subscription.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.

I don't know Hive.

What is the index name?

Index name is "eshive".

My apologies for tagging you.

I really don't know whether this was addressed previously in this forum.

Please share a link if it was addressed previously.

How do you check the number of documents in Kibana? Are you using a cardinality aggregation (given that you linked to that), or looking at, e.g., the output of _cat/indices?
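For context, the cardinality aggregation is approximate by design; "precision_threshold" (default 3000, maximum 40000) only controls the count below which results are expected to be close to exact, so its value can differ from the true document count. A sketch of such a query against the index mentioned in this thread (the field name row_id is a placeholder):

```json
POST eshive/_search
{
  "size": 0,
  "aggs": {
    "distinct_rows": {
      "cardinality": {
        "field": "row_id",
        "precision_threshold": 40000
      }
    }
  }
}
```

By contrast, the doc count shown by _cat/indices (or the _count API) is an exact count of indexed documents, which is why comparing the two can be misleading.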

I am simply navigating to the Discover tab and checking the number of documents present under that particular index.

This is the output I got when I execute _cat/indices:

green open eshive_xxxx PPwGbKyKSpujbZDbYM5jgw 1 1 548889213 0 321.5gb 145.7gb

It does seem like you have had some duplicates inserted, as the document count is higher than expected, which will have an impact on query results. It also looks like you have indexed all your data into a single primary shard, which is quite large.

Clients indexing into Elasticsearch generally provide an at-least-once delivery guarantee, so if you push the cluster too hard or requests end up timing out, e.g. due to long GC, it is possible that clients will resend the requests, which can cause duplicates unless you specify the document ID client-side.
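One common way to make such retries idempotent is to derive the document ID deterministically from the record itself, so a resent request overwrites the existing document instead of creating a duplicate. A minimal sketch in Python (the field names are hypothetical):

```python
import hashlib

def doc_id(record, key_fields):
    """Build a stable document _id by hashing the record's natural key.

    Re-sending the same record yields the same _id, so an index request
    with this _id overwrites the existing document rather than creating
    a duplicate.
    """
    raw = "|".join(str(record[f]) for f in key_fields)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# Two identical records map to the same _id; a different record does not.
r1 = {"order_id": 42, "line_no": 1, "amount": 9.99}
r2 = {"order_id": 42, "line_no": 1, "amount": 9.99}
assert doc_id(r1, ["order_id", "line_no"]) == doc_id(r2, ["order_id", "line_no"])
```

The same idea applies regardless of the client: as long as the ID is a pure function of the record's key fields, retries cannot inflate the document count.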

I do not know the ES-Hadoop plugin, so I am not sure how you control concurrency. I would recommend looking at the Elasticsearch logs around the period when you performed the indexing to see if there is any evidence of long or slow GC or any other errors/issues.

Also worth investigating is if any of the write tasks failed and restarted on the Hadoop side. If any tasks fail when writing data to Elasticsearch they will be restarted, and all previously accepted documents from that split will be rewritten. This can be avoided by telling ES-Hadoop which field should be considered a unique identifier for a document. In this case, ES-Hadoop will make sure the value from that field is extracted and set as the document's id for the purposes of avoiding duplicates.
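As an illustration, in ES-Hadoop's Hive integration the unique-identifier field is configured via the es.mapping.id table property. A sketch, assuming a Hive external table backed by the eshive index (the table name, column names, and host are placeholders):

```sql
CREATE EXTERNAL TABLE eshive_ext (
  row_id BIGINT,
  col_a  STRING
  -- remaining columns omitted
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'eshive/doc',
  'es.nodes'      = 'es-host:9200',
  -- use row_id as the Elasticsearch document _id so retried
  -- task splits overwrite rather than duplicate documents
  'es.mapping.id' = 'row_id'
);
```

With this in place, a restarted task that rewrites its split simply re-indexes the same documents under the same IDs.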

Sure, James, I will investigate this and get back to you with the details.

Yes, this count is because of the mappers that were killed.

Thank you for bringing up this point.

Is there any way to prevent duplicate records in this scenario?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.