Data size on disk increases ~15x when data is moved from Hive to Elasticsearch

I am using the hive-hadoop jar to move data from Hive to Elasticsearch. The data size on disk is ~35GB, which becomes ~600GB when moved to ES. Is this expected behaviour, or is something missing on our end?

Replicas: 1
Shards: 1

What sort of data is it? What are the mappings you are using? What version of Elasticsearch?

Following is a sample of the index; there are ~700 fields of a similar type. The data in Hive is in a similar format, with rows and columns mapping to these data types. The Elasticsearch version is 8, but we are using compatibility mode ("true") to support the REST high-level client.

{"dev_sample":{"aliases":{},"mappings":{"properties":{"acquisition_mgm_addressable":{"type":"integer"},"acquisition_mgm_event_h":{"type":"integer"},"data_usage_segment_h":{"type":"text","fields":{"keyword":{"type":"keyword"}}},"data_usage_segment_n":{"type":"text","fields":{"keyword":{"type":"keyword"}}}}},,"settings":{"index":{"routing":{"allocation":{"include":{"_tier_preference":"data_content"}}},"number_of_shards":"1","provided_name":"dev_sample","creation_date":"1647005730356","number_of_replicas":"1","uuid":"WacLhoypSYWAsJH8_Qtsag","version":{"created":"8000199"}}}}}

Did you create the mapping or is it dynamic?

Created the mapping.

As in there's over 700 fields? Are they all the same?
Do you need the text and keyword?

Approximately 50% of the fields are int/double, and the remaining 50% are mapped as both text and keyword, which is required for searching. Total number of fields: 703.

Did you have any indexing in Hive? Elasticsearch is indexing all of your data for search, which has a cost. 17x does sound incredibly high, but if you're now indexing 703 fields for every document, and previously you were only storing source, that doesn't sound crazy. Do you need all the fields indexed? Including your mapping here might help.
Also what was your replication in HDFS? Does the 35 GB include replicas?
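
On the "do you need all the fields indexed" point: for fields that are only filtered or aggregated on but never searched, indexing can be disabled to save space. A minimal sketch only (the index name `dev_sample_slim` is made up; the field name is taken from the mapping above):

```
// Sketch: dev_sample_slim is a hypothetical index name.
// "index": false keeps the integer aggregatable (via doc_values)
// but removes it from the searchable index structures.
PUT /dev_sample_slim
{
  "mappings": {
    "properties": {
      "acquisition_mgm_addressable": { "type": "integer", "index": false }
    }
  }
}
```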

Can you please help me differentiate between storing and indexing? As per my understanding, "store" means this:

> By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.

We are explicitly not storing it. Do you mean adding both text and keyword to each field?

Also, this does not include replicas. Currently I am working on a single-node cluster, so the ES data is also not replicated.

I cannot actually share the complete mapping due to client restrictions. All the fields are similar to those mentioned above.

Also, can storing and indexing make the data size grow 17x? I could imagine it doubling or tripling at most.

If you are setting number_of_replicas in Elasticsearch to 1, you're actually getting two copies of the data (one primary and one replica). The terminology is a little different from HDFS, where a replication of 1 means there is just one copy of the data.
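
As an aside, if you want only a single copy on disk, you can set replicas to 0 explicitly. A minimal sketch against the dev_sample index from your mapping above:

```
// Drops the replica so only the primary copy of each shard exists.
PUT /dev_sample/_settings
{
  "index": { "number_of_replicas": 0 }
}
```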

If you are indexing 35GB of raw data with that many fields I suspect you would be better off with a larger number of primary shards. I would recommend setting it to 3 or perhaps even 5 in an index template.
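
For illustration, an index template along those lines might look like this (the template name and index pattern are made up; adjust to your naming):

```
// Hypothetical template; applies to any new index named dev_sample*.
PUT /_index_template/dev_sample_template
{
  "index_patterns": ["dev_sample*"],
  "template": {
    "settings": {
      "number_of_shards": 3
    }
  }
}
```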

I would also recommend you go through this guide if you have not already.

I would expect the text fields to have the potential to take up a lot of space here. Exactly how much space they will take up should depend on the length and cardinality of the text fields as well as how they tokenize with the default analyzer.

If you have fields that you want to aggregate on, but which are not normal text to be broken up by the default analyzer, I would recommend mapping these as keyword only where possible.

If you have long text fields with high cardinality, check if you can map these as text only and not keyword.
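
To sketch both suggestions in one mapping (the index and field names here are invented placeholders, not from your index):

```
// segment_code: keyword only — exact-match filters and aggregations,
//   no analysis chain or inverted-index postings for tokenized text.
// long_description: text only — full-text search without a keyword
//   sub-field duplicating the whole value on disk.
PUT /dev_sample_v2
{
  "mappings": {
    "properties": {
      "segment_code":     { "type": "keyword" },
      "long_description": { "type": "text" }
    }
  }
}
```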

Having said that, I would acknowledge that your ratio looks extreme (I have never seen anything even remotely that high, as far as I can recall), but it is very difficult to determine why without analysing the data.

You could take a look at Analyze index disk usage API | Elasticsearch Guide [8.2] | Elastic to see where the disk usage in the index is coming from. You might need to run the command with run_expensive_tasks=true to get all the needed information (I'm not exactly sure what is returned when you don't include it).
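
For reference, the call would look something like this (index name taken from your mapping above):

```
// Include run_expensive_tasks=true to get the full per-field breakdown.
POST /dev_sample/_disk_usage?run_expensive_tasks=true
```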

I am working on a single-node cluster; in that case too, will a replica factor of 1 mean two copies? My understanding was that in a single-node cluster a replica factor of 1 does not mean anything. Isn't that true?

I tried running it, but I don't get any output even after the query has been running for 6-7 hours.

Hmm, that doesn't sound right; that query shouldn't take anywhere near that long. Could I ask how you're running the query? Via the Kibana console, curl, etc.? If you're using the Kibana console, I'd suggest trying the command with curl, as the Kibana console doesn't handle long-running queries nicely.

I tried using curl, which gives the following output:

<HEAD><TITLE>Connection Timed Out</TITLE></HEAD>
<BODY BGCOLOR="white" FGCOLOR="black"><H1>Connection Timed Out</H1><HR>
<FONT FACE="Helvetica,Arial"><B>
Description: Connection Timed Out</B></FONT>
<HR>
<!-- default "Connection Timed Out" response (504) -->
</BODY>```




 Postman n Insomnia don't give output for 6-7 hrs

That does not look like Elasticsearch output. Are you going through some proxy that may be applying a timeout?

Yes, but it's the client's server and I cannot change it. Is there any workaround?

You need to be able to run requests against the node without having timeouts imposed. One way would be to log onto the node and run it locally, but in the end what is and is not possible will depend on your infrastructure.
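
For example, from a shell on the Elasticsearch node itself (a sketch only; adjust host, port, and authentication to your setup):

```sh
# Running locally avoids the proxy and its 504 timeout.
# With ES 8 security enabled you may need https plus -u user:pass and --cacert.
curl -s -X POST "http://localhost:9200/dev_sample/_disk_usage?run_expensive_tasks=true&pretty"
```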
