Data size on disk increase 15 times when moved from hive to elasticsearch

I am using hive-hadoop jar to move data from hive to Elasticsearch. The data size on disk is ~ 35GB which when moved to ES becomes ~600GB. Is this expected behaviour? Or is there that something is missing from our end.

Shard: 1

What sort of data is it? What are the mappings you are using? What version of Elasticsearch?

Following is the sample of index. Similar type of 700 fields are there. Data on hive is also of similar format rows n columns mapping to these data types. Elasticsearch version is 8. But using compatibility mode "true" for supporting rest high level client.


Did you create the mapping or is it dynamic?

Created the mapping.

As in there's over 700 fields? Are they all the same?
Do you need the text and keyword?

Few fields are approx 50% are int/double and remaining 50% are both text n keyword. Required for searching. Total number of fields: 703

Did you have any indexing in Hive? Elasticsearch is indexing all of your data for search, which has a cost. 17x does sound incredibly high, but if you're now indexing 703 fields for every document, and previously you were only storing source, that doesn't sound crazy. Do you need all the fields indexed? Including your mapping here might help.
Also what was your replication in HDFS? Does the 35 GB include replicas?

Can you please help me to differentiate between storing n indexing. As per my understanding store means this
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.

We are explicitly not storing it. Do you mean adding text n keyword both to each field?

Also this does not include replica, but currently I am working on single cluster, so ES data is also not replicated.

I cannot actually share complete mapping due to client restrictions. All the fields are similar to what have been mentioned above.

Also can this storing and indexing make the data size 17x , I can imagine it to be doubled/tripled at max.

If you are setting number_of_replicas in Elasticsearch to 1, you're actually getting two copies of the data (one primary and one replica). The terminology is a little different from HDFS, where a replication of 1 means there is just one copy of the data.

If you are indexing 35GB of raw data with that many fields I suspect you would be better off with a larger number of primary shards. I would recommend setting it to 3 or perhaps even 5 in an index template.

I would also recommend you go though this guide if you have not already.

I would expect the text fields to have the potential to take up a lot of space here. Exactly how much space they will take up should depend on the length and cardinality of the text fields as well as how they tokenize with the default analyzer.

If you have fields that you want to aggregate on but are not normal text that can be broken up by the default analyzer I would recommend mapping these as keyword only where possible.

If you have long text fields with high cardinality, check if you can map these as text only and not keyword.

Having said that I would acknowldege that your ratio looks extreme (I have never seen anything even remotely that high as far as I can recall), but it is very difficult to determine why without analysis of the data.

You could take a look at: Analyze index disk usage API | Elasticsearch Guide [8.2] | Elastic to see where the disk usage is coming from in the index. You might need to run the command with run_expensive_tasks=true to get all the needed information (not exactly sure what is returned when you don't include it)

1 Like

I am working on single node cluster, in that case too will replica factor of 1 mean two copies? My understanding was in single node cluster replica factor 1 does not mean anything. Isn't that true?

I tried running, but I don't get any output even after this query runs 6-7 hrs.

Hmm, that doesn't sound right, that query shouldn't take anywhere near that long. Could I ask how you're running the query? Via Kibana console, curl, etc..? If you're using the Kibana console, I'd suggest trying the command with curl, as the Kibana console doesn't really handle long running queries nicely.

I tried using curl that gives the following output.

<HEAD><TITLE>Connection Timed Out</TITLE></HEAD>
<BODY BGCOLOR="white" FGCOLOR="black"><H1>Connection Timed Out</H1><HR>
<FONT FACE="Helvetica,Arial"><B>
Description: Connection Timed Out</B></FONT>
<!-- default "Connection Timed Out" response (504) -->

 Postman n Insomnia don't give output for 6-7 hrs

That does not look like Elasticsearch output. Are you going through some proxy that may be applying a timeout?

1 Like

Yes, but it's client server and I cannot change it. Any work around?

You need to be able to run requests against the node without having timeouts imposed. One way would be to log onto the node and run it locally, but in the end what is and is not possible will depend on your infrastructure.

1 Like