Scaling up for petabyte sizes?


#1

I had a performance problem: ES queries became really slow when the dataset grew to several petabytes... What approach can I use to scale up for larger datasets while preserving the original data? For example: is it possible to increase the number of primary shards in a running ES cluster?
Thank you,


(Christian Dahlqvist) #2

How much data did you have? How many nodes? Did you identify what was limiting performance (CPU, memory, network, disk)?


#3

3 nodes / 3 primary shards. I don't see resource over-utilization as such; the most limiting factor is probably disk usage - ES data takes up ~80% of the available disk space.
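
For reference, per-node disk usage can be checked with the cat allocation API (a minimal example, assuming the cluster is reachable at localhost:9200):

    curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.used,disk.avail,disk.percent'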


(Christian Dahlqvist) #4

How is this related to the 3-node cluster?


#5

Sorry, I'm not sure I understand


#6

Here's the general question: is it possible to increase cluster/sharding size while preserving existing data?


(Christian Dahlqvist) #7

You said you had performance problems when the dataset grew to several petabytes. That is clearly not possible with the 3 nodes you then mentioned, which leaves me confused.


#8

Sorry, my bad. I meant 3 replicas


(Christian Dahlqvist) #9

I still do not understand. Can you please clarify? How much data did you have in the cluster? How many nodes were used?


#10

Close to 2 petabytes of data, 3 nodes.


(Christian Dahlqvist) #11

That is not possible. Are you mixing up your units? Is it by any chance 2 terabytes?

The easiest way to determine the amount of data is probably for you to provide the full output of the cluster stats API.
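
A minimal example of calling it, assuming the cluster is reachable at localhost:9200:

    curl -s 'http://localhost:9200/_cluster/stats?human&pretty'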


#12

Why is it not possible? Is there some limit?


(Christian Dahlqvist) #13

Please provide the output from the API I linked to.


(Mike Barretta) #14

@lvic if you just want to know how to change the shard count of an existing index, see:
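
A minimal sketch of one way to do this with the split index API (assuming ES 7.x or later, a cluster at localhost:9200, and an index named my-index with 3 primary shards; the index name and target shard count are placeholders):

    # Block writes on the source index; the split API requires this
    curl -s -X PUT 'http://localhost:9200/my-index/_settings' \
      -H 'Content-Type: application/json' \
      -d '{"settings": {"index.blocks.write": true}}'

    # Split the 3-shard index into a new 6-shard index
    curl -s -X POST 'http://localhost:9200/my-index/_split/my-index-split' \
      -H 'Content-Type: application/json' \
      -d '{"settings": {"index.number_of_shards": 6}}'

Note that the target shard count must be a multiple of the source shard count, and the result is a new index under a new name; the alternative is to reindex into a new index created with more primary shards.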


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.