Need Help Tuning ES for Large Data

niraj_kumar · June 28, 2017, 11:30pm

Hello Everyone,

I have a relatively large data set that we ingest with a custom Python script using the bulk API. It is AWS CloudTrail data for over 200 AWS accounts, aggregated in a single S3 bucket. This has been running well for a couple of years, but as the data and the number of AWS accounts grow each day, it has become a nightmare to manage the number of shards it builds. Currently it creates an index for every account and every region, say bigdata-us-east- , and then us-west-2-. This goes on for every combination of all 13 AWS regions and 200 AWS accounts, so a lot of shards get created every single day. If a single instance goes down, it takes a really long time to rebuild it.

If someone can suggest a better way of indexing the data that reduces the motherlode of shards, it would really help.

My cluster health output:-

{
"cluster_name": "elk-prod",
"status": "red",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 1,
"active_primary_shards": 198,
"active_shards": 198,
"relocating_shards": 0,
"initializing_shards": 4,
"unassigned_shards": 29020,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 14,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 19148,
"active_shards_percent_as_number": 0.6775716925603997
}

FYI, this is a 9-node cluster with 3 data, 3 ingest, and 3 master nodes, served behind HAProxy.

Please let me know if you need any further details from my cluster; I am ready to provide them.

--
Niraj

warkolm · June 28, 2017, 11:31pm

What version are you on?

niraj_kumar · June 28, 2017, 11:59pm

5.2.1

warkolm · June 29, 2017, 2:39am

Use the shrink API to reduce your shard count, then look at ways to merge some of your indices.

Maybe by region, or by account but do monthly/yearly and a low shard count?
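
To put rough numbers on that suggestion, here is a minimal sketch of the shard arithmetic. The 200 accounts, 13 regions, and 3 primary shards per index come from this thread; daily index creation, the 30-day month, and the `cloudtrail` prefix are assumptions for illustration, and the monthly layout assumes the account ID is stored as a field in each document rather than in the index name.

```python
from datetime import date

ACCOUNTS = 200        # from the original post
REGIONS = 13          # from the original post
SHARDS_PER_INDEX = 3  # primary shard count reported later in the thread


def primaries_per_month(indices_per_period: int, periods_per_month: int, shards: int) -> int:
    """Primary shards created in one month for a given index layout."""
    return indices_per_period * periods_per_month * shards


def monthly_region_index(region: str, day: date, prefix: str = "cloudtrail") -> str:
    """Hypothetical naming scheme: one index per region per month."""
    return f"{prefix}-{region}-{day:%Y.%m}"


# Assumed current layout: one index per account per region per day, 30 days.
current = primaries_per_month(ACCOUNTS * REGIONS, 30, SHARDS_PER_INDEX)
# Sketched layout: one index per region per month, one primary shard each.
proposed = primaries_per_month(REGIONS, 1, 1)

print(current, proposed)
print(monthly_region_index("us-east-1", date(2017, 6, 28)))
```

Whether one primary shard per region per month is enough depends on the data volume per region, which the thread does not state, so treat the shard count as a starting point to tune.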

niraj_kumar · June 29, 2017, 5:29am

Thanks, Mark, for the suggestion. Can you show me an example of shrinking? I was looking at the documentation but found it a bit complicated.

Also, will shrinking result in any data loss? These are actual audit data of customers.

--
Niraj

warkolm · June 29, 2017, 10:09am

No it will not.

Does the Elastic blog post "And the big one said 'Rollover': Managing Elasticsearch time-based indices efficiently" help?
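
Since an example of shrinking was asked for, here is a minimal sketch of the two requests involved, built as plain Python data so they can be sent with any HTTP client. The index names and node name are hypothetical. The shrink API expects the source index to be write-blocked with a copy of every shard on one node, and the target shard count must be a factor of the source shard count, so an index with 3 primaries can only shrink to 1.

```python
import json


def shrink_requests(source: str, target: str, node_name: str,
                    source_shards: int = 3, target_shards: int = 1):
    """Return (method, path, body) tuples for preparing and shrinking an index."""
    if source_shards % target_shards != 0:
        raise ValueError("target shard count must be a factor of the source shard count")
    prepare = (
        "PUT",
        f"/{source}/_settings",
        {
            "settings": {
                # Move a copy of every shard onto one node.
                "index.routing.allocation.require._name": node_name,
                # Block writes; reads stay available.
                "index.blocks.write": True,
            }
        },
    )
    shrink = (
        "POST",
        f"/{source}/_shrink/{target}",
        {
            "settings": {
                "index.number_of_shards": target_shards,
                "index.number_of_replicas": 1,
            }
        },
    )
    return [prepare, shrink]


# Hypothetical names, for illustration only.
for method, path, body in shrink_requests(
        "cloudtrail-old-index", "cloudtrail-old-index-shrunk", "shrink-node-1"):
    print(method, path)
    print(json.dumps(body, indent=2))
```

Wait for the relocation triggered by the first request to finish before sending the second. The source index is left in place, so it can be deleted only after the shrunken index has been checked.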

niraj_kumar · June 30, 2017, 6:45am

Thanks for the link, Mark, that was helpful. So I should basically create a small script and move/shrink some old data to reduce the shard count, right?

warkolm · June 30, 2017, 7:19am

For some short term relief, yep.

warkolm · July 17, 2017, 9:11pm

A quick question @niraj_kumar , are you using the default number of shards or have you customised them?

niraj_kumar · July 17, 2017, 9:39pm

@warkolm I have customized them to three.

niraj_kumar · August 1, 2017, 4:25am

@warkolm, do we have documentation for merging indices?

warkolm · August 1, 2017, 5:52am

Indices? No, you will need to use the reindex API for that.
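
As a rough illustration, merging several small indices into one with the reindex API comes down to a single `POST /_reindex` request whose source lists the old indices. The sketch below only builds the request body; the index names are hypothetical, and the destination index is assumed to have been created beforehand with the desired shard count and mappings.

```python
import json


def reindex_body(sources, dest):
    """Build a _reindex body that copies documents from several indices into one."""
    sources = list(sources)
    if not sources:
        raise ValueError("at least one source index is required")
    if dest in sources:
        raise ValueError("destination must not be one of the source indices")
    return {
        "source": {"index": sources},
        "dest": {"index": dest},
    }


# Hypothetical names, for illustration only.
body = reindex_body(
    ["cloudtrail-account-a-us-east-1", "cloudtrail-account-b-us-east-1"],
    "cloudtrail-us-east-1-2017.06",
)
print("POST /_reindex")
print(json.dumps(body, indent=2))
```

Reindex copies documents and leaves the source indices untouched, so the shard count only drops once the old indices are deleted. Comparing document counts between source and destination before deleting is a sensible check for audit data.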

system · August 29, 2017, 5:53am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
