Hello Everyone,
I have a relatively large data set that we ingest with a custom Python script using the bulk API. The data is AWS CloudTrail logs for over 200 AWS accounts, aggregated in a single S3 bucket. This has been running well for a couple of years, but as the data volume and the number of AWS accounts grow each day, managing the number of shards it creates has become a nightmare. Currently the script creates an index for every account and every region, say like bigdata-us-east- , and then us-west-2-. This goes on for every combination of all 13 AWS regions and 200 AWS accounts, so a lot of shards get created every single day. As a result, if a single instance goes down, it takes a really long time to recover.
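To put a rough number on it, here is a minimal sketch of the arithmetic. The account and region counts are the ones above; the primaries and replicas per index are placeholder assumptions rather than my actual index settings, and it assumes every account logs in every region on a given day.

```python
# Rough count of what one index per account per region adds up to.
ACCOUNTS = 200  # AWS accounts, from the description above
REGIONS = 13    # AWS regions, from the description above

# Assumptions for illustration only; substitute the real index settings.
PRIMARIES_PER_INDEX = 1
REPLICAS_PER_PRIMARY = 1


def indices_per_day(accounts, regions):
    """One index per (account, region) combination."""
    return accounts * regions


def shards_per_day(accounts, regions, primaries, replicas):
    """Each index contributes its primaries plus a copy of each per replica."""
    return indices_per_day(accounts, regions) * primaries * (1 + replicas)


print(indices_per_day(ACCOUNTS, REGIONS))
print(shards_per_day(ACCOUNTS, REGIONS, PRIMARIES_PER_INDEX, REPLICAS_PER_PRIMARY))
```

Even with a single primary and a single replica per index, that is 2600 new indices and 5200 new shards for each day of data.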
If someone can suggest a better way of indexing this data that reduces the huge number of shards, it would really help.
My cluster health output:
{
"cluster_name": "elk-prod",
"status": "red",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 1,
"active_primary_shards": 198,
"active_shards": 198,
"relocating_shards": 0,
"initializing_shards": 4,
"unassigned_shards": 29020,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 14,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 19148,
"active_shards_percent_as_number": 0.6775716925603997
}
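For anyone who wants to sanity-check those numbers, this is a small sketch that recomputes the active percentage from the shard counts above. The formula (active shards divided by active plus initializing plus unassigned) is my assumption about how the figure is derived; it does reproduce the reported value.

```python
# Shard counts copied from the cluster health output above.
health = {
    "active_shards": 198,
    "initializing_shards": 4,
    "unassigned_shards": 29020,
    "active_shards_percent_as_number": 0.6775716925603997,
}


def total_shards(h):
    """All shards the cluster knows about (assumed: active + initializing + unassigned)."""
    return h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]


def active_percent(h):
    """Percentage of shards that are currently active."""
    return 100.0 * h["active_shards"] / total_shards(h)


print(total_shards(health))
print(active_percent(health))
```

So of 29222 shards in total, fewer than 1% are active right now.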
FYI, this is a 9-node cluster with 3 data, 3 ingest, and 3 master nodes, served behind HAProxy.
Please let me know if you need any further details about my cluster and I will be happy to provide them.
--
Niraj