I have the following cluster configuration for ES on AWS:
Instance Type: t2.small.elasticsearch
Number of Nodes: 1
EBS Volume Type: GP2
EBS Volume Size: 10 GiB
Heap size: 1 GiB
I am trying to index around 20,000 documents (a sample of my input data) using Python's bulk API. Following are the cluster stats while this operation is running:
Active Threads: 1
Max write queue size: 6
Max Heap size used in percentage: 75%
I have the following questions about the above:
Intermittently, I get the error: 403 Request throttled due to too many requests for the above operation. I tried scaling the cluster vertically (changed the volume size to 30 GiB) but the problem persists. How can I resolve this error? What other stats should I be checking to debug the issue?
As per my understanding, a maximum of 1 write operation can be performed with the above configuration. So how is the max queue size 6 when each bulk request in the stream inserts 500 documents? (I am using Python's helpers.bulk API.)
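For reference, my indexing call looks roughly like this (a minimal sketch; the endpoint, index name, and sample documents are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder endpoint -- substitute your AWS ES domain endpoint.
es = Elasticsearch("https://my-domain.us-east-1.es.amazonaws.com")

# Stand-in for the ~20,000 sample documents.
documents = [{"field": i} for i in range(20000)]

def generate_actions(docs):
    # Every action targets the same index; _source is the document body.
    for doc in docs:
        yield {"_index": "my-index", "_source": doc}

# helpers.bulk splits the action stream into chunks of 500 documents
# (the default chunk_size), so each HTTP request is a single bulk call
# carrying up to 500 index operations.
helpers.bulk(es, generate_actions(documents), chunk_size=500)
```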
All of that is a recipe for disaster. Index your data into a single index, which, given the data volumes you are talking about, does not need more than a single primary shard. Creating a new index per document is NEVER a good idea, as each shard adds overhead and there is a limit to the number of shards in a cluster. As your node has very limited resources with respect to CPU and IOPS, I would recommend not sending concurrent bulk requests. If you want to load data faster you need to use more powerful nodes.
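If it helps, creating such an index from Python could look like this (a sketch; the endpoint and index name are placeholders, and zero replicas only makes sense while you are on a single node):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-domain.us-east-1.es.amazonaws.com")  # placeholder endpoint

# A single primary shard is enough for this data volume; on a
# single-node cluster a replica could never be allocated anyway.
es.indices.create(
    index="my-index",  # placeholder name
    body={
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }
    },
)
```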
I am trying to find an optimal cluster configuration for my ES, hence I tried the base configuration first. So, to improve indexing for concurrent bulk requests, I will have to:
Change to a more powerful node and increase the number of nodes.
Decrease the number of shards to 1 for each index.
Also, should I change the heap size when moving to a more powerful node?
Given that you only have 10GB storage I would recommend having a single index, not multiple ones. If you need more than one index you should look to keep the number of indices to a minimum.
In order to handle multiple concurrent bulk requests you will need more CPU resources, and an m5 series node is likely a better choice. You are also likely to need more and faster storage. If I recall correctly, a 10GB gp2 EBS volume only supports 30 IOPS, which is likely to quickly become the bottleneck.
If you switch to a larger and more powerful node or a cluster you should set the heap size to 50% of the available RAM (up to a maximum of around 30GB heap).
It may also help us provide better guidance if you tell us a bit more about your use case and expected data volumes.
A tenant will have m products. A product will have p types, and a type can have t subtypes. We do some kind of prediction on the data, process it, and store it in ES for future search purposes.
Data Volume
We have to persist around 5 GB of data for each tenant (around 10 tenants) on an hourly basis, i.e. we will have to persist 50 GB of data within 1 hour.
Currently, we are using the following indexing strategy:
I would recommend using a single index per tenant. Do not use the indexing strategy you described, as it seems very, very inefficient and will cause you lots of problems.
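As an illustration, the bulk actions could target one index per tenant and keep product/type/subtype as document fields (a sketch; the field names are assumptions about your data model):

```python
def tenant_actions(docs):
    # One index per tenant; product, type and subtype stay as fields
    # inside the document instead of becoming separate indices.
    for doc in docs:
        yield {
            "_index": f"tenant-{doc['tenant_id']}",  # hypothetical field
            "_source": doc,
        }
```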
But considering the data size, I thought storing the data for all types in one index could cause issues, as the number of shards and the shard size would grow beyond the soft limit of 50 GB specified in most of the ES documentation.
Also, could you please let me know what potential problems we could face with the existing indexing strategy?
Edit:
Also, we could be updating indices with additional data after inserting it into ES.
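For example, the updates would go through the bulk API as partial documents (a sketch; the endpoint, _id, and field names are assumptions):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-domain.us-east-1.es.amazonaws.com")  # placeholder endpoint

def update_actions(docs):
    # Partial updates via the bulk API; doc_as_upsert creates the
    # document if it does not exist yet.
    for doc in docs:
        yield {
            "_op_type": "update",
            "_index": f"tenant-{doc['tenant_id']}",  # hypothetical field
            "_id": doc["id"],                        # hypothetical field
            "doc": doc,
            "doc_as_upsert": True,
        }

helpers.bulk(es, update_actions([{"id": "1", "tenant_id": "a", "count": 2}]))
```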
You initially stated you only had 10GB storage, which means you cannot have a problem with large shard sizes. Please describe the use case and tell us how much data you intend to index per day and how long you will need to keep that data in the cluster. Are you indexing immutable data, or will you be performing updates?
Okay, so I am currently working on a POC with a minimal configuration and a subset of our data. I am testing different configurations to find the optimal configuration for the subset of data I have. I can increase the storage or add new nodes at my discretion (there is no hard limit of 10 GB). We can perform updates on the data based on tenant requirements (at most once a month).
Data size for a day: 1.2 TB
Data life in ES: 1 year
How many tenants do you estimate the cluster will need to support?
If we, for simplicity, assume the indexed data will take up the same amount of space as the raw data and that you will need a replica for resiliency and availability, that is 876 TB of data (1.2 TB/day × 365 days ≈ 438 TB, doubled for the replica), which will require quite a large cluster.
Yup, we are aware of that. Currently, we are in the POC phase of our application and are checking the capabilities of AWS managed ES. For now, I am working with 2-3 tenants and 1 GB of data on an hourly basis.