If cost were the only problem I would address it; maybe I will query only the last 14 days of data. But even that is still big data, so the cost does not matter if the search is fast, even if the RAM size reaches 1TB.
To answer the original question: we've demonstrated that a single node can query 1PiB of data. It wasn't fast, nor something that'd really be appropriate in a production environment, but nor was that really the limit for a single node in pure storage-capacity terms.
However, the storage costs alone for a dataset of this size will be over $1M per year, even before you start to think about RAM, CPU, network transit and miscellaneous other resource costs. At that kind of scale even tiny optimizations will have a massive impact on the total costs, and you will need to do some extremely careful analysis and design to choose the optimal path.
However, this is a free community forum and we can't reasonably do that level of analysis and design for you. If you're not comfortable with designing this kind of system yourself then I think you need to spend some of your $MM/yr budget on professional advice in this area rather than relying on the volunteers here. It'll save you money in the long run to do this properly.
I'll do that, for sure, but I'm in a comprehensive study phase now to assess what I need for that.
If the consultants you're planning to use need you to find out the answer to the question in this thread up front, I strongly recommend you find different consultants.
Our cold nodes currently have about 20TB of data. I think they used to have more, but I don't have history that far back. Heap is OK: it "idles" at about 1 garbage collection per hour, but search activity can drive that up. I think it could handle 2x that data in our environment. (Heap size is 31GB.)
In prior versions, where more heap was used, we had problems around 15TB. (Long ago, maybe version 5 or 6?)
None of this may apply to your environment; that is Elastic.
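If you want to compare against your own cluster, per-node heap usage and garbage-collection counts are exposed by the nodes stats API. Here is a minimal sketch in Python; the unsecured `http://localhost:9200` endpoint is an assumption, so adjust the URL and authentication for your setup.

```python
import requests

# Per-node JVM heap usage and GC activity from the nodes stats API.
# Assumes an unsecured cluster on localhost:9200; adjust URL/auth as needed.
resp = requests.get("http://localhost:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]["collection_count"]
    print(f"{node['name']}: heap {heap_pct}% used, {old_gc} old-gen collections")
```

Watching how those numbers move under your own search load is a cheap way to see whether heap is the limiting factor at a given data density.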
On a related note, I submitted a ticket to our internal tracker for technical writing to write a guide based on some of these questions.
Thank you
This will be very useful and everyone will get to benefit.
I did not understand: do you mean that one node can hold 20TB or more?
I did a storage test for a webinar all the way back in 2018. Even then I recall pushing it to over 15TB per node for a test data set. A lot of improvements have been implemented since, and most of the limiting factors I encountered with respect to heap usage have been removed. This is why nodes can nowadays hold very large amounts of data without being limited by heap pressure. I read @rugenl's comment to mean that they currently have 20TB per node and that this meets their performance requirements. In test environments it seems they have, for their data set, managed to push it to 40TB per node, which is not that surprising.
The important thing to note is that this is for their use case and data, and that it meets their performance requirements. Their use case is not described, but in my experience a lot of use cases with high node densities rely on data analysis through aggregations rather than searching for raw documents. Their use case is likely to be quite different from yours, so I would not just take this and start making sizing calculations based on it. You will need to test to see what applies for your use case, data, queries and hardware.
You have not provided much detail about your use case, but you have been asking questions about retrieving large amounts of documents efficiently through searches and about the ability of Elasticsearch to handle high levels of concurrent queries. Based on this I would guess that your use case is likely to have significantly different performance characteristics.
Our use case for the large cold nodes is "retention to make auditors happy". These would probably be a good fit for searchable-snapshot frozen nodes now.
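For anyone following along, moving that kind of audit data onto the frozen tier is done by mounting an index from an existing snapshot with the searchable-snapshot mount API (this needs the appropriate licence). A minimal sketch, where the repository, snapshot and index names are placeholders:

```python
import requests

# Mount an index from an existing snapshot as a partially cached ("frozen")
# searchable snapshot. Repository, snapshot and index names are placeholders,
# and searchable snapshots require the appropriate licence.
resp = requests.post(
    "http://localhost:9200/_snapshot/my_repository/my_snapshot/_mount",
    params={"storage": "shared_cache", "wait_for_completion": "true"},
    json={"index": "audit-logs-000001"},
)
resp.raise_for_status()
print(resp.json())
```

With `storage=shared_cache` only a local cache lives on the frozen nodes, so the bulk of the data stays in the snapshot repository.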
Thank you for the beautiful and very useful tips
Yes, maybe the use case is completely different, or perhaps only a little. The data I expect is very large, so I need both ingestion and search to be as fast as possible.
From the study I have now done, one node would hold 30 terabytes, with 64GB of RAM and 12 CPU cores.
Would this be appropriate?
Yes; keeping all the data on the hot tier is always very expensive and can also become somewhat slow.
Therefore, I plan to keep data on the hot tier for 14 days only, and then move it to the other tiers, which use HDD storage.
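That 14-day hot retention is the kind of thing normally expressed as an ILM policy combined with data tiers (hot nodes on SSD, the later tiers on the HDD-backed nodes). A minimal sketch, assuming an unsecured cluster on localhost; the policy name, rollover thresholds and one-year retention are illustrative only:

```python
import requests

# Sketch of an ILM policy: roll over on the hot tier, move indices to the
# cold tier (HDD-backed nodes) after 14 days, and delete after a year.
# Policy name, rollover thresholds and retention are illustrative only.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "cold": {
                "min_age": "14d",
                "actions": {}  # ILM's default migrate action moves data to cold-tier nodes
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}}
            }
        }
    }
}

resp = requests.put("http://localhost:9200/_ilm/policy/logs-14d-hot", json=policy)
resp.raise_for_status()
print(resp.json())
```

The HDD-backed nodes would be given the `data_cold` role so that the default migrate action relocates indices onto them when the cold phase starts.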
Is Qnap NAS storage suitable for storing data?
It Depends™. For some use-cases that'd be fine, possibly even generous. For others it will be woefully inadequate. At the scale you're proposing there are no simple answers, you will need to run your own experiments. We recommend you use Rally to do that.
It Depends™. Technically Elasticsearch will work on any storage that acts like a local filesystem, but some NAS storage does not do that correctly and this can cause data loss. Even NAS systems which do correctly behave like a local filesystem will often have unacceptably slow performance.
Ok
I'll use it and do the experiment and test.
Ok
What kind would you recommend?
I guess it depends on your use case. If it is indeed a search use case the official guide around tuning for search speed recommends using local SSDs. This is also recommended in the guide around tuning for indexing speed, as indexing can be very I/O intensive.
Yes, I read it.
But that will be expensive, so I am thinking of using that type of storage only for the hot nodes, for 14 days, and then putting the data on less expensive storage.
I asked about "Qnap NAS" and the reply was that it would be slow.
So what would be better than that type of storage for keeping the data once the 14 days have passed?
For example, something suitable to store the cold-tier data in.
Are you most frequently going to query just the last 14 days worth of data? How often will you be querying the data that you move to slower storage?
The searches against the last 14 days of data will be under heavy pressure.
As for the data on HDD storage, it will be searched perhaps once a week or every two weeks, or in case of a review or a specific need.
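If most of the pressure really is on the last 14 days, those searches usually come down to a date-range filter, and Elasticsearch can skip the older, HDD-backed indices cheaply because their timestamp ranges don't match the filter. A minimal sketch, assuming a `logs-*` index pattern and a `@timestamp` field:

```python
import requests

# Search only the most recent 14 days of data; shards of older indices are
# skipped quickly because their @timestamp ranges don't match the filter.
# Index pattern, field name and endpoint are assumptions.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-14d/d"}}}
            ]
        }
    },
    "size": 100,
}

resp = requests.post("http://localhost:9200/logs-*/_search", json=query)
resp.raise_for_status()
print(resp.json()["hits"]["total"])
```

The occasional review queries against the older data can use the same request without the date filter; they will simply fan out to the slower tiers and take longer.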