Usage of a large amount of data in Elasticsearch

Hello,

I would like to ask a question related to the usage of Elasticsearch. What are your experiences with storing large amounts of data in Elasticsearch? Would you suggest this kind of usage for our case? Did you face any issues, etc.?

We have to store our payload messages on disk. Currently, we are writing roughly 1 TB of unformatted data a day to an Isilon disk. If we need to find this data on the disk, it takes a lot of time, and sometimes we are not able to find it at all with our script.

I can format the data that we produce, store it as JSON in Elasticsearch, and then find it easily through the Elasticsearch REST API. However, I am worried that issues could occur in production.
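Something like the sketch below is what I have in mind; the host, index, and field names are only placeholders:

```python
# Rough sketch (placeholder host, index, and field names): store each formatted
# message as a JSON document and look it up again later via the REST API.
import requests

ES = "http://localhost:9200"            # assumed single test node
doc = {
    "transaction_time": "2020-01-01T10:00:00",
    "unique_id": "abc-123",
    "payload_message": "<payload_messages>",
}

# Index the document; refresh=true makes it searchable immediately for the demo.
requests.post(f"{ES}/payloads-2020.01.01/_doc?refresh=true", json=doc).raise_for_status()

# Find it later with a term query (assuming unique_id is mapped as a keyword).
query = {"query": {"term": {"unique_id": "abc-123"}}}
hits = requests.post(f"{ES}/payloads-*/_search", json=query).json()["hits"]["hits"]
print(hits)
```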

I would be very glad if you could answer the questions below.

Can we use Elasticsearch with a 30 TB capacity? Have you experienced this kind of usage?
If we archive data older than 30 days to an Isilon disk daily, can we bring it back into Elasticsearch to search it? For example, if we need data that was stored six months ago, I am not sure we could find it easily. Do you have any suggestions?

Thank you in advance,

Welcome to our community!

What format and size is the data: large numbers of small records, which ES handles well, or a small number of enormous files, which is more common on disk systems?

30TB is possible in ES, but you'll need a fair number of nodes, especially if you are indexing 1TB/day, as that's a lot. It also depends quite a bit on how the data arrives and how you'll search for it, such as by ID versus big full-text searches across all 30TB, which is kind of painful :wink:

There are some sizing docs around, and a nice webinar PDF here.

Hi Steve,

Thank you for the response.

Mostly, the size of each message is between 1 KB and 2 KB; those sizes are the minimum for this use case. We have seen some messages as large as 2 MB.

I shared an example message below. We use commas to separate the fields of a message.

<transaction_time>,-,<client_id>,<operation_name>,<service_name>,<typeOfMessages>,<server_name>,<consumer_code>,<customer_no>,<unique_id>|<payload_messages>

The flow which I designed and implemented for this case is: Logstash servers -> Kafka -> Apache NiFi (for transforming messages from text to JSON) -> Elasticsearch.
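For illustration, the transformation step amounts to something like the following (done by Apache NiFi in the real flow; the Python below is only a sketch, and the field names are my own labels for the example format above):

```python
# Sketch of the text-to-JSON transform (Apache NiFi does this in the real flow).
# The field names below are my own labels for the example format above.
import json

FIELDS = [
    "transaction_time", "dash", "client_id", "operation_name", "service_name",
    "type_of_messages", "server_name", "consumer_code", "customer_no", "unique_id",
]

def line_to_json(line: str) -> str:
    """Split '<header fields>|<payload_messages>' into a JSON document."""
    header, payload = line.split("|", 1)
    doc = dict(zip(FIELDS, header.split(",")))
    doc.pop("dash", None)                     # drop the literal '-' column
    doc["payload_message"] = payload
    return json.dumps(doc)

example = ("2020-01-01T10:00:00,-,42,getBalance,AccountService,REQUEST,"
           "srv01,web,1001,abc-123|<payload_messages>")
print(line_to_json(example))
```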

As far as I have heard from my colleague, if I store the formatted data in ES, replication and finding the data via the ES REST API should be easier and faster.

Nobody around me supported using ES for this use case. I am afraid that a critical capacity issue will occur in production because of it :slight_smile:

The other challenge is getting data back out of the ES archive.
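One approach I am considering for that is Elasticsearch's snapshot and restore API against a shared-filesystem repository on the Isilon mount. A minimal sketch (the repository name, paths, and index names are placeholders) would be:

```python
# Minimal sketch of archiving with snapshot/restore to a shared-filesystem
# repository (e.g. the Isilon volume mounted on every node and listed in
# path.repo). Repository name, paths, and index names are placeholders.
import requests

ES = "http://localhost:9200"

# 1. Register the archive repository once.
requests.put(f"{ES}/_snapshot/isilon_archive",
             json={"type": "fs", "settings": {"location": "/mnt/isilon/es-archive"}})

# 2. Snapshot an index older than 30 days, then delete it from the cluster.
requests.put(f"{ES}/_snapshot/isilon_archive/payloads-2020.01.01"
             "?wait_for_completion=true",
             json={"indices": "payloads-2020.01.01"})
requests.delete(f"{ES}/payloads-2020.01.01")

# 3. Months later: restore the snapshot so the data is searchable again.
requests.post(f"{ES}/_snapshot/isilon_archive/payloads-2020.01.01/_restore",
              json={"indices": "payloads-2020.01.01"})
```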

Thanks in advance for your answers.

Someone here with more experience at those scales can offer more direct advice, but here are my minor thoughts. At that size, and if you plan on buying support/enterprise licenses, I'm sure Elastic would love to help you with sales support, professional services, etc.

It does sound like a normal use case: LOTS of small messages/docs in JSON that are easy to query. I suggest following that PDF. Of course you'll need considerable disk space; even with only 1 replica that's 60TB of disk, plus overhead, so something like a 100TB cluster. Big and expensive, I'd think, but probably smaller and less expensive than your alternatives; nothing is easy at that size.

However, if you have a simple way to store, find, and read the messages now, such as by file name or some type of ISAM system on a single big SAN, that's likely cheaper than a big, distributed, flexible, and powerful clustered Elasticsearch system. It really depends on your use.

That's big enough that it's hard to size without trying it, as it's so use-case specific at that scale, and it will depend a lot on your index strategy, access patterns, and mappings (i.e. don't store more data/fields than you need at that scale).

I'm sure others have better advice, but if you can already generate and feed 2TB/day into such a system, you might build a PoC cluster of, say, 10 sizable data nodes (16-64GB RAM each, 2-4TB of disk), plus 3 separate masters and 2-4 ingest nodes, with simple indexes and shards under 50GB, and see how it performs with a few days of data: does it choke on your 2TB/day ingest, or answer queries in seconds or minutes, etc.? After a week, with 10TB of data, a lot should be obvious, if it's easy to set up and you have a bit of $$.
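For what it's worth, an index template along these lines is roughly what I mean by "simple indexes" (all names and numbers are illustrative and would need tuning):

```python
# Illustrative index template for daily indices; shard count and mappings are
# assumptions to be tuned so each shard stays under ~50GB.
import requests

ES = "http://localhost:9200"

template = {
    "index_patterns": ["payloads-*"],
    "template": {
        "settings": {
            "number_of_shards": 20,       # sized to keep shards under ~50GB/day
            "number_of_replicas": 1,
        },
        "mappings": {
            "properties": {
                # Map only what you actually search on; keep unqueried fields lean.
                "transaction_time": {"type": "date"},
                "unique_id":        {"type": "keyword"},
                "payload_message":  {"type": "text", "index": False},
            },
        },
    },
}
requests.put(f"{ES}/_index_template/payloads", json=template)
```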

It'll undoubtedly need lots of tuning in how you feed it: batching the ingest, mappings, index and shard sizes/splitting, and so on.
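For the ingest side, something like the official Python client's bulk helper is the usual starting point (the index name and chunk size below are just guesses to tune against your cluster):

```python
# Sketch of batched ingest with the official Python client's bulk helper.
# Index name and chunk size are guesses to be tuned against the cluster.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(docs, index):
    for doc in docs:
        yield {"_index": index, "_source": doc}

docs = ({"unique_id": str(i), "payload_message": "..."} for i in range(10_000))
helpers.bulk(es, actions(docs, "payloads-2020.01.01"), chunk_size=5_000)
```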

Kinda fun, actually, and I hope you report your results, issues, and learnings.


Thank you, Steve, for sharing your ideas and experience. Of course, I will share my experiences with you if we test and use ES for this case.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.