I am trying to build a benchmark for how much storage Elasticsearch will likely need. Below are some stats from sending new batches of records from Logstash to Elasticsearch; I checked the index size in Elasticsearch after each interval (records/hits vs. store size):
6 records - 11.4 KB
12 records - 22.4 KB
18 records - 33.4 KB
24 records - 44 KB
48 records - 66.9 KB
120 records - 92.9 KB
I don't see the storage increasing proportionally with the number of records, so how do I calculate how much storage I would need to send 20B records?
The storage required is not linear in the number of documents at small data volumes, as the space efficiency of the index data structures and compression improves as the index grows. You therefore need to index a good amount of data before you can estimate the average size per document and extrapolate from there. I would recommend indexing a few million documents and using that to estimate the average document size on disk.
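For illustration, here is a back-of-the-envelope version of that extrapolation (the sample numbers below are made up; plug in whatever your own test with a few million representative documents shows):

```python
# Hypothetical numbers: replace with the store size reported by _cat/indices
# or the index stats API after indexing a few million representative documents.
sample_docs = 5_000_000
sample_store_bytes = 3.2 * 1024**3        # e.g. 3.2 GiB on disk for the sample

avg_bytes_per_doc = sample_store_bytes / sample_docs

target_docs = 20_000_000_000              # the 20B records in question
estimated_bytes = target_docs * avg_bytes_per_doc

print(f"~{avg_bytes_per_doc:.0f} bytes/doc -> ~{estimated_bytes / 1024**4:.1f} TiB "
      "(primary data only; add replicas and indexing overhead on top)")
```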
This ^^^ is the only way you will get a good estimate!
It can seem a bit weird: at low volumes Elasticsearch may not look super efficient, as it is still setting up the base data structures etc., but the efficiencies show up at scale.
You might also look at the How To section in the docs... whether disk space or search speed is your biggest concern, there are ways to tune for each.
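For example, if disk space is the bigger concern, one knob in that area is the index codec. A minimal sketch, assuming the elasticsearch-py 8.x client and a made-up index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # adjust host/auth for your cluster

# "best_compression" trades a little CPU for smaller stored fields on disk.
es.indices.create(
    index="storage-benchmark",                # hypothetical index name
    settings={"index": {"codec": "best_compression"}},
)
```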
It is 20B records/documents per day, and I need to store them for approximately a year. I am not sure whether Elasticsearch would perform well at that scale; I am doing a pre-study to see if it would work.
I need to search on the timestamp to pick data from a specific period, and run regex searches on the records.
If you need to search based on regexes, make sure you consider the wildcard field type. It will likely take up more space, but it will also likely speed up searches significantly, as regex queries can be very expensive and slow. If you are going to use this, make sure you perform your storage test with it enabled for the required field(s).
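A minimal sketch of what that could look like, assuming the elasticsearch-py 8.x client and a hypothetical `record` field holding the raw text:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Map the raw record as a wildcard field so regex/wildcard queries stay fast.
es.indices.create(
    index="records-wildcard-test",             # hypothetical index name
    mappings={"properties": {
        "@timestamp": {"type": "date"},
        "record": {"type": "wildcard"},
    }},
)

# Example regex query against the wildcard field. Run the storage test with
# this mapping in place so the extra space is included in the estimate.
resp = es.search(
    index="records-wildcard-test",
    query={"regexp": {"record": "10\\..*133\\..*"}},
)
print(resp["hits"]["total"])
```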
Sounds like you will need a very large cluster, or possibly even a number of clusters.
Perhaps give some examples of what your searches look like..
Often folks use the term regex to mean different things...
There is also match_only_text which is like full text search but optimized for logs / machine telemetry etc..
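For a raw log/message field that only needs full-text matching, a mapping along these lines is one option (index and field names are just placeholders):

```python
# match_only_text drops scoring and position metadata to save space, while
# still supporting the usual full-text queries (phrase queries work but are
# slower since positions are not indexed).
mappings = {
    "properties": {
        "@timestamp": {"type": "date"},
        "message": {"type": "match_only_text"},   # hypothetical field name
    }
}
```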
Elasticsearch can definitely handle that volume, but at that scale the sizing and speed are all in the details.
We often see deployments at the scale you are talking about use several mid-sized clusters with CCS (cross-cluster search); this architecture tends to provide a nice mix of manageability, scale, flexibility, etc.
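With CCS, a single search can span several clusters by prefixing index patterns with the remote cluster alias; a minimal sketch (the cluster aliases and index pattern below are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # the local coordinating cluster

# "cluster_eu" and "cluster_us" would be remote clusters configured under
# the cluster.remote.* settings; the query fans out across all of them.
resp = es.search(
    index="logs-*,cluster_eu:logs-*,cluster_us:logs-*",
    query={"match_all": {}},
    size=10,
)
```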
Also, for that volume with long-term retention... you would probably want to consider the searchable snapshot capability, which could dramatically reduce your HW footprint and costs, even though it requires a subscription that has a cost.
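As a rough illustration, searchable snapshots are typically wired up through an ILM policy that moves older indices into the cold/frozen tiers. The sketch below uses hypothetical ages, names, and repository; it is the kind of body you would apply via `PUT _ilm/policy/<name>`:

```python
# Hypothetical ILM policy: roll over hot indices, then after 30 days mount the
# data as a searchable snapshot from the repository "my_snapshot_repo".
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_primary_shard_size": "50gb"}}
            },
            "frozen": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my_snapshot_repo"}
                },
            },
        }
    }
}
```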
The HW profiles, storage options, and amount of storage all have their own considerations.
Lots of things to consider... Let us know where we can help.
In the "record" there is an IP address (10.XX.133.XX) and a value (56560) that I want to query on, together with the timestamp. There could be multiple such records for an IP address and port.
I was planning to use the timestamp to query for records in a specific period and use a regex to search for the IP address and port. Is there a better way to arrange the fields and search for those?
Is it better to split the record up and have the values as separate fields?
It is generally always better to parse this data out into specific fields that are indexed; those can be searched much more efficiently than using regular expressions.
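A minimal sketch of that approach, assuming the elasticsearch-py 8.x client and hypothetical field names (`source.ip`, `source.port`): parse the IP and port into dedicated fields at ingest time, then combine a timestamp range with exact term filters instead of a regex:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Dedicated fields: `ip` type for the address, integer for the port.
es.indices.create(
    index="records-structured",                 # hypothetical index name
    mappings={"properties": {
        "@timestamp": {"type": "date"},
        "source": {"properties": {
            "ip": {"type": "ip"},
            "port": {"type": "integer"},
        }},
    }},
)

# Pick a time window and filter on exact IP + port -- no regex needed.
# IP and dates below are example values only.
resp = es.search(
    index="records-structured",
    query={"bool": {"filter": [
        {"range": {"@timestamp": {"gte": "2024-01-01", "lt": "2024-01-08"}}},
        {"term": {"source.ip": "10.20.133.40"}},
        {"term": {"source.port": 56560}},
    ]}},
)
print(resp["hits"]["total"])
```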
I have another question: I read in other blogs that Elasticsearch shouldn't be used as a primary database, but rather as a bridge between Logstash and other, more reliable databases. Is that a misconception? Is it OK to keep such huge data in Elasticsearch and rely on it?
Thanks. I will explore Logstash more, so the data is well formatted for searches before sending it to Elasticsearch.
Much appreciated for such quick responses and support. I feel ELK is the solution for our current need!
There have been a lot of improvements to resiliency and scalability over the last few years, so I know of a lot of use cases that use Elasticsearch as the primary data store, especially around logging and metrics. Naturally, you should in this case treat it as a primary data store: ensure you have replica shards configured, set the cluster up properly, and make sure you regularly take backups using the snapshot API, as you would with any other primary database.
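A rough sketch of those two points, assuming the elasticsearch-py 8.x client (the repository and snapshot names are hypothetical, and in practice you would usually automate snapshots with snapshot lifecycle management):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Make sure each index keeps at least one replica copy of every shard.
es.indices.put_settings(
    index="records-*",                     # hypothetical index pattern
    settings={"index": {"number_of_replicas": 1}},
)

# Take a snapshot into a previously registered repository.
es.snapshot.create(
    repository="my_snapshot_repo",         # hypothetical repository name
    snapshot="records-2024-01-01",         # hypothetical snapshot name
    wait_for_completion=False,
)
```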
That's a big question that will take some investigation into your specific use case.
One very clear use case is to put Elastic as a read-only search layer on top of your RDBMSs, to isolate them from high-volume search loads and, quite honestly, to protect them (or to protect you from having to scale many, many RDBMSs). That's one use case.
We are seeing more and more customers/use cases where they are treating Elastic as a primary data store. There are some misconceptions about that, and there has been, and continues to be, work on the resiliency and atomicity of Elasticsearch.
One thing for sure: Elasticsearch does not natively support multi-document transactions with rollback. I would add, though, that for some of those use cases, some of the things that cause rollbacks don't exist in Elastic...
So it really boils down to what your specific use case is.
Oh, and I fixed a typo above: I meant to say searchable snapshots, which is a commercial feature. It could actually save you on overall TCO, even though the Elasticsearch subscription has a cost...