I am new to Elasticsearch and am considering using it. I have some doubts and would appreciate it if someone could help me with them.
A brief introduction of my problem statement:
I have to store and analyse a large amount of collected data and generate reports/charts to determine trends, frequency of usage for certain data, etc.
The frequency of incoming data is high, and over a period of time the stored data will grow very large. In my understanding, Elasticsearch performs aggregations at search time, which may make report generation slow once the amount of data is very large (historical data). We want an alternative to search-time aggregation in Elasticsearch (stored aggregations, like in a data warehouse) so that, in the long run, growing historical data does not slow down searches. Is this achievable with Elasticsearch, or is a data warehouse the better approach in this case?
Any thoughts will be helpful. Thanks in advance.
How slow is it? Did you test it?
How much data are you looking to collect and analyse? "Huge" is a very subjective term and does not really tell us much.
I have not tested it yet, as I am currently in the phase of choosing between Elasticsearch and a data warehouse, whichever best suits my needs. Just as an example, say I get 1000 entries from 270 data sources each day. After, say, 10 years we will have a considerable amount of data. When I search over this amount of data (which is growing day by day), will Elasticsearch do justice to report generation (quick results, given that it performs aggregations at search time over such a big data set), or should I consider a data warehouse for my problem?
Elasticsearch is designed to scale horizontally as ingest rates and total data volumes grow. It is common for even small clusters to ingest hundreds of GB per day, so the volumes you are describing sound quite small from an Elasticsearch perspective.
Thanks for the reply. The data I mentioned is just an example, not the real data. I agree that Elasticsearch can ingest hundreds of GB per day, but will it make report generation slower after a considerable period of time? Also, I want to avoid aggregations happening at the moment we ask for a report. I would prefer the data to be aggregated beforehand, so that when we generate a report filtering some data, it does not compute everything on the fly but uses pre-populated results, like dimensions in a data warehouse. Any help along this path would be appreciated.
You can query and aggregate large volumes of data with good performance in Elasticsearch. Exactly how long it takes will, however, depend on the type of data you have, the type of queries/aggregations you run, and the type and amount of hardware available. I know users who pre-aggregate data in Elasticsearch in order to make certain types of queries faster, but I also know users who just run queries against the raw data and see good performance. When you aggregate, you lose some information, which may make it harder to later ask questions you did not realise you needed to ask when you decided how to aggregate.
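To make the search-time approach concrete, here is a minimal sketch of an aggregation request in Kibana console syntax. The index name `metrics` and the fields `@timestamp` and `source` are made up for illustration; `calendar_interval` assumes a reasonably recent Elasticsearch version:

```
GET metrics/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-30d/d" } }
  },
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "day" },
      "aggs": {
        "by_source": { "terms": { "field": "source", "size": 10 } }
      }
    }
  }
}
```

Note that the `range` query restricts the aggregation to the last 30 days, so the cost of this request is bounded by the window you query, not by the total amount of historical data in the cluster.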
If you are looking at a 10 year perspective, Elasticsearch as well as hardware will most likely develop quite a lot in that time, meaning that performance characteristics we see today may no longer apply.
Is there any way in ES to store the counts for queries, so that later, when we want results involving them, we can use the stored counts and only calculate what remains, rather than recomputing everything? I am using Kibana with ES for reporting/charting and am trying to calculate some statistics for the collected data.
If you run a query, you get back a JSON document. Send it to Elasticsearch with a PUT.
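As a sketch of that pattern (again in console syntax, with made-up index names `metrics` and `daily_summaries` and made-up fields): run the aggregation periodically, then store the numbers you got back as an ordinary document that reports can read cheaply:

```
# Aggregate yesterday's raw data (hypothetical index and field)
GET metrics/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-1d/d", "lt": "now/d" } }
  },
  "aggs": {
    "total_events": { "value_count": { "field": "source" } }
  }
}

# Store the aggregated result as a pre-computed summary document
PUT daily_summaries/_doc/2018-01-01
{
  "date": "2018-01-01",
  "total_events": 270000
}
```

Kibana can then build visualisations against the `daily_summaries` index instead of the raw data, which is effectively the pre-aggregated, data-warehouse-style approach described above.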
But again I'd not try to solve problems I don't have.