ELK vs grafana+influxDB

yehosef · July 7, 2015, 6:42am

Elasticsearch can handle it fine - but how do I use all that information in Elasticsearch? I need some visualization tool. That tool is Kibana, which is not currently as powerful as Grafana (IMO). So the "problem" with Elasticsearch has nothing to do with Elasticsearch itself - it's that once I put my data into Elasticsearch, I don't have off-the-shelf tools to visualize it as well as I can with InfluxDB (because data there can be visualized with Grafana).

The beauty of Elasticsearch is that it can handle both problems very well. I can easily put in "metrics" type information or "logging" type information - or just logging information and extract metrics from it. E.g., using Grafana, I can only say "show me the graph of logins where the browser was Chrome" if I stored that metric from the beginning (e.g., metrics.login.chrome). Then, of course, if I want to see the chart of the different actions that Chrome browsers make, I would have had to store that differently (metrics.chrome.*actions). With Elasticsearch, I can just store my events, then build visualizations and ask questions afterwards. I think that is a huge difference and why we are building our system on Elasticsearch.
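A minimal sketch of that difference, in plain Python rather than against a real store. The event fields and the `count_by` helper are made up for illustration; the point is only that raw events can answer a question that was not planned when the data was written.

```python
from collections import Counter

# Hypothetical raw events, stored as-is (field names are invented).
events = [
    {"action": "login", "browser": "chrome"},
    {"action": "login", "browser": "firefox"},
    {"action": "search", "browser": "chrome"},
    {"action": "login", "browser": "chrome"},
]


def count_by(events, field, **filters):
    """Count values of `field` among events matching all `filters`."""
    matching = (
        e for e in events
        if all(e.get(k) == v for k, v in filters.items())
    )
    return Counter(e[field] for e in matching)


# Question 1: logins per browser (what a metrics.login.chrome counter would hold).
logins_by_browser = count_by(events, "browser", action="login")

# Question 2, decided later: which actions do Chrome browsers perform?
chrome_actions = count_by(events, "action", browser="chrome")
```

With pre-named counters, the second question would have needed its own key hierarchy from day one.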

For simple metrics that have no real properties (e.g., many OS-level properties - system load, file system usage) you don't need this extra power, so it's much more valuable to put the measurements in a tool that gives you more power in visualization (e.g., Grafana). But even here, I think there is much room for richer logging and storing in ES. E.g., for each measurement time frame, put all the different metrics in one record (e.g., {time:1436251626,load:3,disk_usage:"23GB",swap:"12MB",pages_in:20,processes:[...]}). You can still build your metrics the same as usual (with the exception that you need to use Kibana instead of Grafana) but you also have much more power: "Show me the top 5 processes when the load is over 2". Instead of seeing that there is a load problem and then trying to dig into which process is causing it, you can ask the question directly.
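To make the "top 5 processes when the load is over 2" question concrete, here is a rough in-memory sketch over records of the shape described above. The process names, values, and the `top_processes` helper are invented; "top" is taken to mean "seen most often", which is an assumption. In Elasticsearch the same idea would roughly be a range filter on `load` combined with a terms aggregation on `processes`.

```python
from collections import Counter

# Hypothetical per-interval records; names and values are made up.
records = [
    {"time": 1436251626, "load": 3.0, "processes": ["java", "nginx", "cron"]},
    {"time": 1436251686, "load": 1.2, "processes": ["nginx", "cron"]},
    {"time": 1436251746, "load": 2.5, "processes": ["java", "nginx"]},
]


def top_processes(records, min_load, n=5):
    """Return the n processes seen most often while load exceeded min_load."""
    counts = Counter()
    for record in records:
        if record["load"] > min_load:
            counts.update(record["processes"])
    return counts.most_common(n)


busy = top_processes(records, min_load=2)
```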

It's important to point out that there is nothing inherent in Kibana that prevents it from being more powerful. As Clinton pointed out, Grafana is a fork of Kibana 3. I assume that Kibana 4 was a major technological shift and the features are still catching up. I hope that process will be fast and we'll have a visualization tool that's as flexible, powerful and amazing as the storage engine powering it.

yehosef · July 7, 2015, 6:54am

Also FYI - other people would also like to use Grafana with ES - https://github.com/grafana/grafana/issues/1034
Looks like there are people working on a connector, and there is an ES-Graphite shim at https://github.com/distributed-system-analysis/es-graphite-shim

The problem is that Grafana doesn't support more analytical, less metric-y visualizations. At least you could have the data in one source, but you'd still have to build two visualization dashboards. Hopefully that problem will go away as Kibana matures.

warkolm · July 7, 2015, 10:08pm

Ok, that makes sense

junior_h · July 11, 2015, 11:44pm

Currently, I use both of them in production: Grafana for metric data (keys with numerical values) like CPU, memory, I/O utilization, application status, etc., while Elasticsearch is used to store application logs like syslog, nginx logs, even Elasticsearch logs, etc. Generally, these contain text and numeric values that need to be analyzed.

For visualization features like graphs and dashboards, I like the graphs in Grafana much more than those in Kibana. Some features of Grafana graphs that improve my navigation experience:

Tooltip showing all values at the same X crosshair
Auto-scaling of the Y-axis when selecting a particular series
Multiple Y-axes to compare in detail two or more series with very different values
Annotations on graphs
Tags on dashboards to facilitate grouping and searching across many dashboards

I hope the next release of Kibana provides powerful visualization features like those in Grafana.

felixbarny · August 19, 2015, 4:44pm

@elvarb I did some measurements and I can't confirm that Elasticsearch uses much more space than InfluxDB. In fact, if you optimize the index (this should only be done on indices that are no longer receiving updates), the space consumed by Elasticsearch is even way lower.

I've sent a week's worth of datapoints (1 report/minute) to both InfluxDB and Elasticsearch.

Results first:
Elasticsearch after optimize: 425.8M
InfluxDB: 965.2M
Elasticsearch without optimize: 2.3G

Data sent
Each second, 1000 Timers (16 datapoints), 2000 Meters (5 datapoints) and 3000 Gauges were reported. The test included 1008 reports, which is 23,184,000 datapoints in total.

The distribution of the different metric types (timers/meters/gauges) is derived from a real-world monitoring setup with stagemonitor which had about 100 timers (16 datapoints/timer), 200 meters (5 datapoints/meter) and 300 gauges (1 datapoint/gauge) = 2300 datapoints. If you set the stagemonitor reporting interval to 1 minute you get 23,184,000 datapoints per week (2300 * 60 * 24 * 7).
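For comparing the sizes in the results on a per-datapoint basis, a small calculation helps. This is only a sketch: it takes the 23,184,000 total exactly as stated in the post and assumes the "M" and "G" in the results mean MiB and GiB, which the post does not specify.

```python
MIB = 1024 * 1024  # assumption: "M" in the results means mebibytes

TOTAL_DATAPOINTS = 23_184_000  # total as stated in the post


def bytes_per_datapoint(size_mib, datapoints=TOTAL_DATAPOINTS):
    """Average on-disk bytes per datapoint for a store of size_mib MiB."""
    return size_mib * MIB / datapoints


es_optimized = bytes_per_datapoint(425.8)
influxdb = bytes_per_datapoint(965.2)
es_unoptimized = bytes_per_datapoint(2.3 * 1024)  # 2.3G expressed in MiB

for name, value in [
    ("Elasticsearch (optimized)", es_optimized),
    ("InfluxDB", influxdb),
    ("Elasticsearch (not optimized)", es_unoptimized),
]:
    print(f"{name}: {value:.1f} bytes/datapoint")
```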

Example Timer:
response_time,application=Metrics\ Store\ Benchmark,host=N51,instance=instance count=1,m1_rate=3.0,m5_rate=4.0,m15_rate=5.0,mean_rate=2.0,min=4.0,max=2.0,mean=4.0,median=6.0,std=5.0,p25=0.0,p75=7.0,p95=8.0,p98=9.0,p99=10.0,p999=11.0
Example Meter:
meter,application=Metrics\ Store\ Benchmark,host=N51,instance=instance count=10,m1_rate=3.0,m5_rate=4.0,m15_rate=5.0,mean_rate=2.0
Example Gauge:
cpu_usage,application=Metrics\ Store\ Benchmark,host=N51,instance=instance value=1.0e-8
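The three examples above are in InfluxDB's line protocol (measurement, comma-separated tags, a space, then comma-separated fields). A rough sketch of how such a line could be assembled, with `escape_tag` and `format_line` as hypothetical helper names; it only handles numeric fields and does not escape the measurement name. Later InfluxDB versions expect a trailing `i` on integer fields, which this sketch leaves out to match the examples as posted.

```python
def escape_tag(value):
    """Backslash-escape the characters that delimit tags: comma, equals, space."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_line(measurement, tags, fields):
    """Build one line: measurement,tag=value,... field=value,..."""
    tag_part = ",".join(
        f"{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    # Numeric fields only; string fields would need quoting.
    field_part = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part}"


line = format_line(
    "meter",
    {"application": "Metrics Store Benchmark", "host": "N51", "instance": "instance"},
    {"count": 10, "m1_rate": 3.0, "m5_rate": 4.0, "m15_rate": 5.0, "mean_rate": 2.0},
)
```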

I've constantly updated the metrics with random values.

Elasticsearch settings
I've used the default settings and this index template: https://github.com/stagemonitor/stagemonitor/blob/influxdb/stagemonitor-core/src/main/resources/stagemonitor-elasticsearch-metrics-index-template.json

Benchmark
This is the benchmark: https://github.com/stagemonitor/stagemonitor/blob/influxdb/stagemonitor-core/src/test/java/org/stagemonitor/core/metrics/metrics2/MetricsStoreBenchmark.java
To execute it yourself, start an Elasticsearch and InfluxDB server and adjust the URLs.

Versions
Elasticsearch: 1.7.1
InfluxDB: 0.9.2.1

Conclusion
If you're using an optimized mapping and optimize your indices, Elasticsearch is more storage efficient. I did not benchmark reading the data or how scalable the databases are, but the CERN paper seems to indicate that Elasticsearch is also better in this area.

So what is the advantage of InfluxDB then? As far as I can see it is only the better visualisation support -> Grafana (although Kibana 4 is nice as well) and things like continuous queries that you could use to downsample old data to reduce storage space.
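For readers unfamiliar with downsampling, the idea is to replace old high-resolution points with one aggregate per larger time bucket. The following is only an illustration of that idea in plain Python, not InfluxDB's continuous query syntax; the `downsample` helper and the sample data are made up.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) pairs into fixed-width time buckets.

    Returns a sorted list of (bucket_start, mean_value) pairs.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical one-minute samples, rolled up into five-minute means.
raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 4.0), (360, 6.0)]
rolled_up = downsample(raw, bucket_seconds=300)
```

Five raw points become two, at the cost of losing the per-minute detail.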

But Elasticsearch seems to have better support for sophisticated queries/functions through aggregations (moving averages and Holt-Winters are in the pipeline for 2.0!). The biggest advantage I see is that it is much more mature and scalable. InfluxDB's clustering API is currently in alpha state and you shouldn't form a cluster with more than 3 nodes. I doubt that they will ever catch up with Elasticsearch in this area. There is just so much more money and manpower behind Elastic. The relatively new project Beats by Elastic indicates that they will be investing in the timeseries/metrics area. I really hope that Grafana will at some point support Elasticsearch. Otherwise they might lose users to Kibana who can't use Elasticsearch with it.

elvarb · August 19, 2015, 5:30pm

That could very well be the flaw in my research: I forgot to optimize.

If Elasticsearch uses that much less space than InfluxDB, the compression of old data feature in Elasticsearch 2 definitely makes Elasticsearch a very good platform for metrics.

What it lacks are better visualization tools and some way to downsample old data.

For projects like Bosun that support various data sources, being able to focus on only one is a huge plus.

Would be interested to get input from the guys at Stack Exchange (creators of Bosun) and Netflix (creators of Atlas). Both tools are built around a metric solution and they definitely did their own research into what metric platform was best for them.

warkolm · August 20, 2015, 1:02am

Awesome stuff @Felix_4

yehosef · September 3, 2015, 3:37pm

For all those interested, the next release of Grafana will support Elasticsearch.

renewelches · February 6, 2016, 5:56pm

I am wondering if anyone has made a comparison of write performance between InfluxDB and ES?

We are currently storing metrics and log data in ES + Kibana. Our problem is that we are hitting a write limit of 30k entries per second to ES, which is not sufficient for the amount of data flowing in.

Due to the amount of metrics and logs we are storing, we are considering splitting off metrics (InfluxDB+Grafana) and using ES+Kibana solely for storing logs.

Any thoughts on write throughput of ES vs. InfluxDB, e.g. something like what @Felix_4 has posted?

felixbarny · February 6, 2016, 9:09pm

Have you tried clustering Elasticsearch?

yehosef · February 7, 2016, 8:59am

@Felix_4 - have you tried InfluxDB 0.10?
https://influxdata.com/blog/announcing-influxdb-v0-10-100000s-writes-per-second-better-compression/ sounds promising. I'd be very interested in seeing the tests you ran before with the new version.

felixbarny · February 7, 2016, 9:16am

Indeed sounds very promising. Did not try yet though.

Clustering is still marked as experimental

I'll probably try again when they've solved clustering.

yehosef · February 7, 2016, 9:30am

Obviously your choice, but a few points to consider.

If I can now ingest 300k/s on a single machine with Influx and 30k/s on a single ES node, then if I need 100k/s I'll need a 3-4 node ES cluster but only one Influx machine. So I won't need clustering for Influx (except for failover, but that's often a much simpler problem). Also - clustering will let you scale from 300k/s per machine to 1M/s for a 4-machine cluster - but the per-node performance is still relevant and comparable.
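The sizing argument is a back-of-the-envelope division. As a sketch, assuming ingest throughput scales linearly with node count (which real clusters rarely achieve exactly) and using the per-node figures quoted in this thread:

```python
import math


def nodes_needed(target_per_sec, per_node_per_sec):
    """Nodes required if ingest throughput scaled linearly with node count."""
    return math.ceil(target_per_sec / per_node_per_sec)


es_nodes = nodes_needed(100_000, 30_000)       # 30k/s per ES node
influx_nodes = nodes_needed(100_000, 300_000)  # 300k/s per Influx machine
```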

Also, your test didn't seem to include anything about clustering - to test database size and throughput, a cluster should not be needed.

renewelches · February 7, 2016, 9:54am

We have 12 data nodes, 3 master nodes and 3 client nodes, so yes, we are clustering ES.
Each data node has a 100GB disk and 32GB heap. So we are already talking about a fairly large setup.
Our cluster settings are:

number_of_shards: 7
number_of_replicas: 1
index_refresh_interval: 30s
mlockall: true
discovery_minimum_master: 2
discovery_ping_timeout: 60s
gateway_expected_nodes: 18
index_translog_flush_threshold_size: 1g
indices_memory_index_buffer_size: 50%
threadpool_bulk_queue_size: 3000
indices_store_throttle_max_bytes_per_sec: 20mb
indices_cluster_send_refresh_mapping: false
script.disable_dynamic: true

yehosef · February 7, 2016, 10:13am

Seems like a reasonable configuration - are you using SSDs or spinning disks?

You're doing bulk inserts I assume. Have you played with the size of the payload?
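For anyone wanting to experiment with payload size, here is a rough sketch that only builds bulk request bodies in batches and does not send anything. The bulk format is newline-delimited JSON: an action line followed by the document. The index name and the `bulk_payloads` helper are hypothetical, and Elasticsearch versions of this era also expect a document type (in the URL or as `_type` in the action line), which is left out here.

```python
import json


def bulk_payloads(docs, index, batch_size):
    """Yield newline-delimited bulk request bodies of at most batch_size docs."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            lines.append(json.dumps({"index": {"_index": index}}))  # action line
            lines.append(json.dumps(doc))                           # document source
        yield "\n".join(lines) + "\n"  # the body must end with a newline


docs = [{"load": i} for i in range(5)]
payloads = list(bulk_payloads(docs, index="metrics-demo", batch_size=2))
```

Varying `batch_size` and measuring throughput for each value is the usual way to find the sweet spot for a given cluster.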

You might want to decrease the heap to 31GB just to make sure you don't accidentally hit the 32GB ceiling and fly into uncompressed pointer-land.

We have a 3-machine cluster, each with 32GB RAM (16GB heap) and around 1TB SSD and we can get at least 30k for bulk inserts so I definitely would have expected more with more data nodes.

I would be interested in hearing more stories about benchmarking and tuning ingestion performance - it's also a question I have about whether it's possible to get more out of the same hardware with config tweaking.

warkolm · February 9, 2016, 1:03am

If you want to chat about performance on your cluster then it might be better to start another thread and then link us to it. That way we can provide specific assistance while still retaining the flow (and feel) of this thread

felixbarny · February 14, 2016, 11:15am

Some real-world data from a stagemonitor installation on 88 Tomcats: >320M datapoints and 9.11GB per day

azhukov · February 25, 2017, 5:34pm

Hello everybody.

I didn't see any links to these benchmarks in this topic, so I'll post them here. They were made by InfluxData staff, but they are something we should take into account. The test methodology is open and can be reproduced by anyone.

May 12, 2016 InfluxDB v0.13.0 vs Elasticsearch v5.0.0-alpha1
https://www.influxdata.com/influxdb-markedly-elasticsearch-in-time-series-data-metrics-benchmark/

Conclusion: InfluxDB outperformed Elasticsearch in all three tests with 8x greater write throughput, while using 4x less disk space when compared against Elastic’s time series optimized configuration, and delivering 3.5x to 7.5x faster response times for tested queries.

September, 2016 InfluxDB v1.0.0 vs Elasticsearch v5.0.0-alpha5 https://www.influxdata.com/resources/benchmarking-influxdb-vs-elasticsearch-for-time-series/

It's a similar test with a similar result:
InfluxDB outperformed Elasticsearch by 8x when it came to data ingestion
InfluxDB outperformed Elasticsearch by delivering 4x and 16x better compression
InfluxDB outperformed Elasticsearch by 4x to 10x when measuring query performance

Such InfluxDB superiority is really impressive.

The big difference between Elasticsearch and InfluxDB is the HA solution each provides and which functionality we have to pay for.

You can create a full ES cluster without any functional restriction for free. InfluxData forces you to buy a subscription if you want to use InfluxEnterprise (shard clustering). Otherwise, without paying, only one InfluxDB instance can work with the same metadata (for duplicating data to another server we're using the free Influx Relay).

But with an 8x higher data ingestion rate, do we really need InfluxDB clustering? It seems that if you exceed the capacity of a single InfluxDB instance, you can create another one somewhere else and redirect part of the payload to the new instance. A single instance can manage < 250 thousand writes per second, which is quite a number! https://docs.influxdata.com/influxdb/v1.2/guides/hardware_sizing/
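One way to "redirect part of the payload" is to route each measurement to an instance by a stable hash of its name. This is a sketch of that idea only; the endpoints and the `instance_for` helper are hypothetical, and nothing here is an InfluxDB feature.

```python
import zlib

# Hypothetical endpoints; replace with your own instances.
INSTANCES = ["http://influx-a:8086", "http://influx-b:8086"]


def instance_for(measurement, instances=INSTANCES):
    """Pick an instance by a stable hash of the measurement name."""
    return instances[zlib.crc32(measurement.encode("utf-8")) % len(instances)]


routes = {
    name: instance_for(name)
    for name in ["cpu_usage", "response_time", "meter"]
}
```

The trade-offs are that a query cannot combine measurements living on different instances, and changing the number of instances moves measurements around.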

Two cents about clustering:

ES is pretty mature: all tasks are done automatically, and you don't need to do anything if you increase the replica count or after one of the nodes has been down. With the free Monitoring feature we can always know what's going on in the ES cluster.

InfluxEnterprise looks like a junior by comparison: after increasing the replica count and adding a new node, we have to rebalance the cluster manually, copying old shards to the new node from the command line. That is inconvenient when we have thousands of shards, and it invites human error. So I can't recommend using it today; it definitely needs polishing. The cool dashboard with cluster metrics also costs money; in the free version we have the _internal database, but there's no documentation for it.

My conclusion: in general, pure time-series data should be in InfluxDB (or another TSDB, such as ATSD) and searchable text data in Elasticsearch. God bless Grafana: we can draw graphs from InfluxDB / ES / etc. on the same dashboard.

But it depends on your environment. You can use Elastic as a TSDB, but it's only a matter of time before you run into ES performance limits. To each his own. Using a full-text search engine just for time series may be overkill.

Metrics are becoming a wide market with a lot of players. ES was built for full-text search but is trying to compete with the others; see all of the Beats utilities.

sudeepkumarnair · March 30, 2017, 12:58am

Our benchmarking exercise also suggested that InfluxDB is not yet there for high-cardinality support.