ELK vs grafana+influxDB

(Clinton Gormley) #15

You are correct. And optimizing the index mapping is definitely the way to go. We ship with defaults that try to make things work out of the box for the new user, but spending some time understanding what you're indexing, how it is indexed, and whether or not you need it would be time well spent.

For the pure metrics use case, disabling _all and _source will save you a significant amount of space, but with the disadvantages that Mark pointed out above: not being able to query a catch-all field, and not being able to reindex your documents. As you've said, these things are not important for the metrics use case.

(Note: as an alternative to disabling _source, 2.0 allows you to choose between faster and better compression, which can be updated on the fly.)

Doc values by default is also the right way to go. In fact, it is the default for all fields in 2.0 (except analyzed strings, which are not supported).

The default logstash template adds an analyzed and a not_analyzed (raw) version of every string field, because we can't deduce up front which strings should be treated as keywords and which should be searchable as full-text. Again, this makes things just work out of the box, but it is not optimal.

Choosing the type of string field that you want up front is an easy optimization to make, as long as you know what your documents look like in advance. The two approaches can be combined: add specific mappings for the fields you know about, and rely on dynamic mappings for any fields you introduce later.
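As a rough sketch of what such a metrics-oriented index template could look like in Elasticsearch 1.x/2.x syntax, the Python snippet below builds the template body as a plain dict. The index pattern `metrics-*`, the function name, the `keep_source` switch and the dynamic-template name are made up for illustration; the `best_compression` codec only exists in 2.0 and later.

```python
import json


def metrics_index_template(pattern="metrics-*", keep_source=False):
    """Build an index template body for a pure-metrics index (ES 1.x/2.x syntax).

    keep_source=False disables _source to save space (no reindexing possible).
    keep_source=True keeps _source and asks for the 2.0+ best_compression codec.
    """
    settings = {}
    if keep_source:
        # 2.0+ only: smaller stored fields at the cost of some speed
        settings["index.codec"] = "best_compression"
    return {
        "template": pattern,  # hypothetical index name pattern
        "settings": settings,
        "mappings": {
            "_default_": {
                # no catch-all field for metrics documents
                "_all": {"enabled": False},
                "_source": {"enabled": keep_source},
                "dynamic_templates": [
                    {
                        # treat every new string field as a keyword, not full text
                        "strings_as_keywords": {
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                                "doc_values": True,
                            },
                        }
                    }
                ],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(metrics_index_template(), indent=2))
```

Fields you already know about can be given explicit mappings next to the dynamic template, which is the combined approach described above.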

Couldn't have put it better myself :slight_smile:

++

I think pipeline aggs are going to transform the types of analytics you can do in Elasticsearch. For those of you not familiar with pipeline aggs: they add the ability to aggregate on the results of other aggregations. For instance, you can:

  • generate a date histogram of the max total new visitors per day, then pipe that into a derivative to see how many were added each day, then pipe that into another derivative to see the growth rate of your user base.
  • use moving averages to smooth your data so that you can see general trends instead of noisy data
  • use moving averages to calculate the 30/60 day average, and use Holt Winters to predict your future 30/60 day averages
  • use bucket scripts to produce a new metric based on one or more other series, eg to calculate the percentage of sessions which performed a particular action
  • use serial differencing to remove seasonal or weekly trends
  • etc...
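To make the first two bullets a bit more concrete, here is a minimal sketch of a 2.0-style request body that chains a `max` metric into two `derivative` aggregations and a `moving_avg`. The field names `@timestamp` and `total_visitors`, the aggregation names and the 7-bucket window are assumptions for illustration. The small `derivative` helper is not part of Elasticsearch; it only mimics locally what the pipeline step computes, on made-up numbers.

```python
def visitor_growth_aggs(time_field="@timestamp", total_field="total_visitors"):
    """Search body: daily max -> first derivative -> second derivative."""
    return {
        "size": 0,  # only the aggregation results are of interest
        "aggs": {
            "per_day": {
                "date_histogram": {"field": time_field, "interval": "day"},
                "aggs": {
                    "max_total": {"max": {"field": total_field}},
                    # visitors added per day
                    "added_per_day": {"derivative": {"buckets_path": "max_total"}},
                    # change in the daily additions, i.e. growth rate
                    "growth_rate": {"derivative": {"buckets_path": "added_per_day"}},
                    # smoothed running total over a 7-bucket window
                    "smoothed_total": {
                        "moving_avg": {"buckets_path": "max_total", "window": 7}
                    },
                },
            }
        },
    }


def derivative(values):
    """Local illustration: difference between consecutive buckets."""
    if not values:
        return []
    out = [None]  # the first bucket has no predecessor
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


daily_totals = [100, 130, 170, 230]  # made-up sample data
added = derivative(daily_totals)
growth = derivative(added)
```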

We've focused on the most important pipeline aggs for now, but we'd rather get real user feedback about what is missing than just implement a bunch of fancy stuff which may not be useful in the wild.

I haven't followed Grafana, which was originally a fork of Kibana 3. What features have they added which we should be adding to Kibana?

(Elvar) #16

The example data you provided is an example of data that should belong in ELK.

(Elvar) #17

I haven't followed Grafana, which was originally a fork of Kibana 3. What features have they added which we should be adding to Kibana?

http://play.grafana.org/

Have fun :smile:

(Yehosef) #18

You can look at their features and playground to see the differences. I have not spent much time looking at the differences personally because it's irrelevant for us now. Grafana doesn't support elasticsearch and we've decided to build our analytics on elasticsearch. But we're currently planning on building our own visualization layer. I'd be happy to share it with you and your team - maybe you can start to move kibana in this direction and we'll all benefit.

(Yehosef) #19

Even though I said we're not looking at grafana - I decided to spend some time with it because we're planning on building our own visualization solution for our problems. And wow, grafana is very nice.

I can see why people say logs to elasticsearch and metrics to influx/graphite - right tool for the right job. I didn't understand it before but now I do.

But it's not exactly how it sounds. The right tool for the right job doesn't have to do with the database/store - but the visualization. Metrics should go to grafana (currently) - it is much more sophisticated for viewing metric type of data. Logging should go to Kibana - grafana wouldn't have any interface for showing or exploring that data.

But there is really no reason I can see on the backend why metrics should go to influx over elasticsearch (once you know how to optimize indices). The issue is that once you put metrics in elasticsearch, you can't analyze them or create dashboards as well as you can with influx.

Hopefully kibana becomes more flexible and powerful in upcoming releases - elasticsearch deserves it :smile:

(Mark Walkom) #20

This has me confused: before you were saying ES can do this, and now you say it cannot?

(Yehosef) #21

Elasticsearch can handle it fine - but how do I use all that information in elasticsearch? I need some visualization tool. That tool is Kibana, which is not currently as powerful as Grafana (IMO). So the "problem" with elasticsearch has nothing to do with elasticsearch - it's that once I put my data into elasticsearch, I don't have off-the-shelf tools to visualize it as well as I can with InfluxDB (because data there can be visualized with Grafana)

The beauty of elasticsearch is that it can really handle both problems very well. I can easily put in "metrics" type information or "logging" type information - or just logging information and extract metrics. EG, using Grafana, I can only say "show me the graph of logins where the browser was chrome" if I stored that metric from the beginning (eg, metrics.login.chrome). Then of course, if I want to see the chart of the different actions that chrome browsers make, I would have had to store that differently (metrics.chrome.*actions). With elasticsearch, I can just store my events, then build visualizations and ask questions afterwards. I think that is a huge difference and why we are building our system on elasticsearch.

For simple metrics that have no real properties (eg many os level properties - system load, file system usage) you don't need this extra power, so it's much more valuable to put the measurements in a tool that gives you more power in visualization (eg grafana). But even here, I think there is much room for making richer logging and storing in ES. EG, for each measurement time frame, put all the different metrics in one record (eg {time:1436251626,load:3,disk_usage:"23GB",swap:"12MB",pages_in:20,processes:[...]}). You can still build your metrics the same as usual (with the exception that you need to use kibana instead of grafana) but you also have much more power: "Show me the top 5 processes when the load is over 2". Instead of seeing that there is a load problem and then trying to dig into which process is causing the problem, you can ask the question directly.
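A question like "top 5 processes when the load is over 2" could be sketched as a range filter plus a terms aggregation. The snippet below only builds the request body. The field names `load` and `processes.name` are assumptions (the record above does not show what a process entry looks like), and the `constant_score` wrapper is simply one way to express a filter that should work on both 1.x and 2.x.

```python
def top_processes_under_load(load_threshold=2, top_n=5,
                             load_field="load",
                             process_field="processes.name"):
    """Search body: keep samples with load above the threshold, then count
    which process names occur most often in those samples."""
    return {
        "size": 0,  # aggregation only, no hits needed
        "query": {
            "constant_score": {
                "filter": {"range": {load_field: {"gt": load_threshold}}}
            }
        },
        "aggs": {
            "top_processes": {
                # hypothetical field holding each process name as a keyword
                "terms": {"field": process_field, "size": top_n}
            }
        },
    }


body = top_processes_under_load()
```

Note that a terms aggregation counts how many high-load samples each process name appears in; ranking by CPU or memory per process would need a different document layout (for instance nested documents).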

It's important to point out, that there is nothing inherent in kibana that makes it so it can't be more powerful. As Clinton pointed out, Grafana is a fork of kibana 3. I assume that kibana 4 was a major technological shift and the features are still catching up. I hope that process will be fast and we'll have a visualization tool that's as flexible, powerful and amazing as the storage engine powering it.

(Yehosef) #22

Also FYI - other people would also like to use grafana with ES - https://github.com/grafana/grafana/issues/1034
It looks like there are people working on a connector, and there is an ES-graphite shim at https://github.com/distributed-system-analysis/es-graphite-shim

The problem is that grafana doesn't support more analytical, less metric-y visualizations. At least you could have the data in one source, but you'd still have to build two visualization dashboards. Hopefully that problem will go away as kibana matures.

(Mark Walkom) #23

Ok, that makes sense :slight_smile:

#24

Currently, I use both of them in production: Grafana for storing metric data (a key with a numerical value) like CPU, memory, I/O utilization, application status, etc., while Elasticsearch is used to store application logs like syslog, nginx logs, even elasticsearch logs, etc. Generally, those contain text and numeric values that need to be analyzed.

For visualization features like graphs and dashboards, I like the graphs in Grafana much more than those in Kibana. Some features of Grafana graphs that improve my experience when navigating:

  • Tooltips that show all values on the same X crosshair
  • Auto-scaling of the Y axis when a particular series is selected
  • Multiple Y axes to compare in detail two or more series with very different value ranges
  • Annotations on graphs
  • Tags on dashboards to make grouping and searching many dashboards easier

I hope the next release of Kibana provides powerful visualization features like those in Grafana.

(Felix Barnsteiner) #25

@elvarb I did some measurements and I can't confirm that Elasticsearch uses much more space than InfluxDB. In fact, if you optimize the index (this should only be done on indices that are not receiving updates anymore), the space consumed by Elasticsearch is even way lower.
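For reference, here is a minimal sketch of what triggering an optimize could look like against the 1.x REST API. It only builds the request and does not send it. The host, the index name and the choice of `max_num_segments=1` are assumptions, not details taken from the benchmark; in 2.1 and later the endpoint was renamed to `_forcemerge`.

```python
from urllib.request import Request


def build_optimize_request(base_url, index, max_num_segments=1):
    """Build (but do not send) a POST to the ES 1.x _optimize endpoint."""
    url = "{}/{}/_optimize?max_num_segments={}".format(
        base_url.rstrip("/"), index, max_num_segments
    )
    return Request(url, method="POST")


# hypothetical host and index name
req = build_optimize_request("http://localhost:9200", "metrics-2015.08.01")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually run the merge, which is I/O heavy, so it is best kept to indices that are no longer written to.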

I've sent a week worth of datapoints (1 report/minute) to both InfluxDB and Elasticsearch.

Results first:
Elasticsearch after optimize: 425.8M
InfluxDB: 965.2M
Elasticsearch without optimize: 2.3G

Data sent
Each second, 1000 timers (16 datapoints each), 2000 meters (5 datapoints each) and 3000 gauges were reported. The test included 1008 reports, which is 23,184,000 datapoints in total.

The distribution of the different metric types (timers/meters/gauges) is derived from a real-world monitoring setup with stagemonitor which had about 100 timers (16 datapoints/timer), 200 meters (5 datapoints/meter) and 300 gauges (1 datapoint/gauge) = 2300 datapoints. If you set the stagemonitor reporting interval to 1 minute you get 23,184,000 datapoints per week (2300 * 60 * 24 * 7).
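The arithmetic can be reproduced with a few lines of Python. One thing to be aware of: the per-type counts as listed (100 timers, 200 meters, 300 gauges) add up to 2,900 datapoints per report, whereas the weekly total quoted in the post corresponds to 2,300 per report. The snippet just computes both figures and does not try to guess which input is off; the variable names are made up.

```python
def datapoints_per_report(timers, meters, gauges):
    """16 datapoints per timer, 5 per meter, 1 per gauge (as in the examples)."""
    return timers * 16 + meters * 5 + gauges * 1


REPORTS_PER_WEEK = 60 * 24 * 7  # one report per minute

per_report_from_breakdown = datapoints_per_report(100, 200, 300)
weekly_from_breakdown = per_report_from_breakdown * REPORTS_PER_WEEK
weekly_as_quoted = 2300 * REPORTS_PER_WEEK

# storage ratio from the reported sizes (InfluxDB vs optimized Elasticsearch)
size_ratio = 965.2 / 425.8

print(per_report_from_breakdown, weekly_from_breakdown, weekly_as_quoted)
print(round(size_ratio, 2))
```

Either way, the relative comparison is unaffected, since both stores received the same reports.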

Example Timer:
response_time,application=Metrics\ Store\ Benchmark,host=N51,instance=instance count=1,m1_rate=3.0,m5_rate=4.0,m15_rate=5.0,mean_rate=2.0,min=4.0,max=2.0,mean=4.0,median=6.0,std=5.0,p25=0.0,p75=7.0,p95=8.0,p98=9.0,p99=10.0,p999=11.0
Example Meter:
meter,application=Metrics\ Store\ Benchmark,host=N51,instance=instance count=10,m1_rate=3.0,m5_rate=4.0,m15_rate=5.0,mean_rate=2.0
Example Gauge:
cpu_usage,application=Metrics\ Store\ Benchmark,host=N51,instance=instance value=1.0e-8

I've constantly updated the metrics with random values.

Elasticsearch settings
I've used the default settings and this index template: https://github.com/stagemonitor/stagemonitor/blob/influxdb/stagemonitor-core/src/main/resources/stagemonitor-elasticsearch-metrics-index-template.json

Benchmark
This is the benchmark: https://github.com/stagemonitor/stagemonitor/blob/influxdb/stagemonitor-core/src/test/java/org/stagemonitor/core/metrics/metrics2/MetricsStoreBenchmark.java
To execute it yourself, start an Elasticsearch and InfluxDB server and adjust the URLs.

Versions
Elasticsearch: 1.7.1
InfluxDB: 0.9.2.1

Conclusion
If you're using an optimized mapping and optimize your indices, Elasticsearch is more storage efficient. I did not benchmark reading the data or how scalable the databases are, but the CERN paper seems to indicate that Elasticsearch is also better in this area.

So what is the advantage of InfluxDB then? As far as I can see it is only the better visualisation support -> Grafana (although Kibana 4 is nice as well) and things like continuous queries that you could use to downsample old data to reduce storage space.

But Elasticsearch seems to have better support for sophisticated queries/functions through aggregations (moving averages and Holt-Winters are in the pipeline for 2.0!). The biggest advantage I see is that it is much more mature and scalable. InfluxDB's clustering API is currently in alpha state and you shouldn't form a cluster with more than 3 nodes. I doubt that they will ever catch up with Elasticsearch in this area. There is just so much more money and manpower behind Elastic. The relatively new project Beats by Elastic indicates that they will be investing in the timeseries/metrics area. I really hope that Grafana will at some point support Elasticsearch. Otherwise they might lose users who can't use Elasticsearch with it to Kibana.

(Elvar) #26

That could very well be the flaw in my research, forgot to optimize.

If elasticsearch uses that much less space than influxdb, the compression of old data in elasticsearch 2 definitely makes elasticsearch a very good platform for metrics.

What it lacks are better visualization tools and some way to downsample old data.

For projects like Bosun that support various data sources, being able to focus on only one is a huge plus.

Would be interested to get input from the guys at Stack Exchange (creators of Bosun) and Netflix (creators of Atlas). Both tools are built around a metric solution and they definitely did their own research into what metric platform was best for them.

(Mark Walkom) #27

Awesome stuff @Felix_4 :smiley:

(Yehosef) #28

For all those interested, the next release of Grafana will support Elasticsearch.

(Renewelches) #29

I am wondering if anyone made a comparison of write performance between InfluxDB and ES?

We are currently storing metrics and log data in ES + Kibana. Our problem is that we are hitting a write limit of 30k entries per second to ES, which is not sufficient for the amount of data flowing in.

Due to the amount of metrics and logs we are storing, we are considering splitting off metrics (InfluxDB+Grafana) and using ES+Kibana solely for storing logs.

Any thoughts on write throughput of ES vs. InfluxDB, e.g something like @Felix_4 has posted?

(Felix Barnsteiner) #30

Have you tried clustering Elasticsearch?

(Yehosef) #31

@Felix_4 - have you tried InfluxDB 0.10?
https://influxdata.com/blog/announcing-influxdb-v0-10-100000s-writes-per-second-better-compression/ sounds promising. I'd be very interested in seeing the tests you ran before with the new version.

(Felix Barnsteiner) #32

Indeed sounds very promising. Did not try yet though.

Clustering is still marked as experimental

I'll probably try again when they've solved clustering.

(Yehosef) #33

Obviously your choice, but a few points to consider.

If I can now ingest 300k/s on a single machine with influx and 30k/s on a single ES node, then if I need 100k/s I'll need a 3-4 node ES cluster but only one influx machine. So I won't need clustering for influx (except for failover, but that's often a much simpler problem). Also - clustering will let you scale from 300k/s per machine to 1M/s for a 4 machine cluster - but the per-node performance is still relevant and comparable.
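As a back-of-the-envelope sketch of that sizing argument, assuming throughput scales linearly with node count and ignoring replicas and headroom (both simplifications; the function name is made up):

```python
import math


def nodes_needed(target_writes_per_sec, per_node_writes_per_sec):
    """Smallest node count whose combined throughput meets the target,
    assuming perfectly linear scaling."""
    return math.ceil(target_writes_per_sec / per_node_writes_per_sec)


es_nodes = nodes_needed(100_000, 30_000)       # 30k/s per ES node
influx_nodes = nodes_needed(100_000, 300_000)  # 300k/s per influx machine
print(es_nodes, influx_nodes)
```

In practice replicas, shard layout and document size all shift the per-node number, so this is only a first estimate.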

Also, your test didn't seem to include anything about clustering - to test database size and throughput, a cluster should not be needed.

(Renewelches) #34

We have 12 data nodes, 3 master nodes and 3 client nodes, so yes, we are clustering ES.
Each data node has a 100GB disk and 32GB heap. So we are already talking about a fairly huge setup.
Our cluster settings are:

number_of_shards: 7
number_of_replicas: 1
index_refresh_interval: 30s
mlockall: true
discovery_minimum_master: 2
discovery_ping_timeout: 60s
gateway_expected_nodes: 18
index_translog_flush_threshold_size: 1g
indices_memory_index_buffer_size: 50%
threadpool_bulk_queue_size: 3000
indices_store_throttle_max_bytes_per_sec: 20mb
indices_cluster_send_refresh_mapping: false
script.disable_dynamic: true