ELK vs Grafana+InfluxDB

@elvarb - thanks for the info. I'm a little confused about the difference between logs and metrics. The way I understand it, logs make metrics. If you cut all the detail down to one number per time frame (e.g. logins per day), the result will be small and fast, but you've lost much of the ability to slice and dice it (e.g. how does that break down by traffic source, visitor type, actions taken, etc.)

And if I want to condense it down to one number like that for ELK, I can create the metrics myself or in Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html).
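
For example, counting logins per minute could look something like this - a minimal sketch, assuming your events carry an `action` field (the tag, endpoint and index name are made up for illustration):

```
filter {
  if [action] == "login" {
    metrics {
      # emit a synthetic event every 60s with logins.count and 1m/5m/15m rates
      meter          => [ "logins" ]
      add_tag        => [ "metric" ]
      flush_interval => 60
    }
  }
}

output {
  # route only the synthetic metric events to a separate index
  if "metric" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "metrics-%{+YYYY.MM.dd}"
    }
  }
}
```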

Do you have more detail about what's considered logs and what's metrics in the BI context?

Also - for your storage comparison, did you tweak your mapping? We're using https://gist.github.com/yehosef/f96dc491bcd5ee9bf7d3#file-template-config and we've got 250M rows on a 16GB machine with about 70GB of data (originally about 300-400GB of raw JSON). I don't know how this would compare with InfluxDB - do you have specific numbers for comparison?

Outside of moving averages and storage requirements, are there specific things that InfluxDB+Grafana can do that ELK doesn't?

You're right, it's simple to grab metrics from many application logs; however, some things like lower-level OS metrics aren't that simple to get. We're definitely aiming to cover the latter with [Beats](https://www.elastic.co/products/beats).

However, I currently monitor my VPC with collectd+ELK. It's probably overkill for a 2vCPU/2GB VM, but it works a treat!

Just on the tweaked mapping: disabling _source means you cannot reindex your data, which may not be a problem for you. Disabling _all means you lose your search shortcut and you need to be specific about which field to search; that may not be a problem if you are building metric-level dashboards, but it won't be best for all uses.

Agreed with "right tool for the right job". While Elasticsearch is a more general solution compared to time-series databases, it can make sense for certain time-series workloads, and we hear many reports of users adopting it as a time-series DB.

One case in point: here is a recent independent evaluation from CERN, presented at the 21st International Conference on Computing in High Energy and Nuclear Physics in April 2015: http://cds.cern.ch/record/2011172/files/LHCb-TALK-2015-060.pdf

They compared Elasticsearch to InfluxDB and OpenTSDB and found that Elasticsearch scaled better than either for their high-scale time-series analytics use case.

See detailed performance benchmarks in the paper.

@warkolm - about OS metrics: that's just a question of reporting. There are plenty of ways of getting that information, and if Graphite has it built in, it'll be easier. But that's orthogonal to Elasticsearch vs InfluxDB.

About the mapping tweaks - you're right: you can't reindex and you can't do global searches. But unless you're doing "logging", that doesn't really matter. If you use the standard mapping, every string field will be analyzed (lowercased and tokenized, IIRC), which adds a huge amount of space. And unless that's what you want, it actually makes much of the typical analytics work more difficult - e.g. any field like "keywords" becomes unusable unless its values contain no spaces. I think it would be helpful if Elastic published more recommended template mappings for different use cases.

The main problem we have with our mapping is that we can't use the "discover" part of Kibana, because those fields are not exposed. We're planning on keeping a small subset of the data (the last 30 days) with the traditional mapping in addition to our main "optimized" indices. This will let us use "discover" to build queries and then run them on the other indices.
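
For the curious, the tweaks we're talking about are roughly this shape - a sketch in ES 1.x syntax, with a placeholder template name and index pattern:

```
PUT _template/optimized_events
{
  "template": "events-*",
  "mappings": {
    "_default_": {
      "_all":    { "enabled": false },
      "_source": { "enabled": false },
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      ]
    }
  }
}
```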

The question I'm trying to understand and answer is: what does "the right tool for the right job" mean in this context? @elvarb said "Logs go to Elasticsearch and metrics to InfluxDB", but I'm still trying to understand what InfluxDB would give us that Elasticsearch doesn't, especially if you understand how to tweak ES.

@tbragin, thanks for the link to the CERN paper. It's interesting that they couldn't get InfluxDB to scale.

Currently the most compelling point for InfluxDB, for me, is Grafana. But I'm still trying to understand more about its strong points in comparison to ES.

Great idea regarding mapping use cases - I'll take that back internally and see if we can get some material out on it.

There are a few reasons why I think going with InfluxDB for metrics is better than using ELK (for now - that might change with Elasticsearch 2.0, and the CERN paper has given me doubts as well):

  1. The whole ecosystem around Graphite metrics is huge: you can pick from loads of tools to gather the metrics, loads of tools to aggregate the metrics, loads of tools to store the metrics, loads of tools to view the metrics and loads of tools to monitor the metrics.
  2. Of all those options, InfluxDB solves the aggregation and storage parts in a very easy-to-use and easy-to-manage package. Most other options require more work to get running and to maintain.
  3. Grafana is, in my opinion, by far the best metric visualization tool available; it works with Graphite, InfluxDB and OpenTSDB. There is an open ticket on the Grafana project about adding Elasticsearch support as well, but I doubt they will start work on that until Elasticsearch 2.0 is released.
  4. By using the Graphite format you can replace nearly every piece with a different solution, so it's very future-proof. Metrics 2.0 (basically context support with tags) will be a game changer for everyone, but it's too early to worry about that now; only a handful of solutions support it.
  5. Storing metrics in InfluxDB takes a lot less space than in Elasticsearch. Think about how the Graphite format works: you have a namespace of "sitename.hostname.appname.subname.metricname", a value and a timestamp. In metrics databases that namespace is stored once, and only the value and timestamp are stored per data point. In Elasticsearch you would have to store all of it for each data point, plus you would have to analyze the namespace field so you can query it (see the sketch below).
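
To make point 5 concrete, a rough illustration (all the names here are invented):

```
# Graphite line protocol: the namespace is stored once in the metrics DB,
# then only a (value, timestamp) pair is appended per data point
acme.web01.nginx.requests_per_sec 1024 1436251626

# The equivalent Elasticsearch document carries every field with each point
{
  "@timestamp": 1436251626,
  "site":   "acme",
  "host":   "web01",
  "app":    "nginx",
  "metric": "requests_per_sec",
  "value":  1024
}
```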

In the end, do a test: pick a metric and send it to both ELK and InfluxDB for a week. Evaluate the disk-space usage of both. Test viewing the metrics in Kibana for ELK and Grafana for InfluxDB. This will give you a solid feel for all the angles.

Thanks - some feedback on your points:

  1. The ecosystem around ELK is shaping up pretty rapidly - but you may well be right that there is more tooling around Graphite today.

  2. This doesn't say how InfluxDB is better than Elasticsearch. I've read of people building multi-TB Elasticsearch clusters; I'm not sure what examples there are of InfluxDB clusters that big.

  3. Agreed, though I think Kibana has tremendous potential, and I hope to see them develop it.

  4. Could be - not a factor for us.

  5. It's not clear to me how true this is if one optimizes the index mapping, as I mentioned. Also, you don't analyze the field names, just the field contents; if your values are just numbers, there is no analysis. And the field names are column pointers, if I understand correctly, so you're not paying the price for the field/column name on every record like you do in Mongo. To find the possible field names, you query against the mapping - someone correct me if this is wrong.
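
I.e., something like this, if I'm not mistaken (the index name is just an example):

```
GET /events-2015.07.01/_mapping
```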

As an example of sizing: we have data that looks like http://jsonblob.com/55952b0ae4b051e806c87aa1. The index for one day is 373MB for 1.9M records (approx. 200B per record; with about 34 fields, that's about 6B per field). The advantage is that we can use this both for simple metrics (show me the number of mobile visitors I had by hour on a given day) and for more complex aggregations (show me the number of mobile visitors in the top cities, broken down by browser name, by hour on that day). And I can get this from one data source without knowing from the beginning that I would want those numbers.
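
The second question translates into something like this - a sketch, assuming fields named `device`, `city` and `browser` (substitute whatever your schema actually uses):

```
POST /events-2015.07.01/_search
{
  "size": 0,
  "query": { "term": { "device": "mobile" } },
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "@timestamp", "interval": "hour" },
      "aggs": {
        "top_cities": {
          "terms": { "field": "city", "size": 10 },
          "aggs": {
            "by_browser": { "terms": { "field": "browser" } }
          }
        }
      }
    }
  }
}
```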

It's possible InfluxDB could also do this - I'm not sure. But I don't think Graphite can. And this is what pulls me towards Elasticsearch: the ability to ask and answer questions that I didn't think about when I stored the data.

In the end, you're right: the best approach is to try both tools and see which works best for you. Since ES 2.0 is around the corner, we're probably going to hold off for that, since it'll take care of some of the more complex aggregation problems (moving averages..). But thanks for your insights - it's valuable to hear all the details.

You are correct. And optimizing the index mapping is definitely the way to go. We ship with defaults that try to make things work out of the box for the new user, but spending some time understanding what you're indexing, how it is indexed, and whether or not you need it would be time well spent.

For the pure metrics use case, disabling _all and _source will save you a significant amount of space, but with the disadvantages that Mark pointed out above: not being able to query a catch-all field, and not being able to reindex your documents. As you've said, these things are not important for the metrics use case.

(Note: as an alternative to disabling _source, 2.0 allows you to choose between faster and better compression, which can be updated on the fly.)
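
For example, something along these lines (the template name and index pattern are illustrative):

```
PUT _template/metrics_compression
{
  "template": "metrics-*",
  "settings": {
    "index.codec": "best_compression"
  }
}
```

`best_compression` trades a little indexing speed for noticeably smaller indices, which suits append-mostly metrics data.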

Doc values by default are also the right way to go. In fact, they are the default for all fields in 2.0 (except analyzed strings, where they are not supported).

The default Logstash template adds an analyzed and a not_analyzed ("raw") version of every string field, because we can't deduce up front which strings should be treated as keywords and which should be searchable as full text. Again, this makes things just work out of the box, but it is not optimal.

Choosing the type of string field that you want up front is an easy optimization to make, as long as you know what your documents look like in advance. The two approaches can be combined: add specific mappings for the fields you know about, and rely on dynamic mappings for any fields you introduce later.
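
A rough sketch of that combination (ES 1.x syntax; the type, field and template names are only illustrative): `message` keeps full-text search, `city` is an exact-match keyword, and any string field we didn't anticipate falls through to the dynamic template as not_analyzed.

```
PUT _template/events
{
  "template": "events-*",
  "mappings": {
    "event": {
      "properties": {
        "message": { "type": "string" },
        "city":    { "type": "string", "index": "not_analyzed" }
      },
      "dynamic_templates": [
        {
          "unknown_strings": {
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ]
    }
  }
}
```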

Couldn't have put it better myself :slight_smile:

++

I think pipeline aggs are going to transform the types of analytics you can do in Elasticsearch. For those of you not familiar with pipeline aggs: they add the ability to aggregate on the results of other aggregations. For instance, you can:

  • generate a date histogram of the max total new visitors per day, then pipe that into a derivative to see how many were added each day, then pipe that into another derivative to see the growth rate of your user base (sketched after this list)
  • use moving averages to smooth your data so that you can see general trends instead of noisy data
  • use moving averages to calculate the 30/60-day average, and use Holt-Winters to predict your future 30/60-day averages
  • use bucket scripts to produce a new metric based on one or more other series, eg to calculate the percentage of sessions which performed a particular action
  • use serial differencing to remove seasonal or weekly trends
  • etc...
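
A sketch of the first bullet (2.0 syntax; the index and field names are invented for illustration):

```
POST /traffic-*/_search
{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" },
      "aggs": {
        "total_visitors": { "max": { "field": "total_new_visitors" } },
        "added_per_day":  { "derivative": { "buckets_path": "total_visitors" } },
        "growth_rate":    { "derivative": { "buckets_path": "added_per_day" } }
      }
    }
  }
}
```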

We've focused on the most important pipeline aggs for now, but we'd rather get real user feedback about what is missing than just implement a bunch of fancy features that may not be useful in the wild.

I haven't followed Grafana, which was originally a fork of Kibana 3. What features have they added which we should be adding to Kibana?

The example data you provided is exactly the kind of data that belongs in ELK.

> I haven't followed Grafana, which was originally a fork of Kibana 3. What features have they added which we should be adding to Kibana?

http://play.grafana.org/

Have fun :smile:

You can look at their features and playground to see the differences. I haven't spent much time on them personally because it's irrelevant for us right now: Grafana doesn't support Elasticsearch, and we've decided to build our analytics on Elasticsearch. Instead, we're currently planning on building our own visualization layer. I'd be happy to share it with you and your team - maybe you can start to move Kibana in this direction and we'll all benefit.

Even though I said we're not looking at Grafana, I decided to spend some time with it, since we're planning on building our own visualization solution. And wow, Grafana is very nice.

I can see why people say logs to elasticsearch and metrics to influx/graphite - right tool for the right job. I didn't understand it before but now I do.

But it's not exactly what it sounds like. "The right tool for the right job" here isn't about the database/store but about the visualization. Metrics should go to Grafana (currently): it is much more sophisticated for viewing metric-type data. Logging should go to Kibana: Grafana has no interface for showing or exploring that kind of data.

But there is really nothing I can see on the backend why metrics should go to InfluxDB over Elasticsearch (once you know how to optimize indices). The issue is that once you put metrics in Elasticsearch, you can't analyze them or create dashboards as well as you can in InfluxDB.

Hopefully Kibana becomes more flexible and powerful in upcoming releases - Elasticsearch deserves it :smile:

This has me confused - before you were saying ES can do this, but now you mention it cannot?

Elasticsearch can handle it fine - but how do I use all that information in Elasticsearch? I need some visualization tool. That tool is Kibana, which is not currently as powerful as Grafana (IMO). So the "problem" with Elasticsearch has nothing to do with Elasticsearch itself - it's that once I put my data into Elasticsearch, I don't have off-the-shelf tools to visualize it as well as I can with InfluxDB (because data there can be visualized with Grafana).

The beauty of Elasticsearch is that it can really handle both problems very well. I can easily put in "metrics"-type information or "logging"-type information - or just logging information, and extract the metrics. E.g., using Grafana, I can only say "show me the graph of logins where the browser was Chrome" if I stored that metric from the beginning (e.g. metrics.login.chrome). And then, if I want to see a chart of the different actions Chrome browsers take, I would have had to store that differently (metrics.chrome.*actions). With Elasticsearch, I can just store my events, then build visualizations and ask questions afterwards. I think that is a huge difference, and it's why we are building our system on Elasticsearch.

For simple metrics that have no real properties (e.g. many OS-level measurements: system load, file-system usage) you don't need this extra power, so it's much more valuable to put the measurements in a tool that gives you more power in visualization (e.g. Grafana). But even here, I think there is a lot of room for richer logging stored in ES. E.g., for each measurement time frame, put all the different metrics in one record, like {time:1436251626, load:3, disk_usage:"23GB", swap:"12MB", pages_in:20, processes:[...]}. You can still build your metrics the same as usual (with the exception that you need to use Kibana instead of Grafana), but you also have much more power: "Show me the top 5 processes when the load is over 2". Instead of seeing that there is a load problem and then trying to dig into which process is causing it, you can ask the question directly.
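
That question could look something like this - a sketch, assuming the record above is indexed as-is and each process entry has a `name` field:

```
POST /sysmetrics-*/_search
{
  "size": 0,
  "query": { "range": { "load": { "gt": 2 } } },
  "aggs": {
    "top_processes": {
      "terms": { "field": "processes.name", "size": 5 }
    }
  }
}
```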

It's important to point out that there is nothing inherent in Kibana that prevents it from being more powerful. As Clinton pointed out, Grafana is a fork of Kibana 3. I assume Kibana 4 was a major technological shift and the features are still catching up. I hope that process is fast and we end up with a visualization tool that's as flexible, powerful and amazing as the storage engine powering it.

Also, FYI - other people would also like to use Grafana with ES: https://github.com/grafana/grafana/issues/1034
It looks like there are people working on a connector, and there is an ES-Graphite shim at https://github.com/distributed-system-analysis/es-graphite-shim

The problem is that Grafana doesn't support the more analytical, less metric-y visualizations. You could at least have the data in one source, but you'd still have to build two visualization dashboards. Hopefully that problem will go away as Kibana matures.

Ok, that makes sense :slight_smile:

Currently, I use both of them in production: Grafana for metric data (keys with numerical values) like CPU, memory and I/O utilization, application status, etc., while Elasticsearch is used to store application logs like syslog, nginx logs, even Elasticsearch's own logs. Generally, these contain text as well as numeric values that need to be analyzed.

For visualization features like graphs and dashboards, I like Grafana's graphs much more than Kibana's. Some Grafana graph features that enhance my navigation experience:

  • Tooltip showing all values at the same X crosshair
  • Auto-scaling Y-axis when selecting a particular series
  • Multiple Y-axes, to compare in detail two or more series with very different value ranges
  • Annotations on graphs
  • Tags on dashboards, to facilitate grouping and searching across many dashboards

I hope the next release of Kibana provides powerful visualization features like those in Grafana.

@elvarb I did some measurements and I can't confirm that Elasticsearch uses much more space than InfluxDB. In fact, if you optimize the index (this should only be done on indices that are no longer receiving updates), the space consumed by Elasticsearch is actually much lower.

I've sent a week's worth of datapoints (1 report/minute) to both InfluxDB and Elasticsearch.

Results first:

  • Elasticsearch after optimize: 425.8M
  • InfluxDB: 965.2M
  • Elasticsearch without optimize: 2.3G

Data sent
Each second, 1000 Timers (16 datapoints), 2000 Meters (5 datapoints) and 3000 Gauges (1 datapoint each) were reported. The test included 1008 reports, which is 23,184,000 datapoints in total.

The distribution of the different metric types (timers/meters/gauges) is derived from a real-world monitoring setup with stagemonitor, which had about 100 timers (16 datapoints/timer), 200 meters (5 datapoints/meter) and 300 gauges (1 datapoint/gauge) = 2300 datapoints. If you set the stagemonitor reporting interval to 1 minute, you get 23,184,000 datapoints per week (2300 * 60 * 24 * 7).

Example Timer:

```
response_time,application=Metrics\ Store\ Benchmark,host=N51,instance=instance count=1,m1_rate=3.0,m5_rate=4.0,m15_rate=5.0,mean_rate=2.0,min=4.0,max=2.0,mean=4.0,median=6.0,std=5.0,p25=0.0,p75=7.0,p95=8.0,p98=9.0,p99=10.0,p999=11.0
```

Example Meter:

```
meter,application=Metrics\ Store\ Benchmark,host=N51,instance=instance count=10,m1_rate=3.0,m5_rate=4.0,m15_rate=5.0,mean_rate=2.0
```

Example Gauge:

```
cpu_usage,application=Metrics\ Store\ Benchmark,host=N51,instance=instance value=1.0e-8
```

I've constantly updated the metrics with random values.

Elasticsearch settings
I've used the default settings and this index template: https://github.com/stagemonitor/stagemonitor/blob/influxdb/stagemonitor-core/src/main/resources/stagemonitor-elasticsearch-metrics-index-template.json

Benchmark
This is the benchmark: https://github.com/stagemonitor/stagemonitor/blob/influxdb/stagemonitor-core/src/test/java/org/stagemonitor/core/metrics/metrics2/MetricsStoreBenchmark.java
To execute it yourself, start an Elasticsearch and InfluxDB server and adjust the URLs.

Versions
Elasticsearch: 1.7.1
InfluxDB: 0.9.2.1

Conclusion
If you're using an optimized mapping and optimize your indices, Elasticsearch is more storage efficient. I did not benchmark reading the data or how scalable the databases are, but the CERN paper seems to indicate that Elasticsearch is also better in this area.

So what is the advantage of InfluxDB then? As far as I can see, it is only the better visualization support (Grafana - although Kibana 4 is nice as well) and things like continuous queries, which you could use to downsample old data to reduce storage space.

But Elasticsearch seems to have better support for sophisticated queries/functions through aggregations (moving averages and Holt-Winters are in the pipeline for 2.0!). The biggest advantage I see is that it is much more mature and scalable. InfluxDB's clustering API is currently in an alpha state, and you shouldn't form a cluster with more than 3 nodes. I doubt they will ever catch up with Elasticsearch in this area; there is just so much more money and manpower behind Elastic. The relatively new Beats project by Elastic indicates that they will be investing in the time-series/metrics area. I really hope that Grafana will at some point support Elasticsearch; otherwise they might lose users who can't use Elasticsearch with it to Kibana.

That could very well be the flaw in my research - I forgot to optimize.

If Elasticsearch uses that much less space than InfluxDB, the old-data compression feature in Elasticsearch 2.0 definitely makes Elasticsearch a very good platform for metrics.

What it lacks are better visualization tools and some way to downsample old data.

For projects like Bosun that support various data sources, being able to focus on only one would be a huge plus.

It would be interesting to get input from the folks at Stack Exchange (creators of Bosun) and Netflix (creators of Atlas). Both tools are built around a metrics solution, and both teams definitely did their own research into which metrics platform was best for them.

Awesome stuff @Felix_4 :smiley: