You are correct. And optimizing the index mapping is definitely the way to go. We ship with defaults that try to make things work out of the box for the new user, but spending some time understanding what you're indexing, how it is indexed, and whether or not you need it would be time well spent.
For the pure metrics use case, disabling _all and _source will save you a significant amount of space, but with the disadvantages that Mark pointed out above: not being able to query a catch-all field, and not being able to reindex your documents. As you've said, these things are not important for the metrics use case.
(Note: as an alternative to disabling _source, 2.0 allows you to choose between faster and better compression, which can be updated on the fly.)
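To make that concrete, here is a rough sketch of what such an index template body could look like, written as a Python dict in the 1.x/2.x template format. The metrics-* pattern and the keep_source switch are made up for illustration, and I'm assuming index.codec with best_compression is the 2.0 setting behind the compression option, so check it against the docs for your version before relying on it.

```python
import json


def metrics_template(pattern="metrics-*", keep_source=False):
    """Build an index template body for a metrics-only index.

    pattern and keep_source are illustrative names, not part of any API.
    """
    # Disabling _all drops the catch-all field for every type in the index.
    mapping = {"_all": {"enabled": False}}
    body = {"template": pattern, "mappings": {"_default_": mapping}}
    if keep_source:
        # Assumed 2.0 alternative: keep _source but trade speed for size.
        body["settings"] = {"index.codec": "best_compression"}
    else:
        # Without _source you can no longer reindex from the stored documents.
        mapping["_source"] = {"enabled": False}
    return body


print(json.dumps(metrics_template(), indent=2))
```

The dict would be sent as the body of a template PUT request; how you send it depends on the client you use.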
Enabling doc values by default is also the right way to go. In fact, it is the default for all fields in 2.0 (except analyzed strings, which are not supported).
The default logstash template adds an analyzed and a not_analyzed (raw) version of every string field, because we can't deduce up front which strings should be treated as keywords and which should be searchable as full-text. Again, this makes things just work out of the box, but it is not optimal.
Choosing the type of string field that you want up front is an easy optimization to make, as long as you know what your documents look like in advance. The two approaches can be combined: add specific mappings for the fields you know about, and rely on dynamic mappings for any fields you introduce later.
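As a sketch of combining the two approaches, the mapping below declares the known fields explicitly and adds one dynamic template for any string field that shows up later. The field names (host, status, message) are hypothetical, and treating unknown strings as not_analyzed keywords is just one possible choice, not what the logstash template does.

```python
def keyword_field():
    # 1.x/2.x style keyword: indexed as a single term, stored in doc values.
    return {"type": "string", "index": "not_analyzed", "doc_values": True}


def fulltext_field():
    # Analyzed string for full-text search (doc values are not supported here).
    return {"type": "string", "index": "analyzed"}


def build_mapping(keywords, fulltext):
    """Explicit mappings for known fields, dynamic template for the rest."""
    properties = {name: keyword_field() for name in keywords}
    properties.update({name: fulltext_field() for name in fulltext})
    return {
        "dynamic_templates": [
            {
                # Illustrative template name; applies to any unmapped string.
                "unknown_strings": {
                    "match_mapping_type": "string",
                    "mapping": keyword_field(),
                }
            }
        ],
        "properties": properties,
    }


mapping = build_mapping(keywords=["host", "status"], fulltext=["message"])
```

This avoids indexing every string twice, at the cost of having to decide up front which fields need full-text search.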
Couldn't have put it better myself
++
I think pipeline aggs are going to transform the types of analytics you can do in Elasticsearch. For those of you not familiar with pipeline aggs: they add the ability to aggregate on the results of other aggregations. For instance, you can:
- generate a date histogram of the max total new visitors per day, then pipe that into a derivative to see how many were added each day, then pipe that into another derivative to see the growth rate of your user base.
- use moving averages to smooth your data so that you can see general trends instead of noisy data
- use moving averages to calculate the 30/60 day average, and use Holt Winters to predict your future 30/60 day averages
- use bucket scripts to produce a new metric based on one or more other series, eg to calculate the percentage of sessions which performed a particular action
- use serial differencing to remove seasonal or weekly trends
- etc...
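For the first item in the list, a request body might look roughly like the sketch below, using the 2.0 derivative aggregation with buckets_path. The field names (@timestamp, total_visitors) and the aggregation names are assumptions for illustration. The small derivative helper is not part of Elasticsearch; it only mimics, in simplified form, what the numbers in each bucket would mean.

```python
def growth_query(date_field="@timestamp", count_field="total_visitors"):
    """Date histogram -> max per day -> derivative -> second derivative."""
    return {
        "size": 0,
        "aggs": {
            "per_day": {
                "date_histogram": {"field": date_field, "interval": "day"},
                "aggs": {
                    # Highest running total seen on each day.
                    "total": {"max": {"field": count_field}},
                    # Visitors added since the previous day.
                    "added": {"derivative": {"buckets_path": "total"}},
                    # Change in the daily additions, i.e. the growth rate.
                    "growth": {"derivative": {"buckets_path": "added"}},
                },
            }
        },
    }


def derivative(values):
    """Local stand-in: difference from the previous bucket, None if unknown."""
    out = [None]  # the first bucket has nothing to compare against
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


totals = [100, 130, 170, 180]   # made-up daily totals
added = derivative(totals)      # [None, 30, 40, 10]
growth = derivative(added)      # [None, None, 10, -30]
```

The other items in the list follow the same pattern: each pipeline agg names the series it reads from via buckets_path.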
We've focused on the most important pipeline aggs for now, and we'd rather get real user feedback about what is missing than implement a bunch of fancy stuff which may not be useful in the wild.
I haven't followed Grafana, which was originally a fork of Kibana 3. What features have they added which we should be adding to Kibana?