Best practices for dealing with a large number of small activity stream events

(ppearcy) #1

Hi all,
I'm looking at using elasticsearch for a use case that I'd love some
feedback on regarding best practices.

A little background... I've been digging into various approaches for
interactive drill-down, slice-and-dice analysis of activity stream
(actor / verb / target) data for real-time end-user analytics. This is
high-dimensional data with too many potential views to effectively
precompute rollups. Other systems I have played around with that tackle a
similar problem are Druid, OpenTSDB, Blueflood, and InfluxDB. At the end of
the day, they all either use an inverted index or have (or plan to add)
Elasticsearch integrations, so I figure: why not stick with ES?

There are three areas I am trying to optimize:

  • Minimize the index footprint on disk.
  • Minimize the RAM footprint.
  • Maximize query speed.

I believe the key tradeoff I need to make with my dataset is going to be
doc_values: whether I keep field data on the heap or rely on the page cache.

All ~15 of my fields are straight exact-match, not_analyzed fields.
"not_analyzed" appears to disable all the extras that can cause bloat
(norms, frequencies, etc.). I am also not storing _source. Here is my index
template:
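The template itself didn't survive in the archive. As a reference point only, a minimal sketch of a mapping matching the description above might look like the following (the field names, the `activity-*` pattern, and the `event` type name are all assumptions, using ES 1.x-era `not_analyzed` syntax):

```python
import json

# Hypothetical sketch -- the original template was omitted from the archive.
# The shape matches the post's description: exact-match not_analyzed string
# fields, _source disabled, and doc_values toggled per field.
index_template = {
    "template": "activity-*",
    "mappings": {
        "event": {
            "_source": {"enabled": False},   # don't store the raw JSON
            "_all": {"enabled": False},      # skip the catch-all field
            "properties": {
                "actor":  {"type": "string", "index": "not_analyzed"},
                "verb":   {"type": "string", "index": "not_analyzed"},
                "target": {"type": "string", "index": "not_analyzed"},
                # ...and roughly a dozen more fields of the same shape.
                # Adding "doc_values": True per field moves field data
                # off-heap at the cost of extra bytes on disk.
            },
        }
    },
}

print(json.dumps(index_template, indent=2))
```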

With some test data, I'm getting pretty solid results. Messages average
~360 bytes each, and I am seeing:

  • 60 bytes per doc without doc_values
  • 80 bytes per doc with doc_values

On a test index with ~160 million docs without doc_values, I'm at 9.6GB of
data, with the file breakdown like so:
3.8G Jul 23 09:40 _mwf.fdt
3.9G Jul 23 10:32 _mwf_es090_0.tim
1.8G Jul 23 10:32 _mwf_es090_0.doc
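As a sanity check, the per-doc figure lines up with the on-disk total above:

```python
docs = 160_000_000      # ~160 million docs in the test index
bytes_per_doc = 60      # measured rate without doc_values

total_gb = docs * bytes_per_doc / 1e9
print(f"{total_gb:.1f} GB")   # 9.6 GB -- matches the reported index size
```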

Anybody know how I can slim things down any further, or have general advice
for dealing with large numbers of small documents?


You received this message because you are subscribed to the Google Groups "elasticsearch" group.
