ES data modeling


(Shahar Mor) #1

Hi,

We are in the middle of improving our data modeling for Elasticsearch, and we have a few questions that could use some expert advice.

In a simplified way:

A client downloads some data from the server, and we want to track how much data was downloaded per client per file.
The downloaded data can come from the cache, from the filesystem, or from both, so we also need to differentiate between the two sources.

We can think of two ways to model it:

  1. Separate events for each source of download:
// one for cache
{
  type: 'cache',
  client: '123',
  file: 'test.rar',
  value: 100
}

// one for filesystem
{
  type: 'filesystem',
  client: '123',
  file: 'test.rar',
  value: 200
}
  2. Combine those events into a single one with separate fields:
// a single combined event
{
  type: 'download',
  client: '123',
  file: 'test.rar',
  cacheBytes: 100, // might also be 0
  filesystemBytes: 100 // might also be 0
}
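
To make the relationship between the two models concrete, here is a minimal sketch that folds option 1 events into option 2 documents. The helper name combineEvents is hypothetical, and it assumes that events for the same client and file should simply be summed; a real pipeline would probably also group by a time window.

```javascript
// Hypothetical helper: fold per-source events (option 1) into
// combined documents (option 2), keyed by client and file.
function combineEvents(events) {
  const combined = new Map();
  for (const { type, client, file, value } of events) {
    const key = JSON.stringify([client, file]);
    if (!combined.has(key)) {
      combined.set(key, {
        type: 'download',
        client,
        file,
        cacheBytes: 0,
        filesystemBytes: 0,
      });
    }
    const doc = combined.get(key);
    if (type === 'cache') doc.cacheBytes += value;
    else if (type === 'filesystem') doc.filesystemBytes += value;
  }
  return [...combined.values()];
}

const docs = combineEvents([
  { type: 'cache', client: '123', file: 'test.rar', value: 100 },
  { type: 'filesystem', client: '123', file: 'test.rar', value: 200 },
]);
console.log(docs);
```

Going the other way (splitting a combined document into two events) is just as easy, so neither model loses information as long as the zero values are kept.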

Obviously the first option will make ES store more events than the second one.
Our main concern is regarding query speed for those events afterwards.
We use Kibana as our main query interface.
If we want to graph this data as a time series, the first option gives us a single value metric, while the second gives us two: cacheBytes and filesystemBytes.
As for the query itself, with the first option we would use type:cache AND file:test.rar to get only cache data, while with the second we would need something like type:download AND file:test.rar AND cacheBytes:>0.
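
For reference, the two query strings above could be written as Elasticsearch query DSL bodies roughly as follows. This is only a sketch: the function names are made up for illustration, and it assumes type and file are indexed as exact-value (non-analyzed) fields so that term filters match the whole value.

```javascript
// Option 1: one event per source, so filter on the event type.
function cacheQueryOption1(file) {
  return {
    query: {
      bool: {
        filter: [
          { term: { type: 'cache' } },
          { term: { file } },
        ],
      },
    },
  };
}

// Option 2: combined event, so keep only documents that have cache bytes.
function cacheQueryOption2(file) {
  return {
    query: {
      bool: {
        filter: [
          { term: { type: 'download' } },
          { term: { file } },
          { range: { cacheBytes: { gt: 0 } } },
        ],
      },
    },
  };
}

console.log(JSON.stringify(cacheQueryOption2('test.rar'), null, 2));
```

Both bodies use only filter clauses, which skip scoring and can be cached, so the extra range clause in the second one is unlikely to be the deciding factor on its own.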

We would appreciate any comments and ideas on whether we should have more events with a single value each, or fewer events with multiple value fields.

Thanks in advance


(Mark Walkom) #2

First one should compress pretty well though, and using filters will be pretty quick to check.
As for the second one, you could use a scripted field in Kibana to add the two values and present a unified one.

So to answer you, why not try both and see what works best?
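
A rough sketch of what such a scripted field could look like, assuming the option 2 field names and that both fields are numeric and present on every document. The expression string is what one might paste into Kibana; the plain function below it is only a local equivalent for checking the arithmetic, and the name totalBytes is made up.

```javascript
// Expression for a scripted field (assumed field names from option 2).
const totalBytesScript = "doc['cacheBytes'].value + doc['filesystemBytes'].value";

// Plain JavaScript equivalent of the expression above.
function totalBytes(doc) {
  return doc.cacheBytes + doc.filesystemBytes;
}

console.log(totalBytesScript);
console.log(totalBytes({ cacheBytes: 100, filesystemBytes: 200 })); // 300
```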


(Imran Siddique) #3

What percentage of the time will you be reading from both sources? If that is, say, 80%, then the second format does save you events. Otherwise (say it is 20%), the second option doesn't save much. Either way, the second option will help in the future if you want to do analytics such as what percentage of the data within a session is read from cache versus file.
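
Both numbers mentioned here are easy to estimate from a sample of combined documents. The sketch below uses made-up sample data and a hypothetical summarize helper; it reports the fraction of downloads that touched both sources and the fraction of bytes served from cache.

```javascript
// Hypothetical helper: summarize a sample of option 2 documents.
function summarize(docs) {
  let both = 0;
  let cacheBytes = 0;
  let totalBytes = 0;
  for (const d of docs) {
    if (d.cacheBytes > 0 && d.filesystemBytes > 0) both += 1;
    cacheBytes += d.cacheBytes;
    totalBytes += d.cacheBytes + d.filesystemBytes;
  }
  return {
    // Share of downloads that read from both cache and filesystem.
    bothFraction: docs.length ? both / docs.length : 0,
    // Share of all bytes that came from cache.
    cacheFraction: totalBytes ? cacheBytes / totalBytes : 0,
  };
}

// Made-up sample, for illustration only.
const summary = summarize([
  { cacheBytes: 100, filesystemBytes: 100 },
  { cacheBytes: 0, filesystemBytes: 300 },
  { cacheBytes: 50, filesystemBytes: 0 },
  { cacheBytes: 50, filesystemBytes: 200 },
]);
console.log(summary); // { bothFraction: 0.5, cacheFraction: 0.25 }
```

The closer bothFraction is to 1, the closer the second model gets to halving the number of stored events.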

