Hi,
We are in the middle of improving our data modeling for Elasticsearch, and we have a few questions that could use some expert advice.
Simplified, our scenario is:
A client downloads some data from the server, and we want to track how much data was downloaded per client per file.
The downloaded data can come from the cache, from the filesystem, or from both, so we also need to distinguish between the two sources.
We can think of two ways to model it:
- Separate events for each source of download:
// one for cache
{
type: 'cache',
client: '123',
file: 'test.rar',
value: 100
}
// one for filesystem
{
type: 'filesystem',
client: '123',
file: 'test.rar',
value: 200
}
- Combine those events into a single one with separate fields:
// a single event covering both sources
{
type: 'download',
client: '123',
file: 'test.rar',
cacheBytes: 100, // might also be 0
filesystemBytes: 100 // might also be 0
}
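To make the relationship between the two shapes concrete, here is a rough sketch of how option 1 events would fold into option 2 events. The helper name `combineEvents` is hypothetical, and it assumes all events for the same client and file belong to one download; a real pipeline would probably also key on a timestamp or download id, which our simplified events don't show.

```javascript
// Hypothetical helper: folds per-source events (option 1) into
// combined events (option 2), grouped by client + file.
function combineEvents(events) {
  const combined = new Map();
  for (const e of events) {
    // JSON-encode the pair so unusual characters in names cannot collide.
    const key = JSON.stringify([e.client, e.file]);
    if (!combined.has(key)) {
      combined.set(key, {
        type: 'download',
        client: e.client,
        file: e.file,
        cacheBytes: 0,
        filesystemBytes: 0
      });
    }
    const doc = combined.get(key);
    if (e.type === 'cache') {
      doc.cacheBytes += e.value;
    } else if (e.type === 'filesystem') {
      doc.filesystemBytes += e.value;
    }
  }
  return [...combined.values()];
}
```

Going the other way (splitting a combined event into up to two per-source events) is equally mechanical, so as far as we can tell neither shape loses information for this use case.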
Obviously the first option will make ES store more events than the second one.
Our main concern is query speed over those events afterwards.
We use Kibana as our main query interface.
If we want to graph this data as a time series, the first option gives us a single value metric, while the second gives us two: cacheBytes and filesystemBytes.
As for the query itself, with the first option we would use type:cache AND file:test.rar to get only cache data, while with the second we would have to use something like type:download AND file:test.rar AND cacheBytes:>0.
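For reference, this is roughly what we imagine the underlying search request bodies would look like for a time series of bytes per hour. It is only a sketch: the function names are made up, the `@timestamp` field is an assumption (our simplified events above don't show a time field), the `term` filters assume `type` and `file` are mapped as keyword fields, and `fixed_interval` assumes Elasticsearch 7.2 or later (older versions use `interval`).

```javascript
// Option 1: keep only cache events, then sum the single `value` field per bucket.
function cacheSeriesSeparateEvents(file, interval = '1h') {
  return {
    size: 0, // aggregation results only, no hits
    query: {
      bool: {
        filter: [{ term: { type: 'cache' } }, { term: { file: file } }]
      }
    },
    aggs: {
      over_time: {
        // '@timestamp' is an assumed field name
        date_histogram: { field: '@timestamp', fixed_interval: interval },
        aggs: { bytes: { sum: { field: 'value' } } }
      }
    }
  };
}

// Option 2: one pass over download events, one sum per byte field.
function seriesCombinedEvents(file, interval = '1h') {
  return {
    size: 0,
    query: {
      bool: {
        filter: [{ term: { type: 'download' } }, { term: { file: file } }]
      }
    },
    aggs: {
      over_time: {
        date_histogram: { field: '@timestamp', fixed_interval: interval },
        aggs: {
          cacheBytes: { sum: { field: 'cacheBytes' } },
          filesystemBytes: { sum: { field: 'filesystemBytes' } }
        }
      }
    }
  };
}
```

One thing we noticed while writing this: when the metric is a sum, the cacheBytes:>0 condition should not change the result, because events with 0 bytes add nothing. It would only matter if we wanted to count events that actually hit the cache.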
We would appreciate any comments and ideas on whether we should have more events with a single value each, or fewer events with multiple value fields.
Thanks in advance