SaaS monitoring application design

Hi there,

Our initial application version was written on MySQL, but we decided to try out ElasticSearch and so far I'm amazed with the results. My tests were quite simple, basic aggregations for charts/counts but the performance is incredible. However, reading through the docs I still struggle to understand some concepts.

Our application stores requests, queue jobs, queries, database/server metrics and other data.

What I'm struggling to understand is how to efficiently structure some of our data which later can be easily aggregated. Our estimates are that we'll be storing 500/550M documents a month with 3 month retention policy.

Data must be available based on user timezone setting.

A basic request object.

[
    'url'    => 'http://test-app.com/home',
    'method' => 'GET',
    'route'  => [
        'uri'    => '/home',
        'name'   => 'home',
        'action' => 'Closure',
    ],
    'user'   => [
        'id'   => 2,
        'name' => 'Ignas Bernotas',
    ],
    'meta'   => [
        'referrer'   => 'http://test-app.com/login',
        'user-agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
        'ip'         => '192.168.10.1',
    ],
    'response' => [
       'code' => 500,
       'content' => ''
    ],
    'queries' => [
     [
       'statement' => 'Select * from users where id = ?',
       'processed_in' => 34.33
     ],
     [
       'statement' => 'Select * from applications where id = ?',
       'processed_in' => 30.2
     ],
    ]
],

There are a couple of questions here:

  • Would it be better to flatten the structure if we want aggregate/search by response code or user id? As I understand it, ES does that by default before passing it down to Lucene.

  • We'd like to pull an aggregated report of requests grouped by route.uri and then aggregate all of the queries by average processed_in time so basically, the report would be something like this:

    • /home (total: 200k)
      • Select * from users where id = ? (total 200k, average response time 30ms)
      • Select * from applications where id = ? (total 200k, average response time 20ms)

    All queries will be separately stored in a different index since we need database reports. Would it be worth using relationships for storing queries in one index and then referencing them in request body? Aggregating by query string doesn't sound efficient.

  • I'm still confused about the amount of shards per index. Our biggest index will be for queries, at the moment we're using the default 5 for all indexes, but will that be enough? Query index will be quite small in terms of disk use, but requests will be a lot larger as you can see here (there might be 20-50 queries per single request).

Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.