Aggregation script that returns multiple documents from one document? (json_each in PostgreSQL)

(ddorian43) #1

I'm recording daily country-level analytics for each URL on my site.
I don't want to store each pageview as its own document, because that would require too much storage and processing time.

I could store one document per country, per day, per URL, but that would also be too much data.

So I've concluded that I should store one document per URL per day, and keep all the country stats in there:

{"url": "/", "country": {"CA": {"seconds": 500, "pageviews": 20}, "US": {"pageviews": 5}}}

Is there a way to break this into multiple documents at search time, so that I can do:

  1. top countries for url in date-range
  2. top countries for all urls
  3. sum(pageviews) for url in country "US"
  4. top(urls) for country "US"
  5. top(urls) (summing the values of all countries together)

The way this is done in PostgreSQL is with json_each() on a JSON column.

Is there something like that in Elasticsearch? Maybe I need to change the schema a little from what I have (e.g. from the nested object above to an array of {country, pageviews} objects)?
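One way to get json_each()-like behaviour is indeed to reshape the `country` object into an array of objects and map that array with the `nested` type. Below is a minimal sketch, not from this thread: the index layout, the added `date` field, and field names like `countries` are all assumptions, and the query uses Elasticsearch 2.x+ `bool`/`filter` syntax. The reshaped document would look like:

```json
{
  "url": "/",
  "date": "2015-06-01",
  "countries": [
    {"country": "CA", "seconds": 500, "pageviews": 20},
    {"country": "US", "pageviews": 5}
  ]
}
```

With `countries` mapped as `nested`, a `nested` aggregation treats each array entry as its own hidden document, so query #1 (top countries for one URL in a date range) could look like:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"url": "/"}},
        {"range": {"date": {"gte": "2015-06-01", "lte": "2015-06-30"}}}
      ]
    }
  },
  "aggs": {
    "per_country": {
      "nested": {"path": "countries"},
      "aggs": {
        "top_countries": {
          "terms": {"field": "countries.country"},
          "aggs": {
            "total_pageviews": {"sum": {"field": "countries.pageviews"}}
          }
        }
      }
    }
  }
}
```

Queries 2–5 would be variations on this: drop the `url` filter for all URLs, filter on `countries.country` inside the nested scope for a single country, or run `terms` on `url` at the root with a nested `sum` sub-aggregation for per-URL totals.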

If this is possible, I would like to go one level further and store, inside the 'country' object, an array of cities, each with its own pageviews. Then I could run the queries above but also group by city.
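The city idea fits the same pattern one level deeper: a nested array inside a nested array. Again just a sketch with assumed field names; `countries.cities` would also need its own `nested` mapping. The document could look like:

```json
{
  "url": "/",
  "date": "2015-06-01",
  "countries": [
    {
      "country": "US",
      "pageviews": 5,
      "cities": [
        {"city": "Boston", "pageviews": 3},
        {"city": "Chicago", "pageviews": 2}
      ]
    }
  ]
}
```

Grouping by city would then use the deeper nested path:

```json
{
  "size": 0,
  "aggs": {
    "per_city": {
      "nested": {"path": "countries.cities"},
      "aggs": {
        "top_cities": {
          "terms": {"field": "countries.cities.city"},
          "aggs": {
            "total_pageviews": {"sum": {"field": "countries.cities.pageviews"}}
          }
        }
      }
    }
  }
}
```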

Thank You!

(Mark Walkom) #2

Having a massive nested document that is constantly updated is not really a good use of Elasticsearch.

Why not store an event each day and then just aggregate to get the results? That's more what it's good at.
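For comparison, the event-based approach would look something like the following sketch (field names assumed): each event is one small flat document, and every one of the five queries becomes a plain terms/sum aggregation with no `nested` mapping at all.

```json
{"url": "/", "country": "US", "city": "Boston", "date": "2015-06-01", "pageviews": 1, "seconds": 12}
```

For example, "top URLs for country US" (query #4) could then be:

```json
{
  "size": 0,
  "query": {"term": {"country": "US"}},
  "aggs": {
    "top_urls": {
      "terms": {"field": "url"},
      "aggs": {
        "total_pageviews": {"sum": {"field": "pageviews"}}
      }
    }
  }
}
```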

(ddorian43) #3

Hi warkolm,

Note that the document won't ever be updated; it will just be inserted once at the end of the day (the incrementing happens outside Elasticsearch). (I know that updates are delete + insert, i.e. inefficient.)

What I want in this case is to avoid the per-document overhead I would incur if I stored it in a non-nested way.

What do you mean by "an event each day"? By "event" I understand a pageview, and that means one document for each pageview, which is a lot (multiple millions of pageviews (documents) per day).

(Mark Walkom) #4

A million documents a day is unlikely to be majorly different from one massive document a day. Plus, you then don't need to collate that information outside of ES.

(ddorian43) #5

I will have one document for each distinct URL each day; it's just that the per-document overhead would be very big. I would love to store raw events, but that is currently too costly.

Is there a way to do my original request ?

(system) #6