Analytics system hits & visits


(Alex P-2) #1

Hi,

I am building an analytics engine for documents. I am storing logs on each
hit.

I need a histogram of "hits" but also of "visits". The visits can be
deduced from a session_id, so multiple hits have the same session_id.

The mapping looks like:
{
  "docHit" : {
    "properties" : {
      "doc_id" : {"type" : "long", "index" : "not_analyzed"},
      "section_id" : {"type" : "string", "index" : "not_analyzed"},
      ...
    }
  }
}

So I can return a histogram of "hits" for a particular document:
{
  "query": {
    "term": {
      "doc_id": 444
    }
  },
  "facets": {
    "hits_per_day": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      }
    }
  }
}

But how do I do the same for "visits"?

If I wanted to get total visits for a document I could try:
{
  "query": {
    "term": {
      "doc_id": 444
    }
  },
  "facets": {
    "visits": {
      "terms": {
        "field": "session_uid",
        "size": 0
      }
    }
  }
}

Which would return:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0.0,
    "hits": []
  },
  "facets": {
    "visits": {
      "_type": "terms",
      "missing": 0,
      "total": 6,
      "other": 0,
      "terms": [{
        "term": "26A1473FFBF2CC5E3A8FCC9BF2240241",
        "count": 4
      }, {
        "term": "3EC91387409740A1429676BB2A9CE02D",
        "count": 2
      }]
    }
  }
}

But I would need to return ALL terms and compute the length of
facets.visits.terms, which would be stupidly slow.
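
For reference, the client-side count described here amounts to something like the following. This is only a sketch in Python: the `response` dict is the sample output from this post trimmed to the facet, and `count_visits` is a made-up helper name. It assumes the facet really returned every term (that is, `other` is 0).

```python
# Sample terms facet response from this post, trimmed to the relevant part.
response = {
    "facets": {
        "visits": {
            "_type": "terms",
            "missing": 0,
            "total": 6,
            "other": 0,
            "terms": [
                {"term": "26A1473FFBF2CC5E3A8FCC9BF2240241", "count": 4},
                {"term": "3EC91387409740A1429676BB2A9CE02D", "count": 2},
            ],
        }
    }
}


def count_visits(response, facet_name="visits"):
    """Return the number of distinct sessions in a terms facet response.

    Hypothetical helper: only valid when the facet returned every term,
    so it raises if "other" reports hits that fell outside the list.
    """
    facet = response["facets"][facet_name]
    if facet["other"] != 0:
        raise ValueError("facet was truncated; the count would be too low")
    return len(facet["terms"])


print(count_visits(response))  # 2 visits for 6 hits
```

The cost is in shipping one entry per session back to the client just to count them, which is why this gets slow as the number of sessions grows.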

Is there a straightforward way to tackle this use case?

Thanks!

Alex


(Radu Gheorghe) #2

Hi Alex,

WARNING: this is sort of an "I'm having the same issue" reply. Not quite,
but sort of.

If I understand correctly, you want to know how many unique sessions you
got, right?

In that case you kinda need to count all the sessions, like you do in the
stupidly slow facet you mentioned. Whether the facet is the best option I
don't know, but you have to go through all of them somehow. I'm having a
similar issue, and the only workaround I found is to do these facets
incrementally, because I have huge loads of data and I can't afford to
run a facet on all of it.

So what I do is:

  • keep a separate index/type where I intend to store my unique items
  • run a script regularly which will get a facet like the one you mentioned,
    but only for newly added data since the last run (I'm basing this on
    timestamp)
  • take the unique terms from there and insert them in the "unique items"
    type
    • because I have no idea whether any item is there already, I use another
      dirty hack: use the unique item itself as an ID and insert them with
      ?op_type=create. So if it's already there it will give an error instead of
      inserting the same item twice
  • finally, when I want to do a statistic with those unique items I just run
    it on that "unique items" type. For example, if I want all my unique items
    since a certain date, it's just a matter of filtering by date and getting
    the number of hits.
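
The steps above can be sketched roughly as follows. This is a simulation in Python, not a real Elasticsearch client: `UniqueItemsStore` is an in-memory stand-in for the "unique items" type, `ConflictError` stands in for the error that `?op_type=create` gives on an existing ID, and the `first_seen` field and the integer timestamps are assumptions made for the example.

```python
class ConflictError(Exception):
    """Stands in for the error returned when op_type=create hits an existing ID."""


class UniqueItemsStore:
    """In-memory stand-in for the "unique items" type (not a real ES client)."""

    def __init__(self):
        self.docs = {}  # document ID -> document body

    def create(self, doc_id, body):
        # Mimics indexing with ?op_type=create: refuse to overwrite.
        if doc_id in self.docs:
            raise ConflictError(doc_id)
        self.docs[doc_id] = body

    def count_since(self, timestamp):
        # Equivalent of filtering by date and reading the number of hits.
        return sum(1 for d in self.docs.values() if d["first_seen"] >= timestamp)


def sync_unique_items(store, facet_terms, run_timestamp):
    """Insert the terms from one incremental facet run; return how many were new."""
    added = 0
    for entry in facet_terms:
        try:
            # The unique item itself is used as the document ID.
            store.create(entry["term"], {"first_seen": run_timestamp})
            added += 1
        except ConflictError:
            pass  # already known from an earlier run
    return added


store = UniqueItemsStore()
# First run: two sessions found in the data added so far.
sync_unique_items(store, [{"term": "A", "count": 4}, {"term": "B", "count": 2}], 100)
# Second run: "B" shows up again alongside a new session "C".
sync_unique_items(store, [{"term": "B", "count": 1}, {"term": "C", "count": 3}], 200)
print(store.count_since(0), store.count_since(150))
```

Because the create call refuses duplicates, re-running the script over an overlapping time window does not inflate the count.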

If anyone has a better option, I'd be delighted to hear it.

On Friday, August 10, 2012 7:13:42 PM UTC+3, Alex P wrote:



(Alex P-2) #3

Thanks Radu!

That's really interesting.

I came up with my own little hack (which has some similarities to yours):
I'm basically storing the data twice.

So I have a "hits" index and a "sessions" index. The sessions index has
the exact same data, except it only has 1 entry per session, as opposed to
the several entries per session in the hits index.

This means I'm duplicating data but now I can query for sessions just as
easily as I can for hits. Which is necessary so I can find sessions per
device, per country, per IP etc...
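
A rough sketch of that idea in Python, under some assumptions: hits arrive in chronological order, the first hit of a session is the one kept, and the sample field values are invented for illustration. `session_docs` and `visits_per_day_query` are hypothetical helper names, and the query simply reuses the shape of the hits histogram from the first post.

```python
def session_docs(hits, session_field="session_uid"):
    """Reduce hit documents to one document per session (the first hit seen)."""
    sessions = {}
    for hit in hits:
        # setdefault keeps the first hit of a session and ignores later ones.
        sessions.setdefault(hit[session_field], hit)
    return list(sessions.values())


def visits_per_day_query(doc_id):
    """Same shape as the hits histogram, meant to run against the sessions index."""
    return {
        "query": {"term": {"doc_id": doc_id}},
        "facets": {
            "visits_per_day": {
                "date_histogram": {"field": "timestamp", "interval": "day"}
            }
        },
    }


# Invented sample hits: two sessions, three hits in total.
hits = [
    {"doc_id": 444, "session_uid": "S1", "timestamp": "2012-08-10T16:13:42", "country": "UK"},
    {"doc_id": 444, "session_uid": "S1", "timestamp": "2012-08-10T16:15:00", "country": "UK"},
    {"doc_id": 444, "session_uid": "S2", "timestamp": "2012-08-11T09:00:00", "country": "RO"},
]

print(len(session_docs(hits)))  # one document per session
print(visits_per_day_query(444))
```

Since each session document keeps the same fields as a hit, any filter that works on hits (device, country, IP) works unchanged on sessions.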

And hopefully, if ES supports a better solution, my "hits" index is intact
and I can happily destroy all the "sessions" indexes.

On Saturday, 11 August 2012 21:21:48 UTC+1, Radu Gheorghe wrote:



(Otis Gospodnetić) #4

Hi Alex,

I see Radu is already helping. :)
Plug: Should you want to use a search analytics service, have a look at

Otis


