Inter-document Queries

Hi,

I am looking for an efficient way to do inter-document queries in
Elasticsearch. Specifically, I want to count the number of users that went
through an exit point B after visiting point A.

In general terms, say we have some event log data about user actions on a
website:
....
{"userid":"xyz", "machineid":"110530745", "path":"/promo/A", "country":"US", "tstamp":"2013-04-01 00:01:01"}
{"userid":"pdq", "machineid":"110519774", "path":"/page/1", "country":"CN", "tstamp":"2013-04-01 00:02:11"}
{"userid":"xyz", "machineid":"110530745", "path":"/promo/D", "country":"US", "tstamp":"2013-04-01 00:06:31"}
{"userid":"abc", "machineid":"110527022", "path":"/page/23", "country":"DE", "tstamp":"2013-04-01 00:08:00"}
{"userid":"pdq", "machineid":"110519774", "path":"/page/2", "country":"CN", "tstamp":"2013-04-01 00:08:55"}
{"userid":"xyz", "machineid":"110530745", "path":"/sale/B", "country":"US", "tstamp":"2013-04-01 00:09:46"}
{"userid":"abc", "machineid":"110527022", "path":"/promo/A", "country":"DE", "tstamp":"2013-04-01 00:10:46"}
....
And we have 500+M such entries.

We want a count of the number of userids that visited path=/sale/B after
visiting path=/promo/A.

What I did was preprocess the data: sort by <userid, tstamp>, then compact
all events for the same userid into a single document. I then wrote a script
filter that traverses the path array of each document and returns true if it
finds any occurrence of A followed by B. This, however, is inefficient. Most
of our queries take 1 or 2 seconds on 100+M events, but this script filter
query takes over 300 seconds; that is, it processes about 400K events per
second. By comparison, a naive program I wrote that does a linear pass over
the un-compacted data processes 11M events per second. From this I conclude
that Elasticsearch does not do well on this type of query.
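
For reference, the linear pass described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the events are held in memory as dicts using the field names from the sample, and the function name count_a_then_b is made up here. It is not the script filter or the program that produced the timings above.

```python
from collections import defaultdict

def count_a_then_b(events, a="/promo/A", b="/sale/B"):
    """Count userids that have a visit to `b` after a visit to `a`."""
    # Group (tstamp, path) pairs by user. The zero-padded timestamp format
    # used in the sample sorts correctly as a plain string.
    by_user = defaultdict(list)
    for e in events:
        by_user[e["userid"]].append((e["tstamp"], e["path"]))

    count = 0
    for visits in by_user.values():
        seen_a = False
        for _, path in sorted(visits):
            if path == a:
                seen_a = True
            elif path == b and seen_a:
                count += 1
                break  # count each user at most once
    return count

events = [
    {"userid": "xyz", "path": "/promo/A", "tstamp": "2013-04-01 00:01:01"},
    {"userid": "pdq", "path": "/page/1", "tstamp": "2013-04-01 00:02:11"},
    {"userid": "xyz", "path": "/promo/D", "tstamp": "2013-04-01 00:06:31"},
    {"userid": "xyz", "path": "/sale/B", "tstamp": "2013-04-01 00:09:46"},
    {"userid": "abc", "path": "/promo/A", "tstamp": "2013-04-01 00:10:46"},
]
print(count_a_then_b(events))  # 1: only xyz reaches /sale/B after /promo/A
```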

I am hoping someone can indicate a more efficient way to do this query in
ES. Or else confirm that ES cannot do inter-document queries well.

Thanks,
Zennet

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/28c93f2d-e870-4347-8677-e9da41b6be62%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This model is not efficient for this type of querying. You cannot do this
in one query using this model, and the pre-processing work you do now +
traversing all documents is very costly.

Is it possible for you to index the data (even as a projection) into
Elasticsearch using a different model, so you can use ES properly using
queries or the aggregations framework?

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

(In reply to Zennet Wheatcroft's original message of Thu, Jun 5, 2014, 12:04 AM.)

Yes. I can re-index the data or transform it in any way to make this query
efficient.

What would you suggest?

(In reply to Itamar Syn-Hershko's message of Wednesday, June 4, 2014, 2:14:09 PM UTC-7.)

You need to be able to form buckets that can be reduced again, either using
the aggregations framework or a query. One model that will allow you to do
that is something like this:

{ "userid": "xyz", "path":"/sale/B", "previous_paths":[...],
"tstamp":"...", ... }

So whenever you add a new path, you denormalize and add previous paths that
could be relevant. This might bloat your storage a bit and be slower on
writes, but it is very optimized for reads since now you can do an
aggregation that queries for the desired "path" and buckets on the user. To
check the condition of the previous path you should be able to bucket again
using a script, or maybe even with a query on a nested type.

This is just off the top of my head, but it should definitely work if you can
get to that model.
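
A minimal Python sketch of this denormalization, for illustration only. It assumes the events of each user can be ordered by time at index time. The helper names denormalize and count_query are hypothetical, and the search body assumes that userid, path and previous_paths are indexed as exact (not analyzed) values. It uses a plain term match on previous_paths rather than the script or nested type mentioned above, and it is written in the bool/filter form of later Elasticsearch releases, so an older cluster would need the equivalent filtered query.

```python
from collections import defaultdict

def denormalize(events):
    """Yield one document per event, carrying the user's earlier paths."""
    history = defaultdict(list)  # userid -> paths seen so far, in time order
    for e in sorted(events, key=lambda e: (e["userid"], e["tstamp"])):
        doc = dict(e, previous_paths=list(history[e["userid"]]))
        history[e["userid"]].append(e["path"])
        yield doc

def count_query(target, earlier):
    """Search body: distinct users at `target` who visited `earlier` before."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"path": target}},
            {"term": {"previous_paths": earlier}},
        ]}},
        # cardinality returns an approximate distinct count of userid
        "aggs": {"users": {"cardinality": {"field": "userid"}}},
    }

events = [
    {"userid": "xyz", "path": "/promo/A", "tstamp": "2013-04-01 00:01:01"},
    {"userid": "xyz", "path": "/promo/D", "tstamp": "2013-04-01 00:06:31"},
    {"userid": "xyz", "path": "/sale/B", "tstamp": "2013-04-01 00:09:46"},
    {"userid": "abc", "path": "/promo/A", "tstamp": "2013-04-01 00:10:46"},
]
docs = list(denormalize(events))

# The same condition evaluated locally, to check the model before indexing.
users = {d["userid"] for d in docs
         if d["path"] == "/sale/B" and "/promo/A" in d["previous_paths"]}
print(len(users))  # 1
```

The write-time cost is visible in denormalize: every event copies the full history of its user, which is what makes the read side a simple filter.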

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

(In reply to Zennet Wheatcroft's message of Thu, Jun 5, 2014, 2:36 AM.)

A suggestion for the path model:

  • index also the path depth, and name the fields with the depth level
  • execute a nested aggregation query over the path depth levels

Example doc with path info:

{
"path0" : "promo/A",
"path1" : "sale/B"
...
}

In this doc you know the user went from "promo/A" to "sale/B" (level 0 to
level 1).

You can now aggregate over "path0" first (maybe filter for path0 being
"promo/A" for better efficiency) and then find out in a nested aggregation
over "path1" the next steps that were visited.

This is a realistic model unless the path depth is very large, but in
reality, nobody clicks into structures deeper than 4 or 5 levels.
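
A rough Python sketch of the depth-numbered model, under assumptions: one document per user, fields named path0, path1, ... in visit order, and the helper names session_docs and next_steps are invented for this example. next_steps emulates locally what a filter on one level plus a terms aggregation on the next level would return, and agg_body shows what such a search body might look like if the path fields are indexed as exact values.

```python
from collections import Counter, defaultdict

def session_docs(events):
    """Build one document per user with fields path0, path1, ... in visit order."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: e["tstamp"]):
        by_user[e["userid"]].append(e["path"])
    return [
        {"userid": uid, **{f"path{i}": p for i, p in enumerate(paths)}}
        for uid, paths in by_user.items()
    ]

def next_steps(docs, first, level=0):
    """Count the values of path<level+1> among docs whose path<level> is `first`."""
    return Counter(
        d[f"path{level + 1}"]
        for d in docs
        if d.get(f"path{level}") == first and f"path{level + 1}" in d
    )

# Possible search body: filter on path0, then bucket on path0 and path1.
agg_body = {
    "size": 0,
    "query": {"term": {"path0": "/promo/A"}},
    "aggs": {"path0": {"terms": {"field": "path0"},
                       "aggs": {"path1": {"terms": {"field": "path1"}}}}},
}

events = [
    {"userid": "u1", "path": "/promo/A", "tstamp": "2013-04-01 00:01:01"},
    {"userid": "u1", "path": "/sale/B", "tstamp": "2013-04-01 00:02:00"},
    {"userid": "u2", "path": "/promo/A", "tstamp": "2013-04-01 00:03:00"},
    {"userid": "u2", "path": "/page/1", "tstamp": "2013-04-01 00:04:00"},
    {"userid": "u3", "path": "/page/1", "tstamp": "2013-04-01 00:05:00"},
    {"userid": "u3", "path": "/sale/B", "tstamp": "2013-04-01 00:06:00"},
]
docs = session_docs(events)
print(next_steps(docs, "/promo/A"))
```

Note that this answers "what came directly after A at a given depth". Counting users who reached B at any later point would need the check repeated across the deeper levels.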

Jörg

(In reply to Itamar Syn-Hershko's message of Thu, Jun 5, 2014, 2:01 AM.)

Thank you Itamar and Jörg for your replies.

I followed your suggestion, Itamar, and it works. Queries that took 300+
seconds now take 400 ms.

However, this model increases storage from O(N) to O(N^2) in the session
length N, which is usually not acceptable, so I would not consider this a
general method. It works here because the median length of a user session is
about 3, even though we do have sessions with hundreds of events. If the
median session length were 1000, this method would no longer work.
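
To put numbers on the quadratic growth, here is a small Python calculation. It assumes every event stores all earlier paths of its session, without deduplication; the function name is made up for the example.

```python
def stored_previous_paths(session_length):
    """Total previous-path entries stored for one session of the given length."""
    # The k-th event (0-based) carries k earlier paths: 0 + 1 + ... + (n - 1).
    return session_length * (session_length - 1) // 2

for n in (3, 100, 1000):
    print(n, stored_previous_paths(n))
# 3 3
# 100 4950
# 1000 499500
```

So a session of 3 events adds only 3 extra entries, while a session of 1000 events adds about half a million.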

Any other ideas or refinements? Or is this the best we can do with
Elasticsearch?

Zennet

(In reply to Itamar Syn-Hershko's message of Wednesday, June 4, 2014, 5:01:19 PM UTC-7.)

I simplified the actual problem in order to avoid explaining the
domain-specific details. Allow me to add back more detail.

We want to be able to search for multiple points of user action along a
conversion funnel, and to condition on multiple fields. Let's add another
field (response) to the above model:
{.., "path":"/promo/A", "response": 200, ..}
{.., "path":"/page/1", "response": 401, ..}
{.., "path":"/promo/D","response": 200, ..}
{.., "path":"/page/23", "response": 301, ..}
{.., "path":"/page/2", "response": 418, ..}
Let's say we define three points through the conversion funnel:
A: Visited path=/page/1
B: Got response=401 from some path
C: Exited at path=/sale/C

We want to know how many users went through steps A, B, and C in that order.
If we add an array prev_response, as we did for prev_path, then a term filter
can find documents with path=/sale/C, prev_path=/page/1, and
prev_response=401. But this does not distinguish between A->B->C and
B->A->C. Perhaps I could use the script filter for the "last mile" and throw
out the B->A->C cases from the term-filtered results; it should run more
quickly because of the reduced document set.

Is there another way to implement this query?

Zennet

On Wednesday, June 4, 2014 5:01:19 PM UTC-7, Itamar Syn-Hershko wrote:

You need to be able to form buckets that can be reduced again, either
using the aggregations framework or a query. One model that will allow you
to do that is something like this:

{ "userid": "xyz", "path":"/sale/B", "previous_paths":[...],
"tstamp":"...", ... }

So whenever you add a new path, you denormalize and add previous paths
that could be relevant. This might bloat your storage a bit and be slower
on writes, but it is very optimized for reads since now you can do an
aggregation that queries for the desired "path" and buckets on the user. To
check the condition of the previous path you should be able to bucket again
using a script, or maybe even with a query on a nested type.

This is just from the top of my head but should definitely work if you can
get to that model

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

On Thu, Jun 5, 2014 at 2:36 AM, Zennet Wheatcroft <zwhea...@atypon.com
<javascript:>> wrote:

Yes. I can re-index the data or transform it in any way to make this
query efficient.

What would you suggest?

On Wednesday, June 4, 2014 2:14:09 PM UTC-7, Itamar Syn-Hershko wrote:

This model is not efficient for this type of querying. You cannot do
this in one query using this model, and the pre-processing work you do now

  • traversing all documents is very costly.

Is it possible for you to index the data (even as a projection) into
Elasticsearch using a different model, so you can use ES properly using
queries or the aggregations framework?


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/28daf926-7126-4ad1-87eb-0cc931f11fea%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Together with Zennet we brainstormed a solution building on top of Itamar's
proposal.

In one string field we append the current path to all the previous ones, and
since we are talking about funnels we need to store that field only on the last
event/document generated, e.g. SessionEndedEvent.
Then we can use regex pattern matching to identify if the sequence of steps
can be found anywhere in the stored paths string. This solution appears to
be extremely fast.
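
A small sketch of the idea, using Python's re module only to illustrate the matching logic. The delimiter, the helper names (encode_trail, funnel_pattern) and the sample paths are assumptions, not details from the thread. Each path is wrapped in delimiters so that /promo/A does not accidentally match /promo/AB. Elasticsearch's regexp query uses Lucene's regular expression syntax, which differs from Python's and is anchored to the whole term, so the actual pattern would likely need a leading and trailing ".*" and a non-analyzed field.

```python
import re

DELIM = ";"  # assumed never to occur inside a path

def encode_trail(paths):
    """Concatenate paths in visit order, wrapping each one in delimiters."""
    return "".join(DELIM + p + DELIM for p in paths)

def funnel_pattern(steps):
    """Regex that matches when the steps occur in this order, gaps allowed."""
    return re.compile(".*".join(re.escape(DELIM + s + DELIM) for s in steps))

# Trail stored on the last document of a session.
trail = encode_trail(["/promo/A", "/promo/D", "/sale/B"])

print(bool(funnel_pattern(["/promo/A", "/sale/B"]).search(trail)))  # True
print(bool(funnel_pattern(["/sale/B", "/promo/A"]).search(trail)))  # False
```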

On Wednesday, June 11, 2014 1:14:59 AM UTC+3, Zennet Wheatcroft wrote:

I simplified the actual problem in order to avoid explaining the
domain-specific details. Allow me to add back more detail.

We want to be able to search for multiple points of user action, towards a
conversion funnel, and condition on multiple fields. Let's add another
field (response) to the above model:
{.., "path":"/promo/A", "response": 200, ..}
{.., "path":"/page/1", "response": 401, ..}
{.., "path":"/promo/D","response": 200, ..}
{.., "path":"/page/23", "response": 301, ..}
{.., "path":"/page/2", "response": 418, ..}
Let's say we define three points through the conversion funnel:
A: Visited path=/page/1
B: Got response=401 from some path
C: Exited at path=/sale/C

And we want to know how many users did steps A-B-C in that order. If we
add an array prev_response like we did for prev_path, then we can use a
term filter to find documents with term path=/sale/C and prev_path=/page/1
and prev_response=401. But this will not distinguish between A->B->C and
B->A->C. Perhaps I could use the script filter for the "last mile": from
the term-filtered results, throw out B->A->C. That should run more quickly
because of the reduced document set.

Is there another way to implement this query?
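
To make the ordering problem concrete, here is a hedged sketch in plain Python. The two sample sessions and the function names are hypothetical. term_filter_like mimics the order-blind check on prev_path and prev_response, and matches_in_order is the kind of "last mile" check a script would have to perform. It assumes each event can satisfy at most one step.

```python
# Steps A, B, C from the message above, as predicates over a single event.
STEPS = [
    lambda e: e["path"] == "/page/1",   # A: visited path=/page/1
    lambda e: e["response"] == 401,     # B: got response=401 from some path
    lambda e: e["path"] == "/sale/C",   # C: exited at path=/sale/C
]

def term_filter_like(events):
    """Order-blind check: last event is C, and A and B occurred earlier."""
    last, earlier = events[-1], events[:-1]
    return (STEPS[2](last)
            and any(STEPS[0](e) for e in earlier)
            and any(STEPS[1](e) for e in earlier))

def matches_in_order(events, steps=STEPS):
    """True if time-ordered events satisfy every step in sequence."""
    i = 0
    for ev in events:
        if i < len(steps) and steps[i](ev):
            i += 1  # one event advances at most one step (assumption)
    return i == len(steps)

a_b_c = [{"path": "/page/1", "response": 200},
         {"path": "/page/2", "response": 401},
         {"path": "/sale/C", "response": 200}]
b_a_c = [{"path": "/page/2", "response": 401},
         {"path": "/page/1", "response": 200},
         {"path": "/sale/C", "response": 200}]

print(term_filter_like(a_b_c), term_filter_like(b_a_c))  # True True
print(matches_in_order(a_b_c), matches_in_order(b_a_c))  # True False
```

Both sessions pass the term-style check, so only the ordered pass separates A->B->C from B->A->C.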

Zennet


Hi, only saw this now

I wouldn't worry too much about high space complexity. Storage comes cheap
nowadays, and the general practice in many systems is to store raw data and
do processing on demand (the most commonly known approach is event sourcing).

I can understand an argument about high space complexity being a problem
when this is not the core of your business, and in those cases I'd indeed
try to find a way to store the data in different ways, leveraging the
various advanced query types Elasticsearch offers, like the RegEx pattern
matching solution suggested by Theo.

HTH,

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

On Wed, Jul 2, 2014 at 5:04 PM, Theo Harris tkampour@gmail.com wrote:

Together with Zennet we brainstormed a solution building on top of
Itamar's proposal.

In one string field we append the current path to all the previous ones,
and since we are talking about funnels we need to store that field only on the
last event/document generated, e.g. SessionEndedEvent.
Then we can use regex pattern matching to identify if the sequence of
steps can be found anywhere in the stored paths string. This solution
appears to be extremely fast.
