Hi,
I am looking for an efficient way to do inter-document queries in
Elasticsearch. Specifically, I want to count the number of users that went
through an exit point B after visiting point A.
In general terms, say we have some event log data about user actions on a
website:
....
{"userid":"xyz", "machineid":"110530745", "path":"/promo/A", "country":"US",
"tstamp":"2013-04-01 00:01:01"}
{"userid":"pdq", "machineid":"110519774", "path":"/page/1", "country":"CN",
"tstamp":"2013-04-01 00:02:11"}
{"userid":"xyz", "machineid":"110530745", "path":"/promo/D", "country":"US",
"tstamp":"2013-04-01 00:06:31"}
{"userid":"abc", "machineid":"110527022", "path":"/page/23", "country":"DE",
"tstamp":"2013-04-01 00:08:00"}
{"userid":"pdq", "machineid":"110519774", "path":"/page/2", "country":"CN",
"tstamp":"2013-04-01 00:08:55"}
{"userid":"xyz", "machineid":"110530745", "path":"/sale/B", "country":"US",
"tstamp":"2013-04-01 00:09:46"}
{"userid":"abc", "machineid":"110527022", "path":"/promo/A", "country":"DE",
"tstamp":"2013-04-01 00:10:46"}
....
And we have 500+M such entries.
We want a count of the number of userids that visited path=/sale/B after
visiting path=/promo/A.
What I did was preprocess the data: sort by <userid, tstamp>, then compact
all events for the same userid into a single document. I then wrote a
script filter that traverses the path array of each document and returns
true if it finds any occurrence of A followed by B. This, however, is
inefficient. Most of our queries take 1 or 2 seconds on 100+M events, whereas
this script filter query takes over 300 seconds; that is, it processes only
about 400K events per second. By comparison, I wrote a naive program that
does a linear pass over the un-compacted data, and it processes 11M events
per second. From this I conclude that Elasticsearch does not do well on this
type of query.
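The linear pass mentioned above can be sketched roughly as follows. This is a
minimal, illustrative Python sketch, not the actual program: it assumes one
JSON event per line, already in tstamp order, and the function name
count_a_then_b is hypothetical.

```python
import json


def count_a_then_b(lines, a="/promo/A", b="/sale/B"):
    """Count distinct userids that hit path `b` after hitting path `a`.

    Assumption: `lines` yields one JSON event per line, in tstamp order.
    """
    seen_a = set()   # users who have visited `a` so far
    matched = set()  # users who visited `b` after visiting `a`
    for line in lines:
        event = json.loads(line)
        user, path = event["userid"], event["path"]
        if path == a:
            seen_a.add(user)
        elif path == b and user in seen_a:
            matched.add(user)
    return len(matched)


# The sample events from this post, reduced to the fields the pass needs.
sample = [
    '{"userid":"xyz", "path":"/promo/A", "tstamp":"2013-04-01 00:01:01"}',
    '{"userid":"pdq", "path":"/page/1", "tstamp":"2013-04-01 00:02:11"}',
    '{"userid":"xyz", "path":"/promo/D", "tstamp":"2013-04-01 00:06:31"}',
    '{"userid":"abc", "path":"/page/23", "tstamp":"2013-04-01 00:08:00"}',
    '{"userid":"pdq", "path":"/page/2", "tstamp":"2013-04-01 00:08:55"}',
    '{"userid":"xyz", "path":"/sale/B", "tstamp":"2013-04-01 00:09:46"}',
    '{"userid":"abc", "path":"/promo/A", "tstamp":"2013-04-01 00:10:46"}',
]

print(count_a_then_b(sample))  # prints 1: only xyz reaches /sale/B after /promo/A
```

The pass keeps two sets keyed by userid, so memory grows with the number of
distinct users rather than with the number of events.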
I am hoping someone can indicate a more efficient way to do this query in
ES. Or else confirm that ES cannot do inter-document queries well.
Thanks,
Zennet