# Modeling problem for aggregate frequencies: many ideas, but each one has problems

Hello,

I have a modeling problem that is fairly complex for me. I have
several solutions in mind, but each of them has drawbacks.
Here is my use case:
I have passengers who travel, and I want to know how often they travel:
the number of passengers per travel frequency. Pretty simple, isn't it?
The complexity comes from three things:
1 - Filters on travels. I want to find passengers using filters that apply
to travels, for example origin, destination, or date of the travel.
2 - Filters on the frequency calculation. I want to calculate frequencies
using a different set of filters. Once I have found a passenger, I want to
count his travels over another period of time (for example, the frequency
of travel during the last month).
3 - Uniqueness of passengers. If a passenger matches the filters on travels
twice, I don't want his travel frequency to be counted twice.

I will probably have tens of millions of passengers.

An example of the final chart that I want: number of passengers who are in
a Paris-Marseille travel by frequency of travels during the last month on
Other examples:
Tom has traveled two times:
24/12/2013 at 08:00:00 from Paris to Marseille
26/12/2013 at 10:00:00 from Marseille to Paris
Bob has traveled 1 time:
24/12/2013 at 12:00:00 from Paris to Marseille
John has traveled 1 time:
24/12/2013 at 08:00:00 from Paris to Marseille

First case:
I want to find travelers who traveled on 24/12/2013 from Paris.
I want to calculate the frequency of travels during the 12th month.
What I want is a histogram with two values:
Frequency of travels: 1, number of passengers: 2
Frequency of travels: 2, number of passengers: 1
In English: two passengers who left Paris on 24/12/2013 made one trip during
the 12th month, and one passenger made two.

Second case:
Same selection of travelers: I want to find travelers who traveled on
24/12/2013 from Paris.
I want to calculate the frequency of travels from Paris during the last month.
What I will get is this histogram:
Frequency of travels: 1, number of passengers: 3
In English: three passengers from Paris made one trip from Paris during
the last month.

Third case:
I want to find all travelers.
I want to calculate the frequency on all data.
What I will get is:
Frequency of travels: 1, number of passengers: 2
Frequency of travels: 2, number of passengers: 1
and not:
Frequency of travels: 1, number of passengers: 2
Frequency of travels: 2, number of passengers: 2
because each traveler must be counted only once.

I think you get the idea. Or not.
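
To make the three cases concrete, here is a minimal in-memory sketch in Python of the result I am after. It is only a reference for the expected numbers, not an Elasticsearch solution. The tuple layout and the function names are made up for illustration, and "last month" is assumed to mean December 2013.

```python
from collections import Counter
from datetime import date, datetime

# Sample data from the post: (passenger, departure time, origin, destination).
TRAVELS = [
    ("Tom", datetime(2013, 12, 24, 8), "Paris", "Marseille"),
    ("Tom", datetime(2013, 12, 26, 10), "Marseille", "Paris"),
    ("Bob", datetime(2013, 12, 24, 12), "Paris", "Marseille"),
    ("John", datetime(2013, 12, 24, 8), "Paris", "Marseille"),
]


def frequency_histogram(travels, select, count):
    """Return {frequency: number of passengers}.

    select: predicate choosing the travels that identify passengers.
    count:  predicate choosing the travels that count towards the frequency.
    """
    # A set keeps each passenger once, however many of his travels matched.
    passengers = {t[0] for t in travels if select(t)}
    per_passenger = Counter(
        t[0] for t in travels if t[0] in passengers and count(t)
    )
    # Passengers with no counted travel do not appear in the histogram.
    return dict(Counter(per_passenger.values()))


def from_paris_on_24th(t):
    return t[2] == "Paris" and t[1].date() == date(2013, 12, 24)


def in_december(t):
    # Assumption: "the 12th month" / "last month" is December 2013.
    return (t[1].year, t[1].month) == (2013, 12)


case1 = frequency_histogram(TRAVELS, from_paris_on_24th, in_december)
case2 = frequency_histogram(
    TRAVELS, from_paris_on_24th, lambda t: in_december(t) and t[2] == "Paris"
)
case3 = frequency_histogram(TRAVELS, lambda t: True, lambda t: True)
```

The two predicates are deliberately independent: `select` plays the role of the filters on travels, `count` the role of the filters on the frequency calculation.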

Solution 1: The easy way: 2 or 3 queries

The idea is straightforward:
One query to get the ids of the passengers matching the filters on travels.
With all these ids, count the number of travels per id matching the filters
on frequency.
Aggregate these frequencies into a number of passengers per frequency.
I could probably use a range facet with a script to count travels for each
passenger and directly get the data for my histogram.

Problem: - a query with potentially millions of ids
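
A rough sketch of what these steps could look like as request bodies, written as Python dicts. It assumes one document per travel with a hypothetical `passenger_id` field, and uses the pre-1.0 `filtered` query, `and`/`terms` filters and `terms` facet. Sending the ids in batches is my own addition to keep each request bounded; it does not remove the cost of moving millions of ids around.

```python
from collections import Counter
from itertools import islice


def passenger_ids_request(travel_filter, size):
    """Step 1: a terms facet lists the distinct passengers of matching travels."""
    return {
        "size": 0,  # only the facet is needed, not the hits
        "query": {"filtered": {"query": {"match_all": {}}, "filter": travel_filter}},
        "facets": {"passengers": {"terms": {"field": "passenger_id", "size": size}}},
    }


def travel_counts_request(ids, frequency_filter):
    """Step 2: count travels per passenger, restricted to one batch of ids."""
    return {
        "size": 0,
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"and": [{"terms": {"passenger_id": ids}}, frequency_filter]},
            }
        },
        "facets": {
            "passengers": {"terms": {"field": "passenger_id", "size": len(ids)}}
        },
    }


def chunks(ids, n):
    """Yield successive batches of at most n ids."""
    it = iter(ids)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


def histogram(facet_responses):
    """Step 3: turn per-passenger counts into {frequency: number of passengers}."""
    per_passenger = Counter()
    for response in facet_responses:
        for entry in response["facets"]["passengers"]["terms"]:
            per_passenger[entry["term"]] += entry["count"]
    return dict(Counter(per_passenger.values()))
```

Each id appears in exactly one batch, so the counts returned for a batch are already complete for its passengers and the final aggregation stays a simple client-side count.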

Solution 2: The cheating way
Precompute the frequency of travels for each passenger within the scope
of the frequency filter.

A document could look like this:
{
"filter_on_travels": "value1",
"list_of_passengers" : "1234#4 346345#6 214321#1 54325423#4"
}
A terms facet on list_of_passengers gives uniqueness on the id. I then just
have to count the frequencies.

Problem: - the scope of the frequency calculation cannot change
- probably a big result (many, many ids)
- the result has to be post-processed
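
The post-processing step could be sketched as follows. It assumes that each token in list_of_passengers has the form "id#frequency" and that the field is split on whitespace, so that the facet returns every distinct token once. The function name is made up for illustration.

```python
from collections import Counter


def histogram_from_pairs(lists_of_passengers):
    """Build {frequency: number of passengers} from "id#frequency" tokens."""
    frequency_by_id = {}
    for value in lists_of_passengers:
        for token in value.split():
            passenger_id, frequency = token.split("#")
            # The same id seen in several documents is kept only once.
            frequency_by_id[passenger_id] = int(frequency)
    return dict(Counter(frequency_by_id.values()))


example = "1234#4 346345#6 214321#1 54325423#4"
# The second value repeats passenger 1234 to show the deduplication.
result = histogram_from_pairs([example, "1234#4"])
```

With tens of millions of passengers, this loop runs on the client over every returned id, which is exactly the "big result" problem listed above.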

Solution 3: The parent/child facet way

The query would proceed as follows:
I find travels that match filter1, I go to the parent, I count the travels for
this id that match filter2, and then I use a facet (probably a range facet or
a custom facet) to get the histogram that I want.
I'm not sure that's possible...
I am still learning how to use Elasticsearch, and parent/child is one thing
that I don't fully understand yet.

Solution 4: The parent/child/child way

The parents are travels, then passengers, then the travels done by these
passengers.
Filter on travels, then filter on the travels done by the passengers, then
count.
Problem: - uniqueness of passengers
- billions of documents

So my questions are:
Am I the only one around here who has trouble aggregating frequencies with
filters on everything?
Is there a well-known solution?
Do you have any thoughts about these different solutions?

I think I lost everybody, but to those who got this far and to the others, thanks.

Cheers,

Julien

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.

Hi Julien,

I think I got there. Do you mean something like this:
http://elasticsearch-users.115913.n3.nabble.com/facet-and-grouping-td4020055.html?

In particular, the suggestion in that post at:
http://elasticsearch-users.115913.n3.nabble.com/facet-and-grouping-tp4020055p4020100.html

I wrote my own grouping facet classes in Java, using his suggestions (and
exactly his separator ~~~). There are two things to do: Create the facet
with the script, and then post-process ES's results. Use of the
LinkedHashMap meant that I preserved the relative order of everything that
ES delivered to me.

So if I arrived at the same place you meant, this means you have a lot of
Java ahead of you. But it was worth the effort; I got very nice results. I
hear that ES 1.0 will include something like this, but my own solution is
working very well.
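
For readers who do not want to follow the links, here is a rough Python sketch of the post-processing half only. It assumes the facet script concatenates two values with the ~~~ separator, so that each facet term looks like "group~~~value". The sample terms are invented for illustration, and this is not the Java code described above.

```python
def group_facet_terms(terms, separator="~~~"):
    """Split "group~~~value" facet terms into {group: {value: count}}.

    Python dicts keep insertion order (3.7+), which plays the role that
    LinkedHashMap plays in Java: groups stay in the order ES returned them.
    """
    groups = {}
    for entry in terms:
        group, _, value = entry["term"].partition(separator)
        groups.setdefault(group, {})[value] = entry["count"]
    return groups


# Invented facet output, ordered by count as a terms facet would return it.
terms = [
    {"term": "Paris~~~Marseille", "count": 3},
    {"term": "Marseille~~~Paris", "count": 1},
    {"term": "Paris~~~Lyon", "count": 1},
]
grouped = group_facet_terms(terms)
```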

Brian

Hello,

I think ES 1.0 could lead to some improvement. I'll just have to do a range
facet on a count facet using filter with a parent/child relation.

For the moment I think I will use a parent/child relation and a range
facet. I will precompute the frequency for each passenger, use passengers
as parents and travels as children, and find passengers using a has_child
filter.
With this method I'll lose the ability to define filters for the
frequency computation.
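
As a sketch of what I have in mind, written as a Python dict. The child type name `travel` and the precomputed `frequency` field on the passenger document are hypothetical names. I am assuming that "from" is inclusive and "to" exclusive in the range facet, so that each bucket holds exactly one frequency.

```python
def passengers_by_frequency_request(travel_filter, max_frequency=5):
    """Search passengers (parents) that have at least one matching travel."""
    # One bucket per frequency; assumes "from" inclusive and "to" exclusive.
    ranges = [{"from": f, "to": f + 1} for f in range(1, max_frequency + 1)]
    ranges.append({"from": max_frequency + 1})  # open-ended last bucket
    return {
        "size": 0,  # only the facet is needed, not the hits
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "has_child": {
                        "type": "travel",
                        "query": {
                            "filtered": {
                                "query": {"match_all": {}},
                                "filter": travel_filter,
                            }
                        },
                    }
                },
            }
        },
        "facets": {
            "frequencies": {"range": {"field": "frequency", "ranges": ranges}}
        },
    }


request = passengers_by_frequency_request({"term": {"origin": "Paris"}}, 2)
```

A parent matches once however many of its children match, which gives the uniqueness of passengers for free. The price is that `frequency` is fixed at indexing time.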

Your solution could lead to client-side processing, and I'm not sure that's
a good idea given the number of passengers that I could have. I think I'll
compare these two approaches.

Thanks,

Julien

On Friday, September 27, 2013 7:00:10 PM UTC+2, InquiringMind wrote:

> Hi Julien,
>
> I think I got there. Do you mean something like this: [...]

Hi, Julien

> I think ES 1.0 could lead to some improvement. I'll just have to do a range
> facet on a count facet using filter with a parent/child relation.

That's what I read. I just couldn't wait!

> For the moment I think I will use a parent/child relation and a range
> facet. I will compute frequency for each passenger before. Use passengers
> as parent and travels as child and find passengers using an has_child
> filter.
> Using this method I'll lose the possibility of defining filters for
> frequency computation.
>
> Your solution could lead to client-side processing, and I'm not sure
> that's a good idea given the number of passengers that I could have. I
> think I'll compare these two approaches.

Well, the link suggests a Javascript solution which would likely reside in
the client, though an intermediate NodeJS instance would also reduce the