Hello,
I have a data modeling problem that is fairly complex for me. I have
several solutions in mind, but each of them has drawbacks.
Here is my use case:
I have passengers who travel, and I want to know the frequency of these travels:
the number of passengers by travel frequency. Pretty simple, isn't it?
The complexity comes from 3 things:
1 - Filter on travels. I want to find passengers using filters that apply to
travels, for example: origin, destination, date of the travel.
2 - Filter on the frequency calculation. I want to calculate frequencies
using other filters. Once I have found a passenger, I want to count his
travels over another period of time (for example, frequency of travel during
the last month).
3 - Uniqueness of passengers. If a passenger matches the filters on travels
twice, I don't want to count his frequency of travel
twice.
I will probably have tens of millions of passengers.
An example of the final chart that I want: number of passengers who are on
a Paris-Marseille travel, by frequency of travels during the last month over
all travels:
https://lh5.googleusercontent.com/-oNzwV1qmFjA/UkU3yhyxxeI/AAAAAAAAAdo/UK750NoXucI/s1600/2013-09-27+09_45_51-Microsoft+Excel+-+Classeur1.png
Other examples:
Tom has traveled two times:
24/12/2013 at 08:00:00 from Paris to Marseille
26/12/2013 at 10:00:00 from Marseille to Paris
Bob has traveled 1 time:
24/12/2013 at 12:00:00 from Paris to Marseille
John has traveled 1 time:
24/12/2013 at 08:00:00 from Paris to Marseille
First case:
I want to find travelers who traveled on 24/12/2013 from Paris.
I want to calculate the frequency of travels during the 12th month.
What I want is a histogram with two values:
Frequency of travels: 1, number of passengers: 2
Frequency of travels: 2, number of passengers: 1
In English: two passengers who left Paris on 24/12/2013 traveled once during
the 12th month.
One passenger traveled twice.
Second case:
Same thing for travelers: I want to find travelers who traveled on
24/12/2013 from Paris.
I want to calculate the frequency of travels from Paris during the last month.
What I will get is this histogram:
Frequency of travels: 1, number of passengers: 3
In English: three passengers from Paris traveled once from Paris during
the last month.
Third case:
I want to find all travelers.
I want to calculate the frequency on all data.
What I will get is:
Frequency of travels: 1, number of passengers: 2
Frequency of travels: 2, number of passengers: 1
and not:
Frequency of travels: 1, number of passengers: 2
Frequency of travels: 2, number of passengers: 2
because of the uniqueness of travelers (Tom matches with two travels but must be counted only once).
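
To make the expected results of the three cases concrete, here is a minimal in-memory sketch in Python. It is not an Elasticsearch solution, only a reference for the semantics I am after. The tuple layout and the helper names (frequency_histogram, from_paris_on_24, in_december, everything) are made up for the illustration, and I assume "12th month" and "last month" both mean December 2013. Passengers with no travel matching the second filter are simply left out of the histogram.

```python
from collections import Counter
from datetime import date, datetime

# Sample travels from the examples above: (passenger, departure, origin, destination).
TRAVELS = [
    ("Tom", datetime(2013, 12, 24, 8), "Paris", "Marseille"),
    ("Tom", datetime(2013, 12, 26, 10), "Marseille", "Paris"),
    ("Bob", datetime(2013, 12, 24, 12), "Paris", "Marseille"),
    ("John", datetime(2013, 12, 24, 8), "Paris", "Marseille"),
]


def frequency_histogram(travels, select, count):
    """Return {frequency of travels: number of distinct passengers}."""
    # Step 1: a set keeps each matching passenger only once (uniqueness).
    passengers = {t[0] for t in travels if select(t)}
    # Step 2: count the travels of the selected passengers under the second filter.
    per_passenger = Counter(
        t[0] for t in travels if t[0] in passengers and count(t)
    )
    # Step 3: count passengers per frequency, sorted by frequency.
    return dict(sorted(Counter(per_passenger.values()).items()))


def from_paris_on_24(t):
    return t[2] == "Paris" and t[1].date() == date(2013, 12, 24)


def in_december(t):
    return (t[1].year, t[1].month) == (2013, 12)


def everything(t):
    return True


case1 = frequency_histogram(TRAVELS, from_paris_on_24, in_december)
case2 = frequency_histogram(
    TRAVELS, from_paris_on_24, lambda t: in_december(t) and t[2] == "Paris"
)
case3 = frequency_histogram(TRAVELS, everything, everything)

print(case1)  # {1: 2, 2: 1}
print(case2)  # {1: 3}
print(case3)  # {1: 2, 2: 1}
```

The set in step 1 is what prevents Tom from being counted twice in the third case.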
I think you get the idea. Or not.
Solution 1: The easy way: 2 or 3 queries
The idea is straightforward:
One query to get the ids of passengers matching the filters on travels.
With all these ids, count the number of travels per id matching the filters on frequency.
Aggregate the number of passengers by frequency.
I could probably use a range facet with a script to count travels for each
passenger and directly get the data for my histogram.
Problem: -Query with potentially millions of ids
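
One possible way around the size of the second query would be to send the ids in batches and merge the partial histograms on the client. The sketch below only shows that merging logic in Python. The count_travels callable is hypothetical: it stands for whatever query returns {passenger id: number of travels} for one batch of ids, and the batch size is an arbitrary assumption. Since each unique id lands in exactly one batch, adding up the partial histograms does not count anybody twice.

```python
from collections import Counter


def batched(ids, size):
    """Yield successive lists of at most `size` ids."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def histogram_in_batches(passenger_ids, count_travels, batch_size=10000):
    """Return {frequency: number of passengers}, querying ids batch by batch.

    `count_travels` is a hypothetical callable: it takes a list of ids and
    returns {id: number of travels matching the frequency filters}.
    """
    histogram = Counter()
    # Deduplicate first so a passenger found twice is only queried once.
    unique_ids = sorted(set(passenger_ids))
    for batch in batched(unique_ids, batch_size):
        counts = count_travels(batch)
        # Add this batch's "passengers per frequency" to the running total.
        histogram.update(Counter(counts.values()))
    return dict(sorted(histogram.items()))


# Usage with a fake in-memory counter instead of a real query.
FAKE_COUNTS = {"Tom": 2, "Bob": 1, "John": 1}


def fake_count_travels(batch):
    return {pid: FAKE_COUNTS[pid] for pid in batch}


result = histogram_in_batches(
    ["Tom", "Bob", "John", "Tom"], fake_count_travels, batch_size=2
)
print(result)  # {1: 2, 2: 1}
```

This keeps each request small, but the total amount of ids transferred stays the same, so it only softens the problem.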
Solution 2: The cheating way
Precompute the frequency of travels for each passenger within the perimeter
of the frequency filter.
A document could look as follows:
{
"filter_on_travels": "value1",
"list_of_passengers" : "1234#4 346345#6 214321#1 54325423#4"
}
A terms facet on list_of_passengers gives me uniqueness on ids. I just have to count
frequencies afterwards.
Problem: -The perimeter of the frequency cannot change
-Probably a big result (many, many ids)
-Post-processing of the result
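
The processing of the result could look like the Python sketch below. It assumes each token of list_of_passengers reads as id#frequency, and that a given passenger always carries the same precomputed frequency in every document, so merging into a dict keyed by id is enough to get uniqueness. The function names are made up for the illustration.

```python
from collections import Counter


def parse_passenger_list(value):
    """Parse "id#frequency id#frequency ..." into {id: frequency}."""
    frequencies = {}
    for token in value.split():
        passenger_id, _, frequency = token.partition("#")
        frequencies[passenger_id] = int(frequency)
    return frequencies


def histogram_from_documents(documents):
    """Return {frequency: number of distinct passengers} over all documents."""
    merged = {}
    for doc in documents:
        # A passenger present in several documents is kept only once.
        merged.update(parse_passenger_list(doc["list_of_passengers"]))
    return dict(sorted(Counter(merged.values()).items()))


docs = [
    {
        "filter_on_travels": "value1",
        "list_of_passengers": "1234#4 346345#6 214321#1 54325423#4",
    },
    # Second document (made up) sharing passenger 1234 with the first one.
    {"filter_on_travels": "value2", "list_of_passengers": "1234#4"},
]

print(histogram_from_documents(docs))  # {1: 1, 4: 2, 6: 1}
```

Passenger 1234 appears in both documents but is counted once under frequency 4.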
Solution 3: The parent/child facet way
The process of the query would be as follows:
I find travels that match filter1, I go to the parent, I count the travels for
this id that match filter2, and then I use a facet (probably a range facet or
a custom facet) to get the histogram that I want.
I'm not sure that's possible...
I am still learning how to use elasticsearch, and parent/child is one thing that I
don't fully understand yet.
Solution 4: The parent/child/child way
Parents are travels, then passengers, then travels done by these passengers.
Filter on travels, then filter on travels done by passengers, then count.
Problem: -Uniqueness of passengers
-Billions of documents
So my questions are:
Am I the only one around here who has trouble aggregating frequencies with
filters on everything?
Is there a well-known solution?
Do you have any thoughts about these different solutions?
I think I lost everybody, but to those who got this far, and to the others, thanks.
Cheers,
Julien