Hello,

I have a modelling problem that is fairly complex for me. I have several solutions in mind, but each of them has drawbacks.

Here is my use case:

I have passengers who travel, and I want to know the frequency of their travels: the number of passengers by travel frequency. Pretty simple, isn't it?

The complexity comes from 3 things:

1 - Filters on travels. I want to find passengers using filters on their travels, for example: origin, destination, date of the travel.

2 - Filters on the frequency calculation. I want to calculate frequencies using a different set of filters. Once I have found a passenger, I want to count his travels over another period of time (for example, the frequency of travel during the last month).

3 - Uniqueness of passengers. If a passenger matches the filters on travels twice, I don't want to count his frequency of travel twice.

I will probably have tens of millions of passengers.

An example of the final chart that I want: the number of passengers who are on a Paris-Marseille travel, by frequency of travels during the last month over all travels:

https://lh5.googleusercontent.com/-oNzwV1qmFjA/UkU3yhyxxeI/AAAAAAAAAdo/UK750NoXucI/s1600/2013-09-27+09_45_51-Microsoft+Excel+-+Classeur1.png

Other examples:

Tom has traveled two times:

24/12/2013 at 08:00:00 from Paris to Marseille

26/12/2013 at 10:00:00 from Marseille to Paris

Bob has traveled 1 time:

24/12/2013 at 12:00:00 from Paris to Marseille

John has traveled 1 time:

24/12/2013 at 08:00:00 from Paris to Marseille

First case:

I want to find travelers who traveled from Paris on 24/12/2013

I want to calculate the frequency of travels during the 12th month

What I want is a histogram with two values:

Frequency of travels: 1, number of passengers: 2

Frequency of travels: 2, number of passengers: 1

In English: two passengers who left from Paris on 24/12/2013 made one travel during the 12th month, and one passenger made two travels.

Second case:

Same thing for travelers: I want to find travelers who traveled from Paris on 24/12/2013

I want to calculate the frequency of travels from Paris during the last month

What I will get is this histogram:

Frequency of travels: 1, number of passengers: 3

In English: three passengers from Paris each made one travel from Paris during the last month.

Third case:

I want to find all travelers

I want to calculate frequency on all data

What I will get is:

Frequency of travels: 1, number of passengers: 2

Frequency of travels: 2, number of passengers: 1

and not:

Frequency of travels: 1, number of passengers: 2

Frequency of travels: 2, number of passengers: 2

because of the uniqueness of travelers: each traveler must be counted only once
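To make the three cases concrete, here is a minimal in-memory Python sketch of the expected results. It is only an illustration of the logic, not an Elasticsearch query: the function and filter names are hypothetical, and I assume "last month" and "the 12th month" both mean December 2013.

```python
from collections import Counter
from datetime import date, datetime

# Sample travels from the examples: (passenger, departure, origin, destination).
TRAVELS = [
    ("Tom", datetime(2013, 12, 24, 8), "Paris", "Marseille"),
    ("Tom", datetime(2013, 12, 26, 10), "Marseille", "Paris"),
    ("Bob", datetime(2013, 12, 24, 12), "Paris", "Marseille"),
    ("John", datetime(2013, 12, 24, 8), "Paris", "Marseille"),
]


def frequency_histogram(travels, travel_filter, frequency_filter):
    """Return {frequency: number of passengers} (hypothetical helper)."""
    # Step 1: a set keeps each passenger once, however many travels matched.
    passengers = {t[0] for t in travels if travel_filter(t)}
    # Step 2: count each selected passenger's travels under the second filter.
    per_passenger = Counter(
        t[0] for t in travels if t[0] in passengers and frequency_filter(t)
    )
    # Step 3: group passengers by their travel count.
    return dict(Counter(per_passenger.values()))


def from_paris_on_24(t):
    return t[2] == "Paris" and t[1].date() == date(2013, 12, 24)


def in_december(t):
    # Assumption: "last month" / "12th month" means December 2013.
    return (t[1].year, t[1].month) == (2013, 12)


def everything(t):
    return True


case1 = frequency_histogram(TRAVELS, from_paris_on_24, in_december)
case2 = frequency_histogram(
    TRAVELS, from_paris_on_24, lambda t: in_december(t) and t[2] == "Paris"
)
case3 = frequency_histogram(TRAVELS, everything, everything)
```

Note that in this sketch a passenger who matches the first filter but has no travel matching the second one simply does not appear in the histogram.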

I think you get the idea. Or not.

Solution 1: The easy way: 2 or 3 queries

The idea is straightforward:

One query to get the ids of the passengers matching the filters on travels.

With all these ids, count the number of travels per id matching the filters on frequency.

Aggregate these frequencies by number of passengers.

I could probably use a range facet with a script to count the travels of each passenger and directly get the data for my histogram.

Problem: - A query with potentially millions of ids

Solution 2: The cheating way

Preprocess the frequency of travels for each passenger within the perimeter of the frequency filter.

A document could look as follows:

{

"filter_on_travels": "value1",

"list_of_passengers" : "1234#4 346345#6 214321#1 54325423#4"

}

A terms facet on list_of_passengers gives uniqueness on ids. I then just have to count the frequencies.

Problems: - The perimeter of the frequency calculation cannot change at query time

- Probably a big result (very many ids)

- The result needs post-processing
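The post-processing step of Solution 2 could look roughly like the Python sketch below. It assumes the "id#frequency" tokens from the matching documents are already available on the client side; the function name and the second document are hypothetical and only there to show the deduplication.

```python
from collections import Counter


def histogram_from_docs(docs):
    """Build {frequency: number of passengers} from "id#frequency" tokens."""
    # Collect distinct tokens, roughly what a terms facet would return.
    tokens = set()
    for doc in docs:
        tokens.update(doc["list_of_passengers"].split())

    # One entry per passenger id, so a passenger found twice counts once.
    frequency_by_id = {}
    for token in tokens:
        passenger_id, frequency = token.split("#")
        frequency_by_id[passenger_id] = int(frequency)

    return dict(Counter(frequency_by_id.values()))


docs = [
    {
        "filter_on_travels": "value1",
        "list_of_passengers": "1234#4 346345#6 214321#1 54325423#4",
    },
    # Hypothetical second document sharing passenger 1234 with the first one.
    {"filter_on_travels": "value2", "list_of_passengers": "1234#4 99#1"},
]

histogram = histogram_from_docs(docs)
```

Passenger 1234 appears in both documents but is counted once, which is the uniqueness I need. The cost is that every token has to be pulled back and processed outside of elasticsearch.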

Solution 3: The parent/child facet way

The query would proceed as follows:

I find the travels that match filter1, I go to the parent, I count the travels for this id that match filter2, and then I use a facet (probably a range facet or a custom facet) to get the histogram that I want.

I'm not sure that's possible...

I am still learning how to use elasticsearch, and parent/child is one thing that I don't fully understand yet.

Solution 4: The parent/child/child way

Parents are travels, then passengers, then travels done by these passengers.

Filter on travels, then filter on the travels done by passengers, then count.

Problems: - Uniqueness of passengers

- Billions of documents

So my questions are:

Am I the only one around here who has trouble aggregating frequencies with filters on everything?

Is there a well-known solution?

Do you have any thoughts about these different solutions?

I think I lost everybody, but to those who got this far, and to the others, thanks.

Cheers,

Julien
