Pick latest child records of each parent and aggregate data

(Rajesh) #1

Hi ,
I have the following problem

Lets say I have a parent called parent-1 which has childs child-1(version:1,action:NOACTION),child-2(version:2,action:REVOKED) and child-3(version:3,action:CERTIFIED). Child 3 is the latest and currently active version.

Similarly I have another parent parent-2 with child-4 (version:1,action:NOACTION) and child-5(version:2,action:REVOKE) where child-5 is current and active version.

I want to pick the latest and currently active child document (child-3 and child-5) and aggregate them based on action taken.

My Final result should be
CERTIFY : 1
REVOKE : 1

Please let me know the approach to take.

Following are my mappings and data

put testparentchildnew
{
"mappings": {
"_doc": {
"properties": {
"prop1": {
"type": "keyword"
},
"prop2": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"doctype": {
"type": "keyword"
},
"childaction": {
"type": "keyword"
},
"childprop1": {
"type": "keyword"
},
"docversion" : {
"type" : "integer"
},
"childprop2": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"campaign_join_relation": {
"type": "join",
"relations": {
"parent": "child"
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}
}


post testparentchildnew/_doc/1
{
"prop1" : "parent1prop1",
"prop2" :"parent1prop2" ,
"doctype" : "parent",
"campaign_join_relation": {
"name" : "parent"
}
}

post testparentchildnew/_doc/2
{
"prop1" : "parent2prop1",
"prop2" :"parent2prop2" ,
"doctype" : "parent",
"campaign_join_relation": {
"name" : "parent"
}
}

-- child --
post testparentchildnew/_doc?routing=1
{
"childprop1" : "child1prop1",
"childprop2" :"child1prop2" ,
"childaction" : "NOACTION",
"docversion" : 1,
"doctype" : "child",
"campaign_join_relation": {
"name" : "child",
"parent" : "1"
}
}

post testparentchildnew/_doc?routing=1
{
"childprop1" : "child1prop1..",
"childprop2" :"child1prop2.." ,
"childaction" : "REVOKE",
"docversion" : 2,
"doctype" : "child",
"campaign_join_relation": {
"name" : "child",
"parent" : "1"
}
}

post testparentchildnew/_doc?routing=1
{
"childprop1" : "child1prop1..1",
"childprop2" :"child1prop2..1" ,
"childaction" : "CERTIFY",
"docversion" : 3,
"doctype" : "child",
"campaign_join_relation": {
"name" : "child",
"parent" : "1"
}
}

post testparentchildnew/_doc?routing=2
{
"childprop1" : "child2prop1",
"childprop2" :"child2prop2" ,
"childaction" : "NOACTION",
"docversion" : 2,
"doctype" : "child",
"campaign_join_relation": {
"name" : "child",
"parent" : "2"
}
}

post testparentchildnew/_doc?routing=2
{
"childprop1" : "child2prop1..",
"childprop2" :"child2prop2.." ,
"childaction" : "REVOKE",
"docversion" : 2,
"doctype" : "child",
"campaign_join_relation": {
"name" : "child",
"parent" : "2"
}
}

(Mark Harwood) #2

Hi Rajesh,
If I read the question correctly you're trying to do aggregations on the last-known-state of all the parents.
That's tough to do at scale because while you may have only 3 states in your results (NOACTION / REVOKED / CERTIFIED) you have potentially many parents to consider along the way. Each of the parent IDs will need loading into memory and have their last state derived. I expect this is all possible using a scripted aggregation but it may have memory issues.

How many parents do you expect to have?
If it's too many you may need to update parents with their last known state at index time

(Rajesh) #3

Hi Mark,

Thank You for your response.

Following is our use-case

We have an application which allows Active Directory Administrators to view existing access in current system and take action.

Administrators can take action on individual items or take bulk action on selected items.

We have used elastic for this application. In our current implementation, we have created a single index. The documents inside this contain field storing which stores action taken by administrator.

We are facing performance issues with this approach since administrators can take actions on same item (both individual as well as bulk) multiple times, causing multiple/frequent updates.

Since elastic is not meant for frequent updates, we are thinking of rewriting the application with parent/child approach, where in parent will store the information about access (which will always remain same) and child will have action information(variable).

So for each action taken by admins, we were planning to insert a child entry. The last child entry(based on version) is the one which is the active entry.

This will also solve our audit problem which we have with current system since we can treat each child entry as an audit log of action taken.

This approach fulfills most of our use-case except one where we will have to provide administrators view of what action he has taken(CERTIFY/REVOKE) and the count of accounts on which this action is taken since that will require us to filter last parent child.

The approach you suggested (updating parent with latest child version) will involve updating parent on each action taken by administrator which is what we want to avoid.

Is there any other approach we can try and evaluate? Or if you can guide us to what is the recommended approach in this scenario, it would be great

We are anticipating around 1-5 million parent records in a single index.

Thank You,

Rajesh

(Mark Harwood) #4

Thanks for the detailed explanation - I suppose the more qualified question then is what is the biggest number of unique parent IDs in any one search request? If I understand the business problem correctly I think that translates to "how many accounts does the biggest administrator govern?".

What we're trying to balance here in the solution is the cost of fusion at query-time (lots of memory required) versus fusion at index-time (lots of IO performing updates).

(Rajesh) #5

Each Administrator can have on average 10,000 account to review . Each account can be associated with average 10 AD groups, so on an average each administrator need to review 100,000 items.

If I consider delegation use cases(one admin delegating his access review to other due to vacation), it might go upto 300,000.

If we consider unique parent ids it would be between 1-5 millions as we are planning to add all data in one index and filter based on who can review the records.

(Mark Harwood) #6

I think this information will be a challenge to summarise at query time.

This aggregation is simplified if you maintain a document per account with the current status. You can change many account statuses at index-time if you use the update_by_query command.
Another benefit is that look-ups to see if an individual account is active or revoked become easier if you can GET a single document with the current account state rather than performing a search across all shards to find the latest logged status change event.

(Rajesh) #7

Hi Mark,

Thank you for your insight. We are planning to use database for storing frequent updates and sync to elastic later once a day.

This has led us to another problem

I want to know what would be my latest child record. I can use inner_hits with descending sort on update date stored in child (specify size as 1) to get the desired result.

Is it possible to filter these inner hits ? What I mean to say is that, lets say I have two child records
{ name:a , date: 2018-12-12} , {name:b,date:2018-12-13} having same parent. Inner hits with date sorting(desc) and size=1 will give me record {name:b, date:2018-12-13}. Is it possible to also search for name a in latest record? In plain English, it would be to find parent having latest child record containing name 'a' . For example, in the scenario which I gave earlier, since the latest record is {name:b, date:2018-12-13}, the parent would not appear in result.

Also does inner_hits has performance penalty? I read somewhere that it impact search result timings.

(system) closed #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.