Graph API Query with multiple hops

Hi I have data that looks like this:

PUT g_explore/_doc/0
{  "lot_num":"A1", "out_batch": "BATCH940","input_batch": "BATCH770","produce_consume": "BATCH770 BATCH940"}
PUT g_explore/_doc/1
{  "lot_num":"A2-0", "out_batch": "BATCH770","input_batch": "BATCH330","produce_consume": "BATCH770 BATCH330"}
PUT g_explore/_doc/2
{ "lot_num":"A2-2", "out_batch": "BATCH770","input_batch": "BATCH329","produce_consume": "BATCH329 BATCH770"}
PUT g_explore/_doc/3
{ "lot_num":"A3-1", "out_batch": "BATCH330","input_batch": "BATCH328","produce_consume": "BATCH328 BATCH330"}

Where out_batch is the current batch and input_batch is the batch that created the lot, i.e. the parent batch. In other words:

Batch328 was used to make Batch330
Batch330 & Batch329 were used to make Batch770
Batch770 was used to make Batch940.

This is a simple example, but the idea holds. In practice there may be as many as 40 layers. I've been trying to write a Graph API query that starts from the most recent child lot (A1 in this case) and returns the chain of all the batches.

I expect to get the links between all of these batches, but I can only get one step at a time. I can't find any examples of what the docs describe: "Connections can be nested inside the connections object to explore additional relationships in the data. Each level of nesting is considered a hop, and proximity within the graph is often described in terms of hop depth."
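For reference, the nesting the docs describe looks roughly like this — each connections object nested inside another is one additional hop. (This is a sketch of the syntax only, not verified against this data; at each hop the API searches for the terms discovered so far, so which field each hop crawls on matters.)

POST g_explore/_graph/explore
{
  "query": {
    "terms": { "lot_num.keyword": ["A1"] }
  },
  "vertices": [
    { "field": "lot_num.keyword", "size": 20, "min_doc_count": 1 }
  ],
  "connections": {
    "vertices": [
      { "field": "input_batch.keyword", "size": 20, "min_doc_count": 1 }
    ],
    "connections": {
      "vertices": [
        { "field": "input_batch.keyword", "size": 20, "min_doc_count": 1 }
      ]
    }
  }
}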

Here is my query:

POST g_explore/_graph/explore
{
  "query": {
    "terms": {
      "lot_num.keyword": ["A1"]
    }
  },
  "controls": {
    "use_significance": false,
    "sample_size": 5,
    "timeout": 5000
  },
  "connections": {
    "vertices": [
      {
        "field": "out_batch.keyword",
        "size": 20,
        "min_doc_count": 1
      },
      {
        "field": "input_batch.keyword",
        "size": 20,
        "min_doc_count": 1
      }
    ]
  },
  "vertices": [
    {
      "field": "lot_num.keyword",
      "size": 20,
      "min_doc_count": 1
    }
  ]
}

Which gives:

{
  "took": 0,
  "timed_out": false,
  "failures": [],
  "vertices": [
    {
      "field": "lot_num.keyword",
      "term": "A1",
      "weight": 1,
      "depth": 0
    },
    {
      "field": "input_batch.keyword",
      "term": "BATCH770",
      "weight": 0.475,
      "depth": 1
    },
    {
      "field": "out_batch.keyword",
      "term": "BATCH940",
      "weight": 0.475,
      "depth": 1
    }
  ],
  "connections": [
    {
      "source": 0,
      "target": 1,
      "weight": 0.475,
      "doc_count": 1
    },
    {
      "source": 0,
      "target": 2,
      "weight": 0.475,
      "doc_count": 1
    }
  ]
}

Great, it got the first set of input/output batches, but how do I get the next hop?

Thank you in advance.

Hi Jennifer.
Sadly, Elasticsearch mappings do not contain enough information for the Graph API to do everything that you require.
It doesn't have knowledge of which values from one field can be found in another. To use an email analogy, it does not know that a "sender" email address may also be found in the "recipient" field of other emails.
You can get around this issue to some extent by copying all values (senders and recipients) into a role-neutral field like "participant" for use in the Graph API. Your "produce_consume" field looks like an example of that (if it were a keyword array), and you could try using it.
The disadvantage is that the combined field (like my email "participant" example) has no sense of direction: it has lost sight of who was the sender and who was the recipient. So when you use such a field to "crawl", you are just finding any connection, not those of a particular relationship. If you used your produce_consume field to crawl, you would be chasing both the parent and child relationships in either direction rather than just following child relations.
The alternative approach is to do the joining work in your client, using the regular search API with "terms" queries and aggregations repeatedly for each hop.
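A minimal sketch of that client-side loop, using an in-memory list as a stand-in for the index (field names are the ones from the example docs; against a real cluster, each round of the loop would be one search with a `terms` query on `out_batch.keyword`):

```python
# Client-side hop crawl: repeatedly look up docs whose out_batch is in the
# current frontier, then use their input_batch values as the next frontier.
# DOCS stands in for Elasticsearch here; with a real cluster each round would
# be one search with {"terms": {"out_batch.keyword": sorted(frontier)}}.
DOCS = [
    {"lot_num": "A1",   "out_batch": "BATCH940", "input_batch": "BATCH770"},
    {"lot_num": "A2-0", "out_batch": "BATCH770", "input_batch": "BATCH330"},
    {"lot_num": "A2-2", "out_batch": "BATCH770", "input_batch": "BATCH329"},
    {"lot_num": "A3-1", "out_batch": "BATCH330", "input_batch": "BATCH328"},
]

def crawl_parents(start_batch, max_hops=40):
    """Follow input_batch -> out_batch links, one hop per round."""
    seen = {start_batch}
    frontier = {start_batch}
    edges = []
    for _ in range(max_hops):
        hits = [d for d in DOCS if d["out_batch"] in frontier]
        next_frontier = set()
        for d in hits:
            edges.append((d["input_batch"], d["out_batch"]))
            if d["input_batch"] not in seen:
                next_frontier.add(d["input_batch"])
        seen |= next_frontier
        if not next_frontier:
            break  # no new parents discovered; the chain is complete
        frontier = next_frontier
    return edges

edges = crawl_parents("BATCH940")
```

Each iteration is one hop; a real implementation would also de-duplicate across trees and cap the hop count (hence the `max_hops=40`, matching the 40 layers mentioned above).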
Hope this makes sense.

Hi Mark,
Thanks for your reply. If I have a field that connects the documents together, term_lot, then it would be okay if it found all connections, because the data is only connected in one direction. My end goal is to create a tree graph in my client. My thinking is that since I know my final terminal child (the last thing being produced), if I have all of the connections then the tree should reveal itself.

If I add the term_lot field and change produce_consume to an array, as such:

PUT g_explore/_doc/0
{  "lot_num":"A1", "out_batch": "BATCH940","input_batch": "BATCH770","produce_consume": ["BATCH770","BATCH940"],"term_lot":"A4"}
PUT g_explore/_doc/1
{  "lot_num":"A2-0", "out_batch": "BATCH770","input_batch": "BATCH330","produce_consume": ["BATCH770","BATCH330"],"term_lot":"A4"}
PUT g_explore/_doc/2
{ "lot_num":"A2-2", "out_batch": "BATCH770","input_batch": "BATCH329","produce_consume": ["BATCH329","BATCH770"],"term_lot":"A4"}
PUT g_explore/_doc/3
{ "lot_num":"A3-1", "out_batch": "BATCH330","input_batch": "BATCH328","produce_consume": ["BATCH328","BATCH330"],"term_lot":"A4"}

Using the query:

POST g_explore/_graph/explore
{
  "query": {
    "terms": {
      "term_lot.keyword": ["A4"]
    }
  },
  "controls": {
    "use_significance": false,
    "sample_size": 5,
    "timeout": 5000
  },
  "connections": {
    "vertices": [
      {
        "field": "produce_consume.keyword",
        "size": 20,
        "min_doc_count": 1
      }
    ]
  },
  "vertices": [
    {
      "field": "lot_num.keyword",
      "size": 20,
      "min_doc_count": 1
    }
  ]
}

I get:

{
  "took": 0,
  "timed_out": false,
  "failures": [],
  "vertices": [
    {
      "field": "produce_consume.keyword",
      "term": "BATCH328",
      "weight": 0.03125,
      "depth": 1
    },
    {
      "field": "produce_consume.keyword",
      "term": "BATCH940",
      "weight": 0.03125,
      "depth": 1
    },
    {
      "field": "lot_num.keyword",
      "term": "A2-0",
      "weight": 0.25,
      "depth": 0
    },
    {
      "field": "lot_num.keyword",
      "term": "A2-2",
      "weight": 0.25,
      "depth": 0
    },
    {
      "field": "lot_num.keyword",
      "term": "A3-1",
      "weight": 0.25,
      "depth": 0
    },
    {
      "field": "lot_num.keyword",
      "term": "A1",
      "weight": 0.25,
      "depth": 0
    },
    {
      "field": "produce_consume.keyword",
      "term": "BATCH770",
      "weight": 0.09375,
      "depth": 1
    },
    {
      "field": "produce_consume.keyword",
      "term": "BATCH330",
      "weight": 0.0625,
      "depth": 1
    },
    {
      "field": "produce_consume.keyword",
      "term": "BATCH329",
      "weight": 0.03125,
      "depth": 1
    }
  ],
  "connections": [
    {
      "source": 4,
      "target": 7,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 5,
      "target": 6,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 3,
      "target": 6,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 5,
      "target": 1,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 2,
      "target": 7,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 2,
      "target": 6,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 3,
      "target": 8,
      "weight": 0.03125,
      "doc_count": 1
    },
    {
      "source": 4,
      "target": 0,
      "weight": 0.03125,
      "doc_count": 1
    }
  ]
}

Which, of course, includes everything, now that I'm searching by the term_lot. Is there a way to have my vertices include only the lot_num, but use the information in the produce_consume field to make the connection (edge)?

@Mark_Harwood1 Mark - also, what do you mean by using aggregations in the search query in this context?

I’m afraid I have more questions than answers at this point.

lot_num values look to be unique to each document, but each document represents an edge, not a vertex (it connects one input and one output).

Is the term_lot now a term that identifies the whole tree? If so just querying for the appropriate term_lot value will return all of your example documents which are all the edges required to draw a tree. Aren’t the raw docs just what you need? I’m not sure what use the graph API is in helping discover the tree’s elements or what aggregations could do to usefully summarise the data in the raw docs.

@Mark - Yes, I added the term_lot to define the set of documents that are related. I wanted to avoid doing this and let the graph naturally discover the tree, but at this point it seems like the most straightforward way. The downside is that I have to pre-process the data to determine the term_lot number before pushing it into Elastic, which means I have to pre-process all the data on every load. Not so cool.

I now have a pre-processing script, running nightly, that finds the term_lot and pushes the data to Elastic. The client app then uses the term_lot to get the full set of documents (10-100) of interest. I wanted to use the graph api to "connect" the documents together, since there can be many-to-many relationships in the data. It's like this:

Lot 1 and Lot 2 are used in Batch A to produce Lot 3 (which would have labels input_batch = A; out_batch = B). Now Lot 3 and Lot 0 are used in Batch B to produce Lot 4. Some of Lot 2 & Lot 4 and some of Lot 0 are used in Batch C to make Lot 5. But all I care about are the lot connections.

Lot 1 + Lot 2 = Lot 3; 
Lot 3 + Lot 0 = Lot 4;
Lot 2 + Lot 4 + Lot 0 = Lot 5
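For what it's worth, lot-to-lot links like these can be recovered client-side by joining the raw docs on the batch ids: a lot P is a parent of lot C when P's out_batch equals C's input_batch. A sketch using the A-series example docs from earlier in the thread:

```python
# Each doc links one input batch to one output batch for a given lot.
DOCS = [
    {"lot_num": "A1",   "out_batch": "BATCH940", "input_batch": "BATCH770"},
    {"lot_num": "A2-0", "out_batch": "BATCH770", "input_batch": "BATCH330"},
    {"lot_num": "A2-2", "out_batch": "BATCH770", "input_batch": "BATCH329"},
    {"lot_num": "A3-1", "out_batch": "BATCH330", "input_batch": "BATCH328"},
]

# Index lots by the batch they produced.
producers = {}
for d in DOCS:
    producers.setdefault(d["out_batch"], []).append(d["lot_num"])

# Join: the lots that produced my input_batch are my parent lots.
lot_edges = [(parent, d["lot_num"])
             for d in DOCS
             for parent in producers.get(d["input_batch"], [])]
```

The resulting (parent, child) pairs are exactly the edges needed to draw the lot tree in the client, with the batches reduced to join keys.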

Your question makes me think maybe I'm organizing this wrong. I've been thinking my documents are the nodes, and the edges are just the links between them (the batches)...

Got it. I suspect the graph api could do the walking, but I’m not sure what triggers a walk:

  • a user wanting to know the status of job X?
  • a batch job rendering multiple trees?

If it’s the former you can probably crawl everything from X on demand.
If it’s the latter, I’m not sure how you’d safeguard against walking the same tree multiple times from different start points, e.g. crawling from X and crawling from Y could discover the same tree.