Data model for Graph

graph

(Geetha Ram) #1

I need help reviewing whether my data model is suitable for Graph:

I have two entities:
person : name, indicator
docket: docket_number

Person and docket have a many-to-many relationship. Each person can appear in one or more dockets, and each docket contains multiple persons.

I have tried the variations below for the fields in each document:
1. {name, indicator, docket_number}
2. {name_id, name, indicator, docket_id}, {docket_id, docket_number, name_id}

Variation 1 has given some positive results. I need to know if this is the best way to represent this model.

Instead of approach 1 or 2, should I set up the data for nodes and edges separately (similar to the Panama Papers example)?

Another question: Graph is showing only a subset of the string from a node value.
In this example, Elasticsearch indexed the Fullname key as a single string.

How can I get the whole full name to display in the Graph UI?

Thanks!


(Mark Walkom) #2

That looks more like an analysis issue in ES.
You probably want to make the fields not_analyzed so that they aren't broken up into tokens, or use a multi-field (aka .raw).

Take a look at https://www.elastic.co/guide/en/elasticsearch/guide/2.x/mapping-intro.html#_index_2 and https://www.elastic.co/guide/en/elasticsearch/reference/2.3/_multi_fields.html
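
As a rough illustration of why only part of a name shows up, here is a small Python sketch. `rough_analyze` is a hypothetical stand-in that only approximates what an analyzed string field does (lowercasing and splitting on non-word characters); it is not Elasticsearch code. `fullname_mapping` shows what a 2.x-style multi-field with a not_analyzed `raw` sub-field might look like, assuming the field is named `Fullname` as in the question.

```python
import json
import re


def rough_analyze(text):
    """Approximate an analyzed string field: lowercase, split on non-word characters."""
    return [term for term in re.split(r"\W+", text.lower()) if term]


def not_analyzed(text):
    """A not_analyzed field keeps the whole value as a single term."""
    return [text]


# Assumed 2.x-style mapping fragment: the main field stays analyzed for search,
# while the "raw" sub-field keeps the full, unbroken value.
fullname_mapping = {
    "Fullname": {
        "type": "string",
        "fields": {
            "raw": {"type": "string", "index": "not_analyzed"}
        },
    }
}

if __name__ == "__main__":
    print(rough_analyze("David James"))
    print(not_analyzed("David James"))
    print(json.dumps(fullname_mapping, indent=2))
```

With a mapping along these lines, Graph would be pointed at `Fullname.raw` rather than `Fullname`, so each vertex carries the whole name.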


(Mark Harwood) #3

Starting in version 5 (I presume you are using the alpha release based on your previous posts) there is a new "keyword" type that avoids splitting strings into tokens. Your example mapping, docs and graph query would be as follows:

DELETE test

PUT test
{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1
    },
    "mappings": {
        "docket": {
            "properties": {
                "person": {
                    "type": "keyword"
                },
                "docket": {
                    "type": "keyword"
                }
            }
        }
    }
}

POST test/docket
{
    "docket": "23423423",
    "person": ["Dianne Carr", "David James"]
}

GET test/_xpack/_graph/_explore
{
	"query": {
		"query_string": {
			"query": "james"
		}
	},
	"controls": {
		"use_significance": false,
		"sample_size": 20000,
		"timeout": 5000
	},
	"connections": {
		"vertices": [
			{
				"field": "docket",
				"size": 5,
				"min_doc_count": 1
			},
			{
				"field": "person",
				"size": 5,
				"min_doc_count": 1
			}
		]
	},
	"vertices": [
		{
			"field": "docket",
			"size": 5,
			"min_doc_count": 1
		},
		{
			"field": "person",
			"size": 5,
			"min_doc_count": 1
		}
	]
}

(Mark Harwood) #4

I realised we only covered your 2nd question and skipped the first one:

No, having multiple people in the same docket should be fine as in my example document with the array of people. The panama papers data was only indexed as classic "edges" with only 2 nodes because that's the way the ICIJ provide the data. A single elasticsearch document can be used to link more than one node (e.g. a docket and 50 people) without the verbosity of creating many edge records.
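
To make the verbosity point concrete, here is a minimal Python sketch. The helper `to_edge_records` is hypothetical and simply expands one docket document (in the shape of the example document above) into the classic one-edge-per-record form, so the two representations can be compared.

```python
def to_edge_records(docket_doc):
    """Expand one docket document into one (docket, person) edge record per person."""
    return [
        {"docket": docket_doc["docket"], "person": person}
        for person in docket_doc["person"]
    ]


# Same shape as the example document: one docket, an array of people.
docket_doc = {"docket": "23423423", "person": ["Dianne Carr", "David James"]}
edges = to_edge_records(docket_doc)

if __name__ == "__main__":
    for edge in edges:
        print(edge)
    print(len(edges), "edge records versus 1 document")
```

A docket with 50 people would need 50 such edge records, while the array form keeps it as a single document.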


(Geetha Ram) #5

Thank you. I will be installing the alpha version with X-Pack. Right now I have Kibana 4.5.3.
The data is exported to a CSV file and pushed into ES through Logstash.
Below is my document model. Does it work for identifying relationships between persons and dockets?
Each document would contain the following as single-valued fields:
{name, indicator, docket_number}

I have gone through resources on representing data in a graph model, including the example data from your Panama Papers blog post. Huge thanks for that post!
Should the data be modeled as nodes, with relationships as edges? (I guess this is not needed, as you mentioned during your deep-dive talk.) I just wanted to confirm.


(Mark Harwood) #6

It sounds like each "docket" is a complex object with multiple people, so it can't easily be represented in CSV form. If I understand your example correctly, you break it up into multiple rows, each of which has a single person and a docket number.
This will work with Graph, but there is an important distinction if you break up the docket like this: you can only draw connections between people via the docket numbers:


If you index the dockets as a single JSON document then you can dispense with docket numbers and just show connections between people where the line thickness summarises how many dockets they have in common:

This is because Graph summarises the many documents that might contain pairs of entities. Without a single JSON document connecting people directly you will always be forced to drag (potentially many) docket numbers down to the client.
If individual docket numbers are interesting to you then this is OK, otherwise you might want to figure out how to build a more accurate JSON representation of a docket.
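
One possible way to build that representation from the CSV export is sketched below in Python. It assumes the column names `name`, `indicator` and `docket_number` from the earlier post; the sample rows are made up for illustration (only the two names and the first docket number come from the example above), and the `indicator` column is simply dropped here. `rows_to_docket_docs` groups rows into one document per docket, and `shared_docket_counts` computes by hand the "dockets in common" figure that Graph would summarise as line thickness.

```python
import csv
import io
import json
from collections import Counter, defaultdict
from itertools import combinations

# Made-up sample rows in the one-person-per-row CSV layout.
SAMPLE_CSV = """name,indicator,docket_number
Dianne Carr,A,23423423
David James,B,23423423
David James,B,99887766
Dianne Carr,A,99887766
Alex Stone,A,99887766
"""


def rows_to_docket_docs(csv_text):
    """Group one-person-per-row CSV data into one document per docket."""
    people_by_docket = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        names = people_by_docket[row["docket_number"]]
        if row["name"] not in names:  # skip duplicate rows for the same person
            names.append(row["name"])
    return [
        {"docket": number, "person": names}
        for number, names in people_by_docket.items()
    ]


def shared_docket_counts(docs):
    """Count, for each pair of people, how many dockets they appear in together."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(doc["person"]), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    docs = rows_to_docket_docs(SAMPLE_CSV)
    for doc in docs:
        print(json.dumps(doc))  # one JSON document per docket, ready to index
    print(shared_docket_counts(docs))
```

Each printed line has the same shape as the example document with the array of people, so it could be indexed in place of the per-row records.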

