Building Graph Relationships Between Documents


(Andrew Stroz) #1

I have an index that contains a field with the text contents of .docx, .pptx and .pdf documents.

I have another index that holds documents that have a field with a string of characters representing a piece of equipment.

I would like to run a graph query that shows all of the related documents retrieved from full text search to all of the pieces of equipment.

Is this possible?


(Mark Harwood) #2

Is there a field the two indices share?

For the sake of argument, let's call that field "equipment_id" and assume it is of type keyword.

I'm guessing the field is of type text and that hidden in the text are some references to items of equipment. If the pattern of an equipment_id is sufficiently unique (e.g. always an 11-digit number), then it might be possible to use a regex to extract these values from the text and place them into a keyword-type field called equipment_id, which is an array. Let's also assume each document has a keyword field called doc_id.
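To make the extraction idea concrete, here is a minimal sketch of what it could look like at ingest time. It assumes the hypothetical 11-digit pattern from above and a text field named "content", which is a placeholder since the real field name has not been given. It only helps if the IDs really are that distinctive.

```python
import re

# Assumption: an equipment ID is always a standalone 11-digit number.
EQUIPMENT_ID_PATTERN = re.compile(r"\b\d{11}\b")


def extract_equipment_ids(text):
    """Return the distinct 11-digit IDs found in text, in first-seen order."""
    seen = {}
    for match in EQUIPMENT_ID_PATTERN.findall(text):
        seen.setdefault(match, None)
    return list(seen)


def enrich(doc, content_field="content"):
    """Return a copy of doc with an equipment_id array added.

    "content" is a hypothetical name for the full-text field.
    """
    enriched = dict(doc)
    enriched["equipment_id"] = extract_equipment_ids(doc.get(content_field, ""))
    return enriched


sample = {
    "doc_id": "manual-001",
    "content": "Pump 12345678901 feeds valve 10987654321; see 12345678901.",
}
print(enrich(sample)["equipment_id"])
```

The enriched document would then be indexed with equipment_id mapped as keyword, so each extracted value becomes a term the graph can use.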

Given this setup, it would be possible to create a graph of doc_id and equipment_id values and how they are connected, purely using the document index (ignoring the equipment index).
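As a rough illustration, the request body for such a graph could be built like this and sent to the Graph explore API (`POST documents/_graph/explore` in recent Elasticsearch versions). The index name "documents", the field name "content" and the size values are assumptions for the sake of the example. The min_doc_count of 1 is there because each doc_id presumably appears in only one document, and significance is switched off for the same reason.

```python
import json


def build_explore_request(search_text, content_field="content"):
    """Build a Graph explore body linking equipment_id terms to doc_id terms.

    content_field is a hypothetical name for the full-text field.
    """
    return {
        # Seed query: the full-text search that selects the documents.
        "query": {"match": {content_field: search_text}},
        "controls": {
            # Use plain popularity rather than statistical significance.
            "use_significance": False,
            "sample_size": 2000,
        },
        # First hop: equipment IDs found in the matching documents.
        "vertices": [
            {"field": "equipment_id", "size": 10, "min_doc_count": 1}
        ],
        # Second hop: the documents those equipment IDs appear in.
        "connections": {
            "vertices": [
                {"field": "doc_id", "size": 10, "min_doc_count": 1}
            ]
        },
    }


body = build_explore_request("hydraulic pump")
print(json.dumps(body, indent=2))
```

The response lists vertices (field plus term) and the connections between them, which is the same data the Graph UI in Kibana draws.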

This is mostly speculation about your data so I think you may need to fill in some more details about the problem here.

(Andrew Stroz) #3

Yes this is true.

This does not hold true. The equipment_id values are not sufficiently unique for a regex to extract them. That is why I was hoping I could relate the equipment_id from one set of documents to the full text search results for that equipment_id in the indexed Word/PowerPoint/PDF documents.


(Mark Harwood) #4

If your app can't isolate the numbers from the text at ingest time, then Elasticsearch will equally have a hard time doing any analysis on this data at query time.

(Andrew Stroz) #5

Is there not some way to use Graph to visualize the equipment_id as a central node, with all of its edges connected to full text search query results? Preferably with the strength of the node being related to the score returned from the full text search.

(Mark Harwood) #6

Not a clean way, no. We rely on nodes being identified by a combination of fieldname and term, which will make life complex if you can't extract equipment IDs out of the text into a field called equipment_id.

(Andrew Stroz) #7

Is there any way for a node to be identified by a document? Or for a node to be identified by a combination of fieldname and term, but as the result of a search query?

I would really like to have the ability to relate a 'node being identified by a combination of fieldname and term' to documents that match a query for that term.

This would allow me to harness the power of Elastic as a full text search service and the visualization that graph offers.

(Mark Harwood) #8

Is this a useful graph visualization? If I understand your requirement it is a star-shaped graph with a single central "query" node and lines connecting out to matching satellite "doc" nodes.

That sounds more usefully drawn as a horizontal bar chart with a bar per doc and bar lengths being doc score?

(Andrew Stroz) #9

The documents that are full text searched based on equipment_id also have other metadata that relate them to each other, e.g. same document type, document status, etc.

The visualization I want to see is star-like at the center, but with the documents in which the equipment_id was found by the full text search query also related to each other through the other metadata in the documents.

I will try and play around with graph in my Kibana instance to gain a better understanding of the relationships I can build.

Thanks for your help.

(Mark Harwood) #10

Useful graphs are those that use high-cardinality fields (many unique terms). Examples include bank accounts, email addresses or hashtags [1]. These create sparser, more interesting shapes. Meaningful relationships exist between rarer terms.

If you choose fields with a small number of values (e.g. "gender" or your doc type/status fields), then you end up with "hairball" graphs with too many lines connecting all the nodes. These tend to be much less interesting connections.
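One quick way to judge candidate fields before graphing them is to count distinct values per field over a sample of documents. The sketch below does this locally in Python. The sample documents and the field names doc_type and status are made up for illustration. On a real index, a cardinality aggregation would give an approximate count of the same thing.

```python
def distinct_values_per_field(docs, fields):
    """Count distinct values per field across docs; list values are flattened."""
    counts = {}
    for field in fields:
        values = set()
        for doc in docs:
            value = doc.get(field)
            if value is None:
                continue
            if isinstance(value, (list, tuple)):
                values.update(value)
            else:
                values.add(value)
        counts[field] = len(values)
    return counts


# Hypothetical sample documents, for illustration only.
sample_docs = [
    {"equipment_id": ["10000000001", "10000000002"], "doc_type": "manual", "status": "approved"},
    {"equipment_id": ["10000000002", "10000000003"], "doc_type": "manual", "status": "draft"},
    {"equipment_id": ["10000000004"], "doc_type": "report", "status": "approved"},
    {"equipment_id": ["10000000005", "10000000001"], "doc_type": "report", "status": "approved"},
]

counts = distinct_values_per_field(sample_docs, ["equipment_id", "doc_type", "status"])
for field, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(field, count)
```

Fields near the top of that list are the better candidates for graph vertices. Fields with only a handful of values are the ones likely to produce hairballs.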

