Best method of handling arbitrary document joins

(Erik Miller) #1

I've done some research, and everywhere I look it sounds like NoSQL databases, including Elasticsearch, rely on application-side joins. For ES I understand the nested and parent-child options and have both implemented in my existing stack, but these are limited in functionality compared to a traditional join.

I'm wondering if there is a good solution for handling joins in a generic sense. For example, I have two document types with no nested/parent-child relationship set up in ES, but they do have IDs that can be used to join them. Is there a plugin/workflow/solution to do this other than a custom application-side join?

I understand this isn't something ES is intended to handle and "this is what a relational DB is for".

I'm wondering if there is another pattern/plugin/option to do arbitrary document joins from ES?
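
For reference, the "custom application side join" mentioned above usually amounts to a hash join over the hits from two searches. The following is only a minimal sketch: the document shapes, the `customer_id` field, and the `hash_join` helper are hypothetical, and plain dicts stand in for hits fetched from two indices.

```python
from collections import defaultdict

def hash_join(left_docs, right_docs, left_key, right_key):
    """Inner-join two lists of dict documents on a shared ID."""
    # Build a lookup from join key to the right-side documents that carry it.
    index = defaultdict(list)
    for doc in right_docs:
        if right_key in doc:
            index[doc[right_key]].append(doc)
    # Probe the lookup once per left-side document.
    joined = []
    for doc in left_docs:
        for match in index.get(doc.get(left_key), []):
            joined.append({"left": doc, "right": match})
    return joined

# Hypothetical documents standing in for hits from two separate searches.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"order_id": "a", "customer_id": 1},
    {"order_id": "b", "customer_id": 1},
    {"order_id": "c", "customer_id": 3},
]
pairs = hash_join(customers, orders, "customer_id", "customer_id")
```

Here `pairs` holds one entry per matching customer/order combination; documents with no partner on the other side are dropped, as in an inner join.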

(Mark Harwood) #2

See the entity-centric indexing approach, which shifts the join compute cost to index time rather than query time.

(Dave Martin) #3

Is there a whitepaper or something we can look at? Hopping around a video to find the info you're looking for just doesn't work well.

(Mark Harwood) #4

There's example data and scripts here:

(Dave Martin) #5

Let me try again.

Please define 'entity centric' and what it means to database structure.

(Mark Harwood) #6

Many Elasticsearch systems capture "events": timestamped records of some activity. These indices are "event-centric", with one document per event, and are often organised into time-based indices, e.g. an index per day, week or month.
Events are generated by "entities": nouns in the real world such as people, IP addresses, cars etc. Each entity typically generates many events. Anyone attempting to analyse the behaviour of entities (e.g. length of time spent on a website, people-who-bought-x-also-bought-?) typically has a hard time doing this on an event-centric index, where the data is not centered around an entity. An entity-centric index brings a summary of an entity's activity into a single document.
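
As a rough illustration of the idea, the sketch below folds a list of events into one summary document per entity. It is not taken from the example scripts mentioned earlier in the thread; the field names (`ip`, `timestamp`, `action`) and the summary attributes are assumptions made for the example, and timestamps are plain integers to keep it short.

```python
def build_entity_docs(events, entity_field="ip"):
    """Fold event documents into one summary document per entity."""
    entities = {}
    # Process events in time order so first_seen / last_seen come out right.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = event[entity_field]
        doc = entities.setdefault(key, {
            entity_field: key,
            "event_count": 0,
            "first_seen": event["timestamp"],
            "last_seen": event["timestamp"],
            "actions": [],
        })
        doc["event_count"] += 1
        doc["last_seen"] = event["timestamp"]
        if event["action"] not in doc["actions"]:
            doc["actions"].append(event["action"])
    return entities

# Hypothetical event-centric data: one document per event.
events = [
    {"timestamp": 100, "ip": "10.0.0.1", "action": "login"},
    {"timestamp": 160, "ip": "10.0.0.1", "action": "view"},
    {"timestamp": 120, "ip": "10.0.0.2", "action": "login"},
]
entity_docs = build_entity_docs(events)
```

Each value in `entity_docs` could then be indexed as a single document keyed by the entity ID, so that questions about an entity's behaviour become ordinary queries on one index.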

(Erik Miller) #7

My index is actually entity-centric to begin with. My specific use case is the ability to run a custom query against a complex document type, then run another query against a different document type, and find the intersection of the two.

The queries are dynamic in nature and change frequently, so we can't build and tune an additional join document for each one.
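
One way to express that intersection, still as an application-side join, is to collect the join IDs from the first query's hits and add them as a `terms` filter to the second query. The sketch below only builds the request body; the field names (`account_id`, `owner_id`, `status`) and the `restrict_to_ids` helper are hypothetical, and sending the request is left out.

```python
def restrict_to_ids(query, id_field, ids):
    """Wrap an arbitrary query so it only matches docs whose id_field is in ids."""
    return {
        "query": {
            "bool": {
                "must": [query],
                # The filter clause narrows results without affecting scoring.
                "filter": [{"terms": {id_field: sorted(ids)}}],
            }
        }
    }

# Hypothetical hits from the first (dynamic) query.
first_hits = [
    {"account_id": "u1"},
    {"account_id": "u2"},
    {"account_id": "u1"},
]
join_ids = {hit["account_id"] for hit in first_hits}

# The second dynamic query, restricted to the IDs found by the first.
second_body = restrict_to_ids({"match": {"status": "open"}}, "owner_id", join_ids)
```

This costs two round trips and only works while the ID set stays reasonably small, since Elasticsearch limits the number of terms allowed in a single `terms` query (the `index.max_terms_count` setting).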

(Dave Martin) #8

We've been experimenting with this, under a different name, as things pop up where it would be useful. We're trying to concentrate on flows, not exports, so when we get an event that mentions an IP, it not only gets indexed into the log index, we also perform an update to an IP state table. That way we have the history (of all IPs) in the event logs, and the current (last observed) state (of all IPs) in the state table. We're growing more state tables as more entity types become needed.

Thank you.
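
The dual-write flow described in this post could look roughly like the sketch below. It is an assumption about the shape of the logic, not the poster's actual code: an in-memory list and dict stand in for the log index and the IP state table, and the `ip`, `timestamp` and `status` fields are hypothetical.

```python
def record_event(event, event_log, ip_state):
    """Append the event to the history and refresh the per-IP state."""
    # History: every event is kept, one entry per event.
    event_log.append(event)
    # State: keep only the most recently observed event per IP,
    # ignoring late arrivals that are older than what is already stored.
    current = ip_state.get(event["ip"])
    if current is None or event["timestamp"] >= current["last_seen"]:
        ip_state[event["ip"]] = {
            "ip": event["ip"],
            "last_seen": event["timestamp"],
            "last_status": event["status"],
        }

event_log, ip_state = [], {}
for incoming in [
    {"ip": "10.0.0.1", "timestamp": 100, "status": "ok"},
    {"ip": "10.0.0.1", "timestamp": 200, "status": "blocked"},
    {"ip": "10.0.0.1", "timestamp": 150, "status": "ok"},  # arrives out of order
]:
    record_event(incoming, event_log, ip_state)
```

After the loop, `event_log` holds all three events while `ip_state` holds a single entry for the IP reflecting the event at timestamp 200.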

(Ecc256) #9

I have a somewhat similar question.
We have a set of servers behind a load balancer.
There are several event streams, such as web, error and performance logs.
Events across streams can be linked/joined by time interval and server name.
What would be the right way to do it?
For example, if we see a CPU spike on a server, we want to know which web requests were executed and whether there were any errors around that time.
It seems like filtering all streams by server name and time interval should do it?
So the only requirement is that the server name field is named exactly the same in all event streams, right?
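
A filter of that kind could be written as a `bool` query with a `term` clause on the server and a `range` clause on the timestamp. The sketch below only builds the request body. The field names `server_name` and `@timestamp`, the server `web-03`, the spike time and the five-minute window are all assumptions for illustration, and it assumes the server field is mapped as a keyword so that `term` matches the exact value.

```python
from datetime import datetime, timedelta

def window_query(server, center, minutes=5,
                 server_field="server_name", time_field="@timestamp"):
    """Build a query for one server's events within +/- minutes of center."""
    start = center - timedelta(minutes=minutes)
    end = center + timedelta(minutes=minutes)
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {server_field: server}},
                    {"range": {time_field: {
                        "gte": start.isoformat(),
                        "lte": end.isoformat(),
                    }}},
                ]
            }
        }
    }

# Hypothetical CPU spike observed on server "web-03".
spike = datetime(2016, 5, 1, 12, 0, 0)
body = window_query("web-03", spike)
```

Sending the same body to the web, error and performance indices in one request only matches where the field exists under that name. If the streams are queried separately, `server_field` can differ per stream.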

I didn’t mean to hijack the thread. If it is not related, I can repost it as a separate question.
