Advice for mapping one-to-many relationships from an ER DB

I'm looking for some advice on mapping a relatively simple dataset from an ER DB to Elasticsearch via Logstash. What I'm currently unsure about is how the mapping is done for zero-to-many relationships in the ER DB.

Consider the following to visualise the question...

Table: Tickets

TicketID	Summary
1001		"Broken but fixed"
1002		"Also resolved"
1003		"Still broken"

Table: RootCauses

CauseID		TicketID	Cause
2001		1001		"User forgot password"
2002		1001		"User was on holidays for too long"
2003		1001		"Bug in reset password tool"
2004		1002		"Power reset"

The resulting documents would look something like the following, because the relationship between causes and tickets is zero-to-many:

{
	"TicketID" : 1001,
	"Summary" : "Broken but fixed",
	"Causes" : ["User forgot password", "User was on holidays for too long", "Bug in reset password tool"]
}

{
	"TicketID" : 1002,
	"Summary" : "Also resolved",
	"Causes" : ["Power reset"]
}

{
	"TicketID" : 1003,
	"Summary" : "Still broken",
	"Causes" : []
}

If the relationship were 1:1 I could simply grab this from the query and map it into my ES output, but because the relationship is zero-to-many I'm unsure how to approach this.

  1. How do I convert this relationship from the JDBC input plugin to the Elasticsearch output? Is this something done in the query itself, or is a filter plugin needed? I'm a bit confused about where to start with this. (EDIT: I think I've worked out the first part of the problem by "flattening" this in the query, so it can return a single "causes" field delimited by a comma or similar. Is this the way it's typically done?)
  2. Any guidance on how the mapping should be done in the index would also be appreciated; I'm not sure what the best approach is here. Basically, in this example the causes would be mapped as keywords, which we would use to report on ticket resolution causes etc. (a rough sketch of what I have in mind follows below).
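This is roughly what I'm imagining for the mapping, though it's untested, the index name is just a placeholder, and I'm assuming a recent Elasticsearch version without mapping types:

PUT tickets
{
	"mappings": {
		"properties": {
			"TicketID": { "type": "integer" },
			"Summary":  { "type": "text" },
			"Causes":   { "type": "keyword" }
		}
	}
}

As far as I understand, a keyword field accepts an array of values, so the zero-to-many causes shouldn't need anything special in the mapping itself.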

Any guidance appreciated here, thank you.

Look into using string aggregations in SQL.
Postgres: https://stackoverflow.com/a/43944/5349531
Oracle: https://stackoverflow.com/a/12421460/5349531
MySQL: https://stackoverflow.com/a/276949/5349531
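For example, in Postgres a grouped string_agg over your RootCauses table would give one comma-delimited string per ticket (sketch only):

/* one row per ticket, causes collapsed into a single delimited string */
SELECT
	RC.TicketID,
	string_agg(RC.Cause, ',') AS Causes
FROM RootCauses RC
GROUP BY RC.TicketID;

Oracle's LISTAGG and MySQL's GROUP_CONCAT (the links above) do the same job.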

There are two ways to do this.

  1. Put all the complexity in the JDBC Input statement by adding the aggregation into the full query statement.
  2. Make the JDBC Input query statement a simple one that just gets new tickets since the last run and use the new JDBC Streaming Filter to fetch and add the aggregated string from the RootCauses table.

NOTE: in either solution the field names may be forced to lowercase.

Solution 1

  • Use the JDBC Input to build an Event from the JOINed Tickets and RootCauses tables, adding the string aggregation in as an AS-named field. Obviously you can build this statement using a SQL GUI tool.
  • Use the split function from the Mutate Filter to split it into an array if needed (a sketch of the full pipeline follows this list).
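A minimal pipeline sketch along those lines, assuming a Postgres source; the connection settings, driver path and lowercased field name are placeholders, not values from this thread:

input {
	jdbc {
		jdbc_connection_string => "jdbc:postgresql://dbhost:5432/ticketdb"
		jdbc_user => "user"
		jdbc_password => "password"
		jdbc_driver_library => "/path/to/postgresql.jar"
		jdbc_driver_class => "org.postgresql.Driver"
		# JOIN plus the string aggregation, returning one row per ticket
		statement => "SELECT T.TicketID, C.Causes FROM Tickets T LEFT OUTER JOIN (SELECT TicketID, string_agg(Cause, ',') AS Causes FROM RootCauses GROUP BY TicketID) C ON T.TicketID = C.TicketID"
	}
}

filter {
	mutate {
		# the jdbc input may lowercase column names, hence "causes"
		split => { "causes" => "," }
	}
}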

Solution 2

  • Start with the JDBC Input to build an Event from the Tickets table.
  • Use the JDBC Streaming filter to fetch the aggregated string from the RootCauses table using the TicketID from each event. You will probably not benefit from result caching because each ticket ID is unique.
  • Use the split function from the Mutate Filter to split to an array if needed.

Partial config for JDBC Streaming Filter.

  statement => "SELECT string_agg(Cause, ',') AS agg_causes FROM RootCauses WHERE TicketID = :ticketid GROUP BY TicketID;"
  parameters => { "ticketid" => "ticket_id" }

NOTE: of the two strings in the parameters map, the LHS is the placeholder used in the SQL statement (:ticketid in the example above) and the RHS is the event field from which to pluck the substituted value.
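Assembled into a filter block it might look something like this; the connection settings, driver path and target field name are placeholders and the details are untested:

filter {
	jdbc_streaming {
		jdbc_connection_string => "jdbc:postgresql://dbhost:5432/ticketdb"
		jdbc_user => "user"
		jdbc_password => "password"
		jdbc_driver_library => "/path/to/postgresql.jar"
		jdbc_driver_class => "org.postgresql.Driver"
		statement => "SELECT string_agg(Cause, ',') AS agg_causes FROM RootCauses WHERE TicketID = :ticketid GROUP BY TicketID"
		parameters => { "ticketid" => "ticket_id" }
		# results are stored under the target field as an array of row hashes
		target => "causes_lookup"
	}
}

The aggregated string should then be reachable at [causes_lookup][0][agg_causes], from where it can be renamed and split with the Mutate filter.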

General:
Note that complex statements will slow things down. Having two JDBC calls with simple statements may be faster than one complex call.


Thanks for the detailed information and examples, much appreciated. Allow me some time to work through this and I'll report back once I've made progress. Thanks!


Sorry, I'm a bit delayed in circling back here, but thank you so much guyboertje for the guidance. The second solution looks more elegant, but I've gone with the first one for now and will look into the second option later to optimise once I have all the basics working. Essentially I used a SQL query to aggregate the result set into a single comma-delimited field per ticket, and then used the split function to separate it. Some snippets in case it helps anyone else.

/* This is an example snippet; the actual SQL I used was more complex with additional requirements,
but for the sake of this example it should be useful */
SELECT
	T.TicketID,
	C.Causes
FROM
	Tickets T
	LEFT OUTER JOIN
	(
		SELECT
			RC.TicketID,
			/* This is the key part which does the aggregation */
			LISTAGG(RC.Cause, ',') WITHIN GROUP (ORDER BY RC.Cause) AS Causes
		FROM RootCauses RC
		GROUP BY RC.TicketID
	) C ON T.TicketID = C.TicketID

# Logstash Configuration file
filter {
	# Mutate plugin
	mutate {
		# split the causes field using comma as a delimiter
		split => { "causes" => "," }
	}
}

Works perfectly. Initially I thought I had an issue: when there were no entries for a given ID in the causes table the field would be null, and I was seeing an error in the Logstash logs. Once I looked further I realised I still had DEBUG level tracing enabled from earlier troubleshooting, and it was really just an informational message saying it couldn't split a null field; the result was still the expected behaviour.
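In case it helps anyone else who hits the same message, wrapping the split in a conditional should avoid it entirely; this is just an untested tweak on the config above:

filter {
	# only attempt the split when the causes field actually exists on the event
	if [causes] {
		mutate {
			split => { "causes" => "," }
		}
	}
}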

Thanks for the help! I'll mark this resolved now.

