Logstash-elasticsearch design question: process XML payloads from database

(Kurt V) #1

Hi,
I've been given the task of rewriting an application that searches for audit data in a SQL Server database. The data is currently queried with plain SQL, but that setup performs poorly, so I would like to move to a better approach using Logstash and Elasticsearch.

The database is currently about 1 TB in size, and the relevant data is stored in 5 tables (normalized, but with no foreign key constraints). Around 10,000 records are added to these tables every day.

My main concern is the processing of the audit events stored in the Events table. The event payload lives in a PAYLOAD varchar field, but its content is actually XML. The "problem" is that there are around 150 types of payloads, each with a different XML format, and full-text search must be possible on the payload. Mapping all the different XML formats would be a big task, so my question is: can I just store the payload field in the index and still enable full-text search?

My idea is to denormalize the different tables and store a single event document per event (containing data from the 5 relevant tables). Or should I create a second index containing just the XML payload data?
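To illustrate what I have in mind, the denormalization could happen at query time in a Logstash jdbc input, with the join done in SQL (all table, column, and connection names below are made up for the example):

```
input {
  jdbc {
    jdbc_driver_library => "mssql-jdbc.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://dbhost;databaseName=Audit"
    jdbc_user => "logstash"
    schedule => "*/5 * * * *"
    # Join the Events table with a related table so each event
    # becomes one denormalized document
    statement => "SELECT e.EventId, e.EventDate, e.PAYLOAD,
                         u.UserName
                  FROM Events e
                  JOIN Users u ON u.UserId = e.UserId
                  WHERE e.EventId > :sql_last_value
                  ORDER BY e.EventId"
    use_column_value => true
    tracking_column => "eventid"
  }
}
```

The `:sql_last_value` / `tracking_column` pair makes the input incremental, so only new records are fetched on each scheduled run.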

I don't have extensive elasticsearch experience, so maybe I'm missing something.

Thanks!

#2

If you just load the XML as text, it will be as searchable as any other text field.
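For example, a minimal mapping along these lines (index and field names assumed) keeps the raw XML searchable as ordinary analyzed text:

```
PUT audit-events
{
  "mappings": {
    "properties": {
      "payload": { "type": "text" }
    }
  }
}
```

The standard analyzer will tokenize element names, attributes, and text content alike, so full-text queries on `payload` work without any knowledge of the XML structure.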

That said, the Logstash xml filter will parse arbitrary XML; you don't need to tell it the structure. However, with 150 formats you may bump up against Elasticsearch's limit on the number of fields per index. If the payloads share certain common elements, you could map those explicitly and turn off dynamic mapping so the remaining fields are not indexed.
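A sketch of that combination (field names assumed): parse the whole payload into a subfield, and rely on the index mapping to ignore everything you haven't mapped:

```
filter {
  xml {
    # Field holding the raw XML string
    source => "payload"
    # Parsed structure lands under this field
    target => "parsed"
  }
}
```

On the Elasticsearch side, setting `"dynamic": false` on the `parsed` object in the mapping means only the fields you map explicitly are indexed; the rest still lives in `_source` but creates no new mapped fields.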

Another way of handling common elements is XPath. For example, if I have 100 different payment formats but know that they all contain a PostingDate element, I can use XPath to extract just that value.
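With the xml filter that might look like this (the XPath expression and field names are illustrative):

```
filter {
  xml {
    source => "payload"
    # Only extract the XPath matches; don't index the whole tree
    store_xml => false
    xpath => { "/Payment/PostingDate/text()" => "posting_date" }
  }
}
```

Note that XPath results are stored as arrays of matches, so you may want a mutate filter afterwards to flatten `posting_date` to a single value.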