Parsing / Indexing strategies for BRO?

This is part Logstash, part Elasticsearch, but since the decision is being made at the Logstash level, I thought I would ask here:

What are people's strategies for indexing incoming BRO data? It creates over a dozen log files which need to be parsed and indexed into ES.

At first I was storing everything in a single ES index, but reflecting on how RDBMSs work, I thought that might be inefficient, since some logs have one set of fields and other logs have completely different sets of fields. So, in an effort to "normalize", I've begun storing each log type in its own separate index.
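The per-type routing itself is simple: a single elasticsearch output can interpolate the log type into the index name. Roughly what I mean (a sketch, assuming each event carries its log type in [type]; hosts and index naming are just examples):

```
output {
  # One index per BRO log type, e.g. bro-conn-2017.06.01, bro-dns-2017.06.01.
  # Assumes [type] was set to the log type (conn, dns, http, ...) at input time.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "bro-%{type}-%{+YYYY.MM.dd}"
  }
}
```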

However, sometimes we need to be alerted about events that happen ACROSS log types.

So I've started working on additional Logstash steps to normalize a subset of the BRO data (regardless of log type) and store it in a new index (bro-combined), so we have a single index to refer to, and can go to the dedicated indexes for more information if needed.
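The double-insert is done with the clone filter: every event gets a stripped-down copy routed to bro-combined, while the original goes to its per-type index. A sketch of the idea (the clone filter sets [type] on the copy to the clone name, hence copying it to log_type first; the whitelist entries are regexes and just illustrative):

```
filter {
  # Preserve the original log type; the clone filter overwrites [type] on copies.
  mutate { add_field => { "log_type" => "%{type}" } }

  # Emit a second copy of each event, tagged with type "combined".
  clone { clones => ["combined"] }

  if [type] == "combined" {
    # Keep only the normalized subset of fields on the combined copy.
    # whitelist_names takes regexes; these field names are examples.
    prune {
      whitelist_names => ["^@timestamp$", "^ts$", "^uid$", "orig_h", "resp_h", "^log_type$"]
    }
  }
}

output {
  if [type] == "combined" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "bro-combined-%{+YYYY.MM.dd}"
    }
  } else {
    # Originals keep going to their per-type index as in the output above.
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "bro-%{type}-%{+YYYY.MM.dd}"
    }
  }
}
```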

I'm wondering if you all think I'm on the right track, or if I'm going about this completely the wrong way. It makes sense from a workflow perspective, but I'm worried that this strategy will double the number of insert operations ES has to handle, while the bandwidth increase will be somewhat less extreme (probably 1.4-1.5x, since the combined copies only carry a subset of fields).

Interested to hear people's thoughts.

Thanks!

How are you doing the alerting?

I'm evaluating SIEMonster, so I'm using the FourOneOne interface supplied as part of that package.

https://demo.fouroneone.io/

You can run multiple if and else if statements, matching them to the different logs, all in one config file, so there's really no need for separate config files.
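Something like this, all in one file (a sketch; the path patterns and type names are just examples, assuming the file input populates [path]):

```
filter {
  # Dispatch each BRO log to its own parsing branch inside a single config file.
  if [path] =~ /conn\.log/ {
    mutate { replace => { "type" => "bro_conn" } }
    # conn.log-specific filters here
  } else if [path] =~ /dns\.log/ {
    mutate { replace => { "type" => "bro_dns" } }
    # dns.log-specific filters here
  } else if [path] =~ /http\.log/ {
    mutate { replace => { "type" => "bro_http" } }
    # http.log-specific filters here
  }
}
```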

I like the method Kustodian chose of using separate .conf files for each log type: it's more modular, easier to keep in a VCS, and lets you compare different filters side by side. Either way, it works the same.

The question was more about WHAT to do with the data after Logstash: store it in a single ES index, in a different index for each log type (protocol), or in a combination of the two (individual indexes, with a second "insert" of a subset of each record into a combined index).

I know that in the regular database world it's more efficient to normalize data; I'm not sure how that translates to Lucene/ES yet. Is it better to keep everything together even if each set of fields is only used by 10% of the records, or to break them out and store/index them separately?

It depends.
Sparse data can be stored inefficiently, but there are improvements for that coming in Lucene with 6.0. However, you can read data from several indices at once simply by querying them together, so it really comes down to how you interact with Elasticsearch via these other tools.
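To illustrate the multi-index point: a search can target several indices via a comma-separated or wildcarded list in the URL, so cross-log queries don't strictly require a combined index (index names and the field here are placeholders):

```
curl -XGET 'localhost:9200/bro-conn-*,bro-dns-*/_search' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "id.orig_h": "10.0.0.5" } }
}'
```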
