S3 as intermediate stage in Elastic stack pipeline

Hi Folks,

Thanks to all for creating the Elastic stack (ELK) - it gives small businesses like mine visibility into security events and performance data that would be too costly or inflexible to get with proprietary tools.

I have been using the ELK stack since 2013 and am changing the way we consume and process events. To help our events better survive index template changes, ES version upgrades, and grok pattern problems, I want to send all events to Amazon S3 as an intermediate stage before they end up in Elasticsearch.

For example: Filebeat => Logstash => S3 => Logstash (filters, etc) => Elasticsearch
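
Roughly, the two Logstash legs would look like this (a simplified sketch rather than my full configs; the bucket, region, port, and grok pattern are all placeholders):

    # First leg: receive from Filebeat and archive raw events to S3
    input {
      beats {
        port => 5044                       # placeholder port
      }
    }
    output {
      s3 {
        bucket => "my-log-archive"         # placeholder bucket
        region => "us-east-1"              # placeholder region
        prefix => "raw/"                   # keep untouched events under one prefix
      }
    }

    # Second leg: read back from S3, apply filters, index into Elasticsearch
    input {
      s3 {
        bucket => "my-log-archive"
        region => "us-east-1"
        prefix => "raw/"
      }
    }
    filter {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }   # example pattern only
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "logs-%{+YYYY.MM.dd}"
      }
    }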

Some thoughts / questions:

  1. Is this a valid architecture? My aim is to be able to work with my events in a more flexible way (i.e. change mappings more easily, work with smaller or larger data sets more easily) by simply importing portions of data from S3 and indexing them.
  2. Where should filtering happen: on the first Logstash instance or the second? Will the intermediate S3 stage have negative effects on my grok patterns or mappings?
  3. I currently have some events that make it through to Elasticsearch if I send them directly but are silently dropped if I send them to S3 first. They go through a grok filter.

Any help or advice is much appreciated...see my configs below:

s3-logstash-es.conf
filebeat-logstash-s3.conf


Is this a valid architecture?

I wouldn't call it invalid but certainly atypical. Any particular reason you want to use S3 for the intermediate storage? This sort of buffering is usually implemented with a message broker.

Where should filtering happen: on the first Logstash instance or the second? Will the intermediate S3 stage have negative effects on my grok patterns or mappings?

Correctly implemented, this kind of intermediate storage should be transparent and not affect your events at all, so whether you apply filters before or after (or in both places) shouldn't matter.

I currently have some events that make it through to Elasticsearch if I send them directly but are silently dropped if I send them to S3 first. They go through a grok filter.

Logstash won't drop events silently.
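
If grok is the suspect, keep in mind that failed matches aren't discarded either: by default the event is kept and tagged with _grokparsefailure, so you can route those somewhere visible. A minimal sketch (the pattern and file path are just examples):

    filter {
      grok {
        # events that fail to match are kept and tagged "_grokparsefailure" by default
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
    }
    output {
      if "_grokparsefailure" in [tags] {
        # write failed events to a separate file so they are easy to spot
        file { path => "/tmp/grok-failures.log" }
      }
    }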

It is expensive to run an Elasticsearch cluster. Having an S3 bucket as my "source of truth" allows me to:

  • Selectively index my data into Elasticsearch. I can choose to index a subset of my data (by date / type / etc.), and if I later need more detail, need to add a type or mapping, or want to change the scope, all I need to do is re-index those docs (see the sketch after this list)
  • Run a simpler architecture, which makes it less likely that data will be lost during ES downtime / mapping changes / upgrades / breaking changes, etc. (I know you can architect ES for high availability, but again it is expensive, especially if you want to keep data for a long time)
  • Log everything to S3 with --verbose or --debug and only index lower-verbosity events
  • Use Amazon Athena to query more data in S3 than I can afford to keep in ES, and then spin up a 128GB-memory EC2 instance to pull all of that data into a new or parallel ES cluster for advanced data analytics
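
As a sketch of the first point, assuming the archive is laid out under per-type, per-date prefixes, re-indexing one slice is just a matter of pointing an S3 input at the right prefix (the bucket, prefix layout, and index name here are placeholders):

    input {
      s3 {
        bucket => "my-log-archive"               # placeholder bucket
        region => "us-east-1"
        prefix => "raw/firewall/2017/06/01/"     # pull only the type/date slice I care about
        codec  => json_lines
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "firewall-2017.06.01"           # index just that subset
      }
    }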

For example, I can log every IP address that has touched my infrastructure over the last 10 years to S3, and in the event of a compromise I can run a quick Athena query as part of a cursory incident response; if I notice something interesting, I can import the debug logs from that point in time into ES for more detailed analysis!

Also, I found where the events were being lost: I was using the "json" codec with the S3 plugin, which caused it to truncate JSON objects. Switching to the json_lines codec resolved this, at least with my current data set. It might cause problems with multiline logs, but I will deal with that closer to the time.
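
For anyone who hits the same thing, the change was just the codec setting on the S3 output (and the matching input) - sketched here with a placeholder bucket:

    output {
      s3 {
        bucket => "my-log-archive"     # placeholder bucket
        region => "us-east-1"
        # codec => json                # old setting - objects were truncated when read back
        codec  => json_lines           # one JSON document per line survives the round trip
      }
    }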

