Log Enrichment with Logstash and the Elastic Filter Plugin - How to Handle Ordering / Relationships?

Howdy,

We're building an analytics system, most likely using the ELK stack from start
to finish, and while we understand the document-oriented nature of Elastic, we're
having some headaches figuring out how to get our (very relational) data into
that shape.

Our key problem is, quite simply, our application database. It's under enough
load as it is, and we don't want to be querying and JOINing across multiple
MySQL tables (potentially quite a few) just for the sake of saving a log line.

Background

So, here's a rough idea of what we ultimately want to get into Elastic using a
page view as an example (I'm using nested documents here for simplicity - this
may or may not be a good idea / the final approach).

Imagine the use case here as something similar to WordPress Sites with
Pages, Themes, and Widgets - it's close enough...

{
  "_type": "pageview",
  "timestamp": "1234567890",
  "page": {
    "uuid": "DEADBEEF",
    "widgets": [{
      "uuid": "DEADFEED",
      "name": "text",
      "content": "..."
    }, {
      "uuid": "DEADBAAD",
      "name": "image",
      "content": "..."
    }]
  },
  "site": {
    "uuid": "FEE1DEAD",
    "name": "My First Site",
    "creator": {
      "uuid": "DEAD10CC",
      "name": "John Doe",
      "username": "john.doe@example.com"
    }
  },
  "visitor": {
    "uuid": "DEADC0DE",
    "name": "Jane Doe",
    "username": "jane.doe@example.com"
  }
}

We have a normal enough structure for this in MySQL:

  • users
  • sites
    • sites.creator_uuid
  • pages
  • sites_pages
  • widgets
  • sites_widgets

Now, for reasons I'm not going to get into here, our database querying is a
bit...inefficient (mainly the ORM layer actually, to be honest).

We would love the simplicity of being able to log the above JSON structure
directly to disk, but building that document at log time would mean rejigging
our database layer, and life would be a lot simpler if we didn't have to.

Instead, we'd prefer to log:

{
  "_type": "preview",
  "page_uuid": "DEADBEEF",
  "site_uuid": "FEE1DEAD",
  "visitor_uuid": "DEADC0DE"
}

and have Logstash enrich it before it gets saved into Elastic. Of course, now
we have two problems:

  1. We don't have direct access to our "source of truth" from Logstash (Filebeat
    does, but Filebeat doesn't do enrichment apparently)
  2. We can't be certain that the information is already in Elastic when we
    process an event, due to:
  • Index refresh latency
  • Ordering of events from multiple Filebeat harvesters

Finally, to the questions:

Questions

  1. Most important, is our use-case and problem making sense? Is it unclear, or is
    somebody thinking "you don't get it - just do it like this..."?
  2. Assuming we're right, is there any way to ensure, for example, that a
    sitecreation event has been saved to Elastic before processing a pageview?
    • How does this work with multiple log files and Filebeat harvesters? I can
      imagine Filebeat processing pageviews.log just a little bit before it
      processes sitecreations.log, as an example
    • Does that mean we just log all our events to a single file? Does
      Filebeat/Logstash ordering work like that? (I have tried to find out about
      ordering, but my Google-fu isn't up to scratch)
  3. Is there a better way of approaching this problem? It feels like the kind of
    problem somebody else must have had, but I don't even know how to describe it
    in terms of a search query or a post title...

Lastly, any reading material / tutorials / books / gifts-from-the-heavens would be very much appreciated - I don't expect a complete and definite answer for this - just some general pointers would be nice...

Cheers,
-- Craig Roberts

Hello Craig,

Why is it important that they be indexed in sequence? Elasticsearch is near-realtime and document-based, so each document exists on its own. Unless you are relying on something like the aggregate filter to calculate differences between documents, the sequence of indexing is not usually important, because within a short time (seconds) all documents related to an event will exist (and be searchable) in Elasticsearch.

Cheers,
-Robin-

Hi Robin,

This is the flow we are imagining.

  1. Site is created
  2. Site is viewed

So Filebeat picks up two events within pretty much the same second and passes them off to Logstash.

I assume Logstash processes events in parallel?

If so, then we have a large payload for the site creation which needs to be used to enrich the second site-view event.

If the site-creation event is not yet in Elastic at the time of querying, then we may have a problem.

One of the reasons we want to enrich data at the Logstash end instead of at the source is that we have to gather a lot of statistics over a connection worse than a 56k modem line. We want the event payload to be as small as possible, but still be able to gather the same level of stats that we could with the servers in our datacenter.

Cheers, Lee

Just to clarify: we were planning to use the elasticsearch filter plugin to retrieve previously indexed data from Elastic for enrichment. This saves us the problem of retrieving extra data from the DB, but adds the problem of "how do I know the site information has been saved and is available?" when logging page views within the same second.
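
Roughly what we had in mind, going by the current elasticsearch filter plugin docs (untested - the index pattern and field names below are just taken from the example above, nothing definitive):

filter {
  if [_type] == "pageview" {
    # Look up the sitecreation event for this site in Elasticsearch
    # and copy a couple of its fields onto the pageview being processed.
    elasticsearch {
      hosts  => ["localhost:9200"]
      index  => "logstash-*"
      query  => "type:sitecreation AND site_uuid:%{site_uuid}"
      fields => {
        "site_name"    => "[site][name]"
        "creator_uuid" => "[site][creator][uuid]"
      }
    }
  }
}

If the sitecreation document hasn't been indexed (and refreshed) yet when the pageview arrives, that lookup simply comes back empty - which is exactly the ordering problem above.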

This may sound like a contrived use-case, but it's just the simplest example - there are some other cases we need to do similar enrichment with a dependency on "created" events - sharing during creation is the first thing that comes to my head, and so on.

Edit

I have found a blog post - Document Processing and Elasticsearch - which covers some of the options in more detail for us. There is also another discussion on this forum (Post processing of aggregation data) where somebody is having a similar problem:

We don't want to denormalize our event data to contain all of the other relevant info besides the ids [...] Anyone out there have a good solution for storing ids in elasticsearch but then being able to associate those ids with their labels (from another data store but can be made available via a rest endpoint or by syncing them into elasticsearch) for presentation purposes?

Obviously we will now be doing some reading around mappings and custom asynchronous plugins.

Ta,
-- Craig

Hi Lee, Craig,

I think you've determined it yourselves, but to round it off: if you have a distributed path for data to reach Elasticsearch (even just having multiple filter worker threads within one Logstash agent!), you cannot guarantee that the documents will be indexed in order.
For most cases this is not a problem. Most people solve it by having monitoring in place to identify backpressure in the system (i.e. logs are not being ingested as fast as they are being created), and otherwise assume that within a reasonable time (10 seconds?) all events will have been collected, transmitted, and indexed.
If this is unacceptable for you, then you will have to ensure that all events are written to one file, and that this file is processed by exactly one Logstash agent with only one filter worker thread.
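
For example (the config path here is only a placeholder), forcing everything through a single worker looks something like:

bin/logstash -f /etc/logstash/conf.d/enrich.conf -w 1

Newer Logstash versions can also set this as pipeline.workers: 1 in logstash.yml.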

Cheers,
-Robin-

Hi Robin,

That's exactly the summary we were looking for - thank you very much. We suspected this was the case, but Elastic/Big Data approaches are generally quite new to us, and we wanted a high-level "that's pretty much the case" confirmation of this.

With those considerations in place, we can evaluate our options and plan for a full discussion with Elastic later on.

Thanks again :smile:
-- Craig

If you have multiple log lines attached to the same "global event" (e.g. an HTTP request), you can aggregate all the interesting information from those log lines using the Logstash aggregate filter, and push it all into one final event that you index into Elasticsearch.

Also, if you want to be sure that logs arrive and are processed in the right order, you have to set the number of filter workers to 1.
You can do this using the command-line flag "-w 1".
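
A minimal sketch of that, based on the aggregate filter docs and reusing the site_uuid field from the example earlier in the thread (field names are assumptions, and this is untested):

filter {
  # When the site-creation event arrives, remember its details in the
  # in-memory map keyed by site_uuid.
  if [_type] == "sitecreation" {
    aggregate {
      task_id    => "%{site_uuid}"
      code       => "map['site_name'] = event.get('site_name')"
      map_action => "create"
    }
  }

  # Later events carrying the same site_uuid get the remembered fields copied in.
  if [_type] == "pageview" {
    aggregate {
      task_id    => "%{site_uuid}"
      code       => "event.set('site_name', map['site_name'])"
      map_action => "update"
      timeout    => 3600
    }
  }
}

The map only lives in memory inside a single Logstash process, which is the other reason the filter needs exactly one worker ("-w 1") to behave predictably.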