Howdy,
We're building an analytics system, most likely using the ELK stack from start
to finish, and while we understand the document-oriented nature of Elastic, we're
having some headaches figuring out how to get that kind of data into Elastic.
Our key problem is, quite simply, our application database. It's under enough
load as it is, and we don't want to be querying and JOINing across multiple
MySQL tables (potentially quite a few) just for the sake of saving a log line.
Background
So, here's a rough idea of what we ultimately want to get into Elastic using a
page view as an example (I'm using nested documents here for simplicity - this
may or may not be a good idea / the final approach).
Imagine the use case here as something similar to WordPress Sites with
Pages, Themes, and Widgets - it's close enough...
{
  "_type": "pageview",
  "timestamp": "1234567890",
  "page": {
    "uuid": "DEADBEEF",
    "widgets": [
      {
        "uuid": "DEADFEED",
        "name": "text",
        "content": "..."
      },
      {
        "uuid": "DEADBAAD",
        "name": "image",
        "content": "..."
      }
    ]
  },
  "site": {
    "uuid": "FEE1DEAD",
    "name": "My First Site",
    "creator": {
      "uuid": "DEAD10CC",
      "name": "John Doe",
      "username": "john.doe@example.com"
    }
  },
  "visitor": {
    "uuid": "DEADC0DE",
    "name": "Jane Doe",
    "username": "jane.doe@example.com"
  }
}
We have a normal enough structure for this in MySQL:
- users
- sites (with sites.creator_uuid)
- pages
- sites_pages
- widgets
- sites_widgets
Now, for reasons I'm not going to get into here, our database querying is a
bit...inefficient (mainly the ORM layer actually, to be honest).
We would love the simplicity of being able to log the above JSON structure
directly to disk, but life would be a lot easier if we didn't have to rejig our
database layer to assemble it.
Instead, we'd prefer to log:
{
  "_type": "pageview",
  "page_uuid": "DEADBEEF",
  "site_uuid": "FEE1DEAD",
  "visitor_uuid": "DEADC0DE"
}
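On the application side, writing that compact event is trivial - one JSON object per line, appended to a file for Filebeat to harvest. A rough sketch (file path and function name are just placeholders, not our real code):

```python
import json
import time


def log_pageview(path, page_uuid, site_uuid, visitor_uuid):
    """Append a minimal pageview event as one JSON line."""
    event = {
        "_type": "pageview",
        "timestamp": str(int(time.time())),
        "page_uuid": page_uuid,
        "site_uuid": site_uuid,
        "visitor_uuid": visitor_uuid,
    }
    # One event per line ("JSON Lines") keeps Filebeat's
    # line-oriented harvesting happy.
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```

No JOINs, no extra queries - the UUIDs are already in hand when the page is served.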
and have Logstash enrich it before it gets saved into Elastic. Of course, now
we have two problems:
- We don't have direct access to our "source of truth" from Logstash (Filebeat
  does, but Filebeat doesn't do enrichment, apparently).
- We don't know if we can be certain that the information is already in Elastic,
  due to:
  - index refresh latency
  - the ordering of events from multiple Filebeat harvesters
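For reference, what we had in mind for the enrichment step is something like Logstash's elasticsearch filter plugin, looking up a previously indexed sitecreation event by UUID (completely untested - the hosts, query, and field names below are made-up placeholders, and this is exactly where the refresh-latency/ordering worry bites):

```
filter {
  if [_type] == "pageview" {
    # Look up the sitecreation event we (hopefully) indexed earlier
    # and copy fields from it onto this pageview.
    elasticsearch {
      hosts  => ["localhost:9200"]
      query  => "_type:sitecreation AND site_uuid:%{site_uuid}"
      fields => { "site_name" => "site_name" }
    }
  }
}
```

There's also a jdbc_streaming filter that could query MySQL directly (it has a built-in cache, which might soften the load concern), but that brings back the database dependency we're trying to avoid.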
Finally, to the questions:
Questions
- Most importantly, do our use case and problem make sense? Is it unclear, or is
  somebody thinking "you don't get it - just do it like this..."?
- Assuming we're right, is there any way to ensure, for example, that a
  sitecreation event has been saved to Elastic before processing a pageview?
- How does this work with multiple log files and Filebeat harvesters? I can
  imagine Filebeat processing pageviews.log just a little bit before it
  processes sitecreations.log, as an example.
  - Does that mean we just log all our events to a single file? Does
    Filebeat/Logstash ordering work like that? (I have tried to find out about
    ordering, but my Google-fu isn't up to scratch.)
- Is there a better way of approaching this problem? It feels like the kind of
  problem somebody else must have had, but I don't even know how to describe it
  in terms of a search query or a post title...
Lastly, any reading material / tutorials / books / gifts-from-the-heavens would
be very much appreciated. I don't expect a complete and definitive answer -
just some general pointers would be nice...
Cheers,
-- Craig Roberts