Enriching new documents with fields from other ES indexes


Hey guys

I'm looking for a very performant way of enriching new log data with data already in ES.

For a potential scenario, assume that we have a (scrubbed) copy of our MySQL users table in ES. This Logstash instance runs every 30 minutes, so the index is an almost real-time copy of our production SQL stores.

input {
  jdbc {
    jdbc_driver_library => "***********"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_user => "******"
    jdbc_password => "*****"
    jdbc_connection_string => "jdbc:mysql://*****:3306/mydb"
    statement => "select id, created_at, email from users"
    tags => ["users"]
  }
}

output {
  elasticsearch {
    host => "******************"
    port => 80
    protocol => "http"
    index => "users"
    document_id => "%{id}-%{environment}"
  }
}

Note: the users tables of more than one environment go into the same index; the documents are differentiated by document_id => "%{id}-%{environment}".

Now, for every user-event log that comes into a (different) Logstash instance, I want to enrich it with the user's email address, ideally by doing something like this:

filter {
  elasticsearch {
    hosts => ["**********"]
    index => "users"   # the option I wish I had here, see the note below
    query => "id:%{user_id} AND environment:%{environment}"
    fields => ["email", "id"] # With just "email" here I get the error: This field must contain an even number of items, got 1
  }
}

Note: as far as I can tell, index does not exist as an option here. The Logstash docs seem to use the type field in the query instead. I have deliberately avoided using type to differentiate logs because of what I was told at one of the Elastic workshops: there seemed to be some complication between types at the Elasticsearch level and their ramifications at the Lucene index level, and "it was a regrettable architectural decision that may be remedied in future ES versions."
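For reference, a sketch of the lookup I am after, under two assumptions. First, the "even number of items" error is consistent with fields being read as pairs of (field in the found document, field to set on the event), so a single "email" entry is an incomplete pair and ["email", "id"] would copy the email into the event's id field. Second, later releases of the elasticsearch filter document an index option; whether the installed version has it needs checking. The target field name user_email is just a placeholder.

```
filter {
  elasticsearch {
    hosts => ["**********"]
    # Assumption: the installed filter version supports "index".
    index => "users"
    query => "id:%{user_id} AND environment:%{environment}"
    # Pair form: copy "email" from the matched users document
    # into "user_email" (hypothetical name) on the incoming event.
    fields => { "email" => "user_email" }
  }
}
```

With the pair form, each additional field to bring in is one more source => target entry in the same hash.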

So in my case, where most of my documents in Elasticsearch have the same type and the only thing that differentiates them is the index they are in, what should I do? If I can only use type here, that is a problem, because multiple indexes have an id field and it means something different in each. In this case I want to get the document from index => "users" with id:1 and NOT the document from the logstash indexes with id:1.

Another thing I have noticed is that this may not be particularly performant and probably won't scale beyond about 100 events/sec if there is no caching on the elasticsearch filter side: at that rate it means 100 extra requests a second, which would put extremely heavy load on the Elasticsearch cluster.
Does anyone have a different approach to achieving this same enrichment? I would be open to any kind of approach. Perhaps I should wait until the data is in Elasticsearch and then run a job over the user-events index that grabs the email field from the users index and adds it as a field to every document in user-events.
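To make the "job over the user-events index" idea concrete, here is a rough sketch of how it might look as a separate Logstash pipeline. It is only an outline under assumptions: it uses the newer hosts option names rather than host/port/protocol, it assumes the elasticsearch filter supports index and the pair form of fields, and user_email is a placeholder field name. It still issues one lookup per event, so it moves the load off the ingest path rather than removing it.

```
input {
  elasticsearch {
    hosts => ["**********"]
    index => "user-events"
    # Only read events that have no user_email yet (hypothetical field name).
    query => '{ "query": { "bool": { "must_not": { "exists": { "field": "user_email" } } } } }'
    # Keep the source document's _index and _id so the output can update it in place.
    docinfo => true
    docinfo_target => "[@metadata][doc]"
  }
}

filter {
  elasticsearch {
    hosts => ["**********"]
    index => "users"
    query => "id:%{user_id} AND environment:%{environment}"
    fields => { "email" => "user_email" }
  }
}

output {
  elasticsearch {
    hosts => ["**********"]
    # Write back to the same document the event was read from.
    index => "%{[@metadata][doc][_index]}"
    document_id => "%{[@metadata][doc][_id]}"
    action => "update"
    doc_as_upsert => true
  }
}
```

Because the update sends the whole event as the partial document, any fields Logstash adds along the way would be written back as well, so those may need pruning before the output.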

Note also that in this scenario we are just pulling in the email field, so one might suggest "just put the email field into the log at the source (i.e., on the machine creating the log)". This is a simplified example that brings in 1 field from 1 index; we want to bring in more than 10 fields from multiple indexes. Besides, most of the data we actually want to enrich the log with doesn't exist at the source.

