Joining Columns


#1

Hi :),

I already tried my luck on StackOverflow but unfortunately without any result. I struggle with some basics and I don't know how to solve the following problem.

I have given the following data structure in Elasticsearch:

sender_email: someone@somewhere
receiver_email: anotherone@somewhereelse
send_date: yyyy-mm-dd
send_time: hh:mm:ss

...and other stuff...

How to answer, for instance, this question: Return all email addresses which occurs more than X times. This seems very trivial but for instance how to join 'sender_email' and 'receiver_email' so that we have just one column? How to apply a 'where' clause where I can set a threshold?

How would the solution look like in Kibana and how on the console with curl?

Many thanks!


(Thomas Neirynck) #2

Hi Adello,

For filtering results, you can look for examples of Elasticsearch filters here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html.

To concatenate columns you will have to use a scripted field in Kibana. In the latest 4.6 release of Kibana, we only support numerical expressions (so no String concatenation), but starting from the 5.0 beta version (currently still in alpha), you will be able to use Painless (a scripting language similar to Groovy) or Groovy.

e.g.: in 5.0, you will be able to store an expression like this as a scripted field:

doc['sender_email'].value + ' ' + doc['receiver_email'].value

For more information on scripted fields: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html.


#3

Hi Thomas,

thank you for your fast reply. The title of the post is misleading maybe a little bit. What I actually really try to do are some basic statistics. I want to figure out, how many times a specific email address is used (for billing reasons for instance). Here, give me all email addresses which were used at least X times ... but because I have 2 columns (just with different meanings) I have to take both columns into account (that's why the title 'joining columns').

Example:

sender_email      receiver_email
 worker@home          boss@work
 worker@home          friend@vacation
 coworker@work        worker@home

In this example I used "worker@home" 3 times. How can I define a query that returns me all email addresses which were used at least 3 times?


(Thomas Neirynck) #4

Hi Adello,

Thanks for the extra info.

Actually, you can continue using scripted fields for that. E.g., after you created one in Kibana, you can use the Datatable visualization to perform such an aggregation. Specifically, you would select the scripted field (the concatenation) from the drop down, and then aggregate with the count.

However, at its core, this isn't a Kibana issue. While Kibana provides you with a wizard to create such a scripted field, you can do this with the search API from Elasticsearch directly using the "Aggregations" feature. The aggregations are used to do things like counts, averages, ... on a field (or in your case, a concatenation of two fields). See https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-aggregations-metrics-valuecount-aggregation.html#_script_10 for some examples, including one that uses a script.

Starting from the example from that link, in your case, it probably would look something like:

{
    ...,

    "aggs" : {
        "avg_grade" : { "avg" : { "script" : "doc['sender_email'].value + ' ' + doc['receiver_email'].value" } }
    }
}

#5

I'd like to share my solution of the above mentioned problem. The solution is to put both email addresses into one additional field. For instance like that:

 sender_email      receiver_email           email_addresses
 worker@home          boss@work                [worker@home,boss@work]
 worker@home          friend@vacation          [worker@home,friend@vacation]
 coworker@work        worker@home              [coworker@work,worker@home]

Now I can use the following query/aggregation in sense:

GET /mail/billing/_search
{
  "size": 0, 
  "aggs": {
    "email_statistic": {
      "terms": {
        "field": "email_addresses",
        "min_doc_count": 2
      }
    }
  }
}

What I don't like in this approach is, that I have to create a special field which contains all email addresses. Does anyone have a better idea?


(system) #6