Logstash: Index complex structure from jdbc input

Hi,

I'd like to inject data from sql databases into Elasticsearch.

Let say I have 3 tables: Company, Department and Employee.
Each company has several department and each department has several employees.
This example has 2 level of nested informations.

I'd like to build and index a Company document like this:

   {
"comp_name": "World Company",
"comp_address": " 123 blablabla street...",
"departments": [
 {
     "dep_id": 1,
     "dep_name": "Human ressources",
     "employees": [
           {   "name": "John Smith", "age": "35", ...   },
           {   "name": "Peter Nash", "age": "46", ...   }
           ...
      ]  
 },
 
 {
     "dep_id": 2,
     "dep_name": "Accounting",
     "employees": [
           {   "name": "Lisa  Paulson", "age": "26", ...   },
           {   "name": "Mary Davids", "age": "32", ...   }
           ...
      ]  
 }
...
 ]
}

I cannot find a Logstash example explaining how to aggregate a TWO LEVEL nested structure.

Can anyone help me ?

Regards,

Lucas

Why are you looking to index a whole company as a single document? How do you want to search your data?

Would it not make sense to denormalise and store each employee as a separate document together with the relevant company and department information (which I would expect to rarely change)?

Hi Christian,

I want to be able to:

  • Search in which company/dept is working an employee
  • Find all companies having an employee whose lastname is equal to xxx
  • Find all companies where a specific employee is working
  • Find all companies whose employees having less than 35 years represent more than 50% of the company's staff
  • Get the average age of company's/dept's employees
  • Display employees repartition in all depts of a company by gender, age, etc.
  • Display all informations from a company (Kibana or other)

Initially, i wanted to create a company document with parent-child relations for department and parent-child relation between dept and employees but that functionality was deprecated in last versions of Elastic.

Lucas

I would recommend looking into denormalising it the way I described earlier. As Elasticsearch is not a relational system, it is important to not try to model data in it based on a relational mindset and try using nested documents and/or parent-child as a replacement for foreign keys.

Hi Christian,

It seems that the parent-child relation has been replaced by the join relations which is only used to build relations between documents of the same type which is clearly not my case.

As far as I understand the way nested documents works, it does not really change my problem as they are embedded in the whole document.

I do not necessarily think any of those approaches are required here and would recommend you to look into/consider flattening/denormalizing the model and index a document per employee something like this:

{
  "comp_name": "World Company",
  "comp_address": " 123 blablabla street...",
  "dep_id": 1,
  "dep_name": "Human ressources",
  "employee": {   
    "name": "John Smith",
    "age": "35"
  }
}

This will make querying as well as indexing a lot easier and most likely also more efficient.

Ok I understand what you mean but in this case, creating a document for each combination of employee-department-company isn't replicating too much the data?

As the employee is the grandchild of the structure, for each employee, the document will hold the entire company and department which will be repeated for each employee!

Is Elasticsearch suited for such data replication?

I guess that the id of my document must be autogenerated then?

Denormalising data this is very common when modelling data in Elasticsearch, and Elasticsearch is able to handle this quite well. I would recommend trying it out to see how it performs.

Ok so i'm trying to reproduce your proposition but when i inject the data, logstash is not respecting the structure of the employee field.

Here is my aggregate function:

    filter{
  aggregate{
    task_id => "%{taskid}"
    code => "
    map[‘comp_id’] = event.get(‘comp_id’)
    map[‘comp_name’] = event.get(‘comp_name’)
    map[‘dept_id’] = event.get(‘dept_id’)
    map[‘dept_name’] = event.get(‘dept_name’)
    map[‘employee’] << {
      ‘emp_id’ => event.get(‘emp_id’),
      ‘emp_name’ => event.get(‘emp_name’),
      ‘emp_age’ => event.get(‘emp_age'),
    }
    event.cancel();
    "
    push_previous_map_as_event => true
    timeout => 60
  }
}

And here is a sample of what was indexed

{
  "comp_name": "World Company",
  "comp_address": " 123 blablabla street...",
  "dep_id": 1,
  "dep_name": "Human ressources",
  "emp_id »: "A454607 »
  "emp_name": "John Smith",
  "emp_age": "35 »,
  }

}

So the employee level is not there.

Any idea of what's wrong?

Why are you using the aggregate filter? Why not use a perform a query joining all tables so you get the right structure directly from the database?

I just removed all the filter block but the hierarchy of my document is still not respected. Il is flatened and i don't have the employee level in de document??

Can you show us the full config and what the data looks like?

The data are from a select where everything is like in the document meaning I have no employee reference.

Column are:

comp_id |comp_name | dep_id | dep_name | emp_id | emp_name | emp_age |

Here is logstash config

input {
  jdbc {
jdbc_connection_string => "jdbc:sqlserver://192.168.1.99:1433;database=companies"
jdbc_driver_library => "/usr/local/Cellar/logstash/6.2.4/bin/mssql-jdbc-7.0.0.jre8.jar"
jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
jdbc_user => "my_user"
jdbc_password => "my_password"

statement_filepath => "/usr/local/Cellar/logstash/6.2.4/data/emp.sql"

  }
}


output {
  stdout { codec => rubydebug }
  elasticsearch {
document_type => "employe_line"
hosts => ["localhost:9200"]
index => "employee"
codec => json
  }
}

And here is the mapping of the employee index

PUT /employe
{
  "mappings": {
"employe_line": {
  "properties": {
    "comp_id": {
      "type": "integer"
    },
    "comp_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "dept_id": {
      "type": "integer"
    },
    "dept_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "employee": {
      "properties": {
        "emp_id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "emp_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "emp_age": {
          "type": "integer"
        }
      }
    }
  }
}
  }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.