I have been exploring and using Elasticsearch for a couple of weeks now,
and coming from a background of exploring, using and tweaking Lucene, Nutch
and Solr for the last 7 years, I am mightily impressed with its simplicity
and elegance.
Of particular interest in our Elasticsearch usage so far has been the JDBC
river plugin, which is, again, a great and elegant extension. Thanks to Jorg
Prante and the others who have contributed to it.
We came across a situation that is probably common among those who are
trying to move content from a traditional RDBMS onto Elasticsearch by
first denormalizing at the DB through queries and then passing the rows
through the river as Structured Objects. The introduction to what is
possible was very clear from the tutorials:
- https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects
- http://elasticsearch-users.115913.n3.nabble.com/Ann-JDBC-River-Plugin-for-ElasticSearch-td4019418.html
But the way rows are grouped by default in Structured Objects does not
cover every denormalization scenario. Let me illustrate with an example:
This is the data I have:
Id  Name          Coursename                   TimesOffered
1   Andrew Ng     Machine Learning             5
1   Andrew Ng     Recommender Systems          5
2   Doug Cutting  Hadoop Internals             12
2   Doug Cutting  Basic of Lucene              25
2   Doug Cutting  Advanced Lucene              5
2   Doug Cutting  Introduction to Apache Avro  5
This data goes into a Structured Object through the following river setup:
{
"type": "jdbc",
"jdbc": {
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url": "jdbc:sqlserver://ServerName:1433;databaseName=DatabaseName",
"user": "UserName",
"password": "password",
"sql": "select name as [person.name], coursename as [person.coursename.name], timesoffered as [person.coursename.count] from courseofferings"
},
"index": {
"index": "people",
"type": "course"
}
}
The way the data gets stored would be like this:
{
"_index": "people",
"_type": "course",
"_id": "1",
"_score": 1,
"_source": {
"person": {
"name": "Andrew Ng",
"coursename": {
"count": 5,
"name": [
"Machine Learning",
"Recommender Systems"
]
}
}
}
},
{
"_index": "people",
"_type": "course",
"_id": "2",
"_score": 1,
"_source": {
"person": {
"name": "Doug Cutting",
"coursename": {
"count": [
12,
25,
5
],
"name": [
"Hadoop Internals",
"Basic of Lucene",
"Advanced Lucene",
"Introduction to Apache Avro"
]
}
}
}
}
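To make the collapsing easier to see, here is a short Python sketch that
imitates the grouping shown above. It is only an emulation inferred from
the stored documents in this message, not the plugin's actual code. The
function name merge_default, the ROWS list and the use of the Id column as
the document key are assumptions made for illustration.

```python
# Rows from the table above: (Id, Name, Coursename, TimesOffered).
ROWS = [
    (1, "Andrew Ng", "Machine Learning", 5),
    (1, "Andrew Ng", "Recommender Systems", 5),
    (2, "Doug Cutting", "Hadoop Internals", 12),
    (2, "Doug Cutting", "Basic of Lucene", 25),
    (2, "Doug Cutting", "Advanced Lucene", 5),
    (2, "Doug Cutting", "Introduction to Apache Avro", 5),
]

# Column aliases used in the river's SQL statement.
KEYS = ("person.name", "person.coursename.name", "person.coursename.count")


def merge_default(rows):
    """Imitate the observed grouping (an assumption, not the plugin's code).

    For each document id, every aliased column keeps only its distinct
    values. A single value stays a scalar, several values become a list.
    """
    collected = {}
    for row_id, *values in rows:
        doc = collected.setdefault(row_id, {key: [] for key in KEYS})
        for key, value in zip(KEYS, values):
            if value not in doc[key]:  # repeated values are dropped here
                doc[key].append(value)
    return {
        row_id: {k: (v[0] if len(v) == 1 else v) for k, v in doc.items()}
        for row_id, doc in collected.items()
    }


if __name__ == "__main__":
    for row_id, doc in merge_default(ROWS).items():
        print(row_id, doc)
```

Each column is merged on its own, so nothing ties a given count back to
the course it belongs to.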
If you look at the stored documents carefully, what we wanted to do was to
group on the person (which works perfectly fine) and then group each course
offered together with the number of times it was offered. That is, for
*coursename* to have a repeating inner structure that combines *name* and
*count*, not name and count as separate arrays. More like what can be seen
below:
{
"_index": "people",
"_type": "course_ideal",
"_id": "gglVnNnMQw6DGexWdoP5vg",
"_score": 1,
"_source": {
"person": {
"name": "Andrew Ng",
"coursename": [
{
"count": 5,
"name": "Machine Learning"
},
{
"count": 5,
"name": "Recommender Systems"
}
]
}
}
},
{
"_index": "people",
"_type": "course_ideal",
"_id": "3uJQdZhVR5CDPTGELx9nMA",
"_score": 1,
"_source": {
"person": {
"name": "Doug Cutting",
"coursename": [
{
"count": 12,
"name": "Hadoop Internals"
},
{
"count": 25,
"name": "Basic of Lucene"
},
{
"count": 5,
"name": "Advanced Lucene"
},
{
"count": 5,
"name": "Introduction to Apache Avro"
}
]
}
}
}
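For reference, the grouping we are after can be sketched in a few lines of
Python. This is a client-side illustration only and says nothing about how
the river could do it. The function name group_courses and the ROWS list
are assumptions made for the example.

```python
import json

# Rows from the table above: (Id, Name, Coursename, TimesOffered).
ROWS = [
    (1, "Andrew Ng", "Machine Learning", 5),
    (1, "Andrew Ng", "Recommender Systems", 5),
    (2, "Doug Cutting", "Hadoop Internals", 12),
    (2, "Doug Cutting", "Basic of Lucene", 25),
    (2, "Doug Cutting", "Advanced Lucene", 5),
    (2, "Doug Cutting", "Introduction to Apache Avro", 5),
]


def group_courses(rows):
    """Group rows by person id, keeping each course paired with its count."""
    people = {}
    for row_id, name, course, count in rows:
        person = people.setdefault(row_id, {"name": name, "coursename": []})
        # One inner object per row, so repeated counts are not lost.
        person["coursename"].append({"count": count, "name": course})
    return {row_id: {"person": person} for row_id, person in people.items()}


if __name__ == "__main__":
    print(json.dumps(group_courses(ROWS), indent=2))
```

Because every row contributes its own inner object, Doug Cutting keeps all
four courses with their counts, including the repeated 5.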
With the default approach, we lose the pairing between each course and its
count, and also some content, because only distinct values are kept: Andrew
Ng's two counts of 5 collapse into a single 5, and Doug Cutting's four
counts become [12, 25, 5]. I can understand why this happens: the
underlying mapping is the same for both document shapes, which leads to
ambiguity (and other cases would still require the current default):
{
"course": {
"properties": {
"person": {
"properties": {
"coursename": {
"properties": {
"count": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"name": {
"type": "string"
}
}
}
}
}
}
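The ambiguity can be checked with a small Python sketch: a default object
mapping has no separate notion of an array, so only the field paths
matter, and both document shapes produce the same paths. The helper name
leaf_paths and the two sample documents are assumptions made for
illustration.

```python
def leaf_paths(value, prefix=""):
    """Return the set of dotted field paths in a document, ignoring arrays."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            paths |= leaf_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    if isinstance(value, list):
        # An array contributes the paths of its items under the same prefix.
        paths = set()
        for item in value:
            paths |= leaf_paths(item, prefix)
        return paths
    return {prefix}


# Shape produced by the default grouping (object holding arrays).
DEFAULT_DOC = {
    "person": {
        "name": "Andrew Ng",
        "coursename": {
            "count": 5,
            "name": ["Machine Learning", "Recommender Systems"],
        },
    }
}

# Shape we would like to have (array of objects).
IDEAL_DOC = {
    "person": {
        "name": "Andrew Ng",
        "coursename": [
            {"count": 5, "name": "Machine Learning"},
            {"count": 5, "name": "Recommender Systems"},
        ],
    }
}

if __name__ == "__main__":
    print(sorted(leaf_paths(DEFAULT_DOC)))
    print(leaf_paths(DEFAULT_DOC) == leaf_paths(IDEAL_DOC))
```

Since the paths are identical, the mapping alone cannot tell the river
which of the two shapes is wanted.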
Can we do something through a convention in the query itself to control
how things are grouped? Or am I missing a convention that already exists?
Regards.