A potential extension to the Structured Objects of JDBC River?


(Venkateshprasanna) #1

I have been exploring and using Elasticsearch for a couple of weeks now,
and coming from a background of exploring, using and tweaking Lucene, Nutch
and Solr for the last 7 years, I am mightily impressed with its simplicity
and elegance.

Of particular interest in our Elasticsearch usage so far has been the JDBC
river plugin, which is again, a great elegant extension. Thanks to Jorg
Prante and others who have contributed there.

We came across a situation that might be very common among those who are
trying to move content from their traditional RDBMS on to elasticsearch, by
first denormalizing at the DB through queries and passing it on through the
river as Structured Objects. Introduction to what is possible was very
clear from the tutorials:

  1. https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects
  2. http://elasticsearch-users.115913.n3.nabble.com/Ann-JDBC-River-Plugin-for-ElasticSearch-td4019418.html

But the way things are grouped by default in the Structured Objects might
not help completely in all denormalization scenarios. Let me highlight with
an example:

This is the data I have:

Id Name Coursename TimesOffered
1 Andrew Ng Machine Learning 5
1 Andrew Ng Recommender Systems 5
2 Doug Cutting Hadoop Internals 12
2 Doug Cutting Basic of Lucene 25
2 Doug Cutting Advanced Lucene 5
2 Doug Cutting Introduction to Apache Avro 5

So, this data would go into the Structured Object through the river set up:

{
"type": "jdbc",
"jdbc": {
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url": "jdbc:sqlserver://ServerName:1433;databaseName=DatabaseName",
"user": "UserName",
"password": "password",
"sql": "select name as [person.name], coursename as
[person.coursename.name], timesoffered as [person.coursename.count] from
courseofferings"
},
"index": {
"index": "people",
"type": "course",
}
}

The way the data gets stored would be like this:

     {
        "_index": "people",
        "_type": "course",
        "_id": "1",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": {
                 "count": 5,
                 "name": [
                    "Machine Learning",
                    "Recommender Systems"
                 ]
              }
           }
        }
     },
     {
        "_index": "people",
        "_type": "course",
        "_id": "2",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": {
                 "count": [
                    12,
                    25,
                    5
                 ],
                 "name": [
                    "Hadoop Internals",
                    "Basic of Lucene",
                    "Advanced Lucene",
                    "Introduction to Apache Avro"
                 ]
              }
           }
        }
     }

If you look at this carefully, what we wanted to do was to group on the
person (which is perfectly fine) and then group the course offered and the
number of times it was offered. That is, for *coursename *to have a
repeating inner structure of a combination of *name *and count, not name
and count separately as arrays. More like what can be seen below:

     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "gglVnNnMQw6DGexWdoP5vg",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": [
                 {
                    "count": 5,
                    "name": "Machine Learning"
                 },
                 {
                    "count": 5,
                    "name": "Recommender Systems"
                 }
              ]
           }
        }
     },
     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "3uJQdZhVR5CDPTGELx9nMA",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": [
                 {
                    "count": 12,
                    "name": "Hadoop Internals"
                 },
                 {
                    "count": 25,
                    "name": "Basic of Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Advanced Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Introduction to Apache Avro"
                 }
              ]
           }
        }
     }

With the default approach, we are losing the grouping as well as some
content, as only distinct counts are being considered. I can understand why
this would happen, as the underlying JSON mapping scheme for both cases
would be the same and would lead to ambiguity (other cases would require
the current default):

{
"course": {
"properties": {
"person": {
"properties": {
"coursename": {
"properties": {
"count": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"name": {
"type": "string"
}
}
}
}
}
}

Can we do something in terms of the convention in specifying the query
itself, to group things better? Or am I missing a convention that already
exists?

Regards.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

I have released JDBC river version 2.2.2 where inner object array creation
is implemented, as described in the request.

To control the creation, special SQL column names with square bracket
notation are introduced. They define the name of an object in an array
list. If this object has already a value, a next array element is created.

Example:
https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/test/java/org/xbib/elasticsearch/river/jdbc/support/ValueListenerTests.java#L193

Jörg

On Thu, Oct 10, 2013 at 7:42 AM, Venkateshprasanna hmvprasanna@gmail.comwrote:

I have been exploring and using Elasticsearch for a couple of weeks now,
and coming from a background of exploring, using and tweaking Lucene, Nutch
and Solr for the last 7 years, I am mightily impressed with its simplicity
and elegance.

Of particular interest in our Elasticsearch usage so far has been the JDBC
river plugin, which is again, a great elegant extension. Thanks to Jorg
Prante and others who have contributed there.

We came across a situation that might be very common among those who are
trying to move content from their traditional RDBMS on to elasticsearch, by
first denormalizing at the DB through queries and passing it on through the
river as Structured Objects. Introduction to what is possible was very
clear from the tutorials:

https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects
2.
http://elasticsearch-users.115913.n3.nabble.com/Ann-JDBC-River-Plugin-for-ElasticSearch-td4019418.html

But the way things are grouped by default in the Structured Objects might
not help completely in all denormalization scenarios. Let me highlight with
an example:

This is the data I have:

Id Name Coursename TimesOffered
1 Andrew Ng Machine Learning 5
1 Andrew Ng Recommender Systems 5
2 Doug Cutting Hadoop Internals 12
2 Doug Cutting Basic of Lucene 25
2 Doug Cutting Advanced Lucene 5
2 Doug Cutting Introduction to Apache Avro 5

So, this data would go into the Structured Object through the river set up:

{
"type": "jdbc",
"jdbc": {
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url": "jdbc:sqlserver://ServerName:1433;databaseName=DatabaseName",
"user": "UserName",
"password": "password",
"sql": "select name as [person.name], coursename as [
person.coursename.name], timesoffered as [person.coursename.count] from
courseofferings"
},
"index": {
"index": "people",
"type": "course",
}
}

The way the data gets stored would be like this:

     {
        "_index": "people",
        "_type": "course",
        "_id": "1",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": {
                 "count": 5,
                 "name": [
                    "Machine Learning",
                    "Recommender Systems"
                 ]
              }
           }
        }
     },
     {
        "_index": "people",
        "_type": "course",
        "_id": "2",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": {
                 "count": [
                    12,
                    25,
                    5
                 ],
                 "name": [
                    "Hadoop Internals",
                    "Basic of Lucene",
                    "Advanced Lucene",
                    "Introduction to Apache Avro"
                 ]
              }
           }
        }
     }

If you look at this carefully, what we wanted to do was to group on the
person (which is perfectly fine) and then group the course offered and the
number of times it was offered. That is, for *coursename *to have a
repeating inner structure of a combination of *name *and count, not
name and count separately as arrays. More like what can be seen below:

     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "gglVnNnMQw6DGexWdoP5vg",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": [
                 {
                    "count": 5,
                    "name": "Machine Learning"
                 },
                 {
                    "count": 5,
                    "name": "Recommender Systems"
                 }
              ]
           }
        }
     },
     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "3uJQdZhVR5CDPTGELx9nMA",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": [
                 {
                    "count": 12,
                    "name": "Hadoop Internals"
                 },
                 {
                    "count": 25,
                    "name": "Basic of Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Advanced Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Introduction to Apache Avro"
                 }
              ]
           }
        }
     }

With the default approach, we are losing the grouping as well as some
content, as only distinct counts are being considered. I can understand why
this would happen, as the underlying JSON mapping scheme for both cases
would be the same and would lead to ambiguity (other cases would require
the current default):

{
"course": {
"properties": {
"person": {
"properties": {
"coursename": {
"properties": {
"count": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"name": {
"type": "string"
}
}
}
}
}
}

Can we do something in terms of the convention in specifying the query
itself, to group things better? Or am I missing a convention that already
exists?

Regards.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Venkateshprasanna) #3

This looks great Jörg, will try this right away! Thanks.

On Monday, October 14, 2013 3:54:40 AM UTC+5:30, Jörg Prante wrote:

I have released JDBC river version 2.2.2 where inner object array creation
is implemented, as described in the request.

https://github.com/jprante/elasticsearch-river-jdbc

To control the creation, special SQL column names with square bracket
notation are introduced. They define the name of an object in an array
list. If this object has already a value, a next array element is created.

Example:
https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/test/java/org/xbib/elasticsearch/river/jdbc/support/ValueListenerTests.java#L193

Jörg

On Thu, Oct 10, 2013 at 7:42 AM, Venkateshprasanna <hmvpr...@gmail.com<javascript:>

wrote:

I have been exploring and using Elasticsearch for a couple of weeks now,
and coming from a background of exploring, using and tweaking Lucene, Nutch
and Solr for the last 7 years, I am mightily impressed with its simplicity
and elegance.

Of particular interest in our Elasticsearch usage so far has been the
JDBC river plugin, which is again, a great elegant extension. Thanks to
Jorg Prante and others who have contributed there.

We came across a situation that might be very common among those who are
trying to move content from their traditional RDBMS on to elasticsearch, by
first denormalizing at the DB through queries and passing it on through the
river as Structured Objects. Introduction to what is possible was very
clear from the tutorials:

https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects
2.
http://elasticsearch-users.115913.n3.nabble.com/Ann-JDBC-River-Plugin-for-ElasticSearch-td4019418.html

But the way things are grouped by default in the Structured Objects might
not help completely in all denormalization scenarios. Let me highlight with
an example:

This is the data I have:

Id Name Coursename TimesOffered
1 Andrew Ng Machine Learning 5
1 Andrew Ng Recommender Systems 5
2 Doug Cutting Hadoop Internals 12
2 Doug Cutting Basic of Lucene 25
2 Doug Cutting Advanced Lucene 5
2 Doug Cutting Introduction to Apache Avro 5

So, this data would go into the Structured Object through the river set
up:

{
"type": "jdbc",
"jdbc": {
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url": "jdbc:sqlserver://ServerName:1433;databaseName=DatabaseName",
"user": "UserName",
"password": "password",
"sql": "select name as [person.name], coursename as [
person.coursename.name], timesoffered as [person.coursename.count] from
courseofferings"
},
"index": {
"index": "people",
"type": "course",
}
}

The way the data gets stored would be like this:

     {
        "_index": "people",
        "_type": "course",
        "_id": "1",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": {
                 "count": 5,
                 "name": [
                    "Machine Learning",
                    "Recommender Systems"
                 ]
              }
           }
        }
     },
     {
        "_index": "people",
        "_type": "course",
        "_id": "2",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": {
                 "count": [
                    12,
                    25,
                    5
                 ],
                 "name": [
                    "Hadoop Internals",
                    "Basic of Lucene",
                    "Advanced Lucene",
                    "Introduction to Apache Avro"
                 ]
              }
           }
        }
     }

If you look at this carefully, what we wanted to do was to group on the
person (which is perfectly fine) and then group the course offered and the
number of times it was offered. That is, for *coursename *to have a
repeating inner structure of a combination of *name *and count, not
name and count separately as arrays. More like what can be seen below:

     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "gglVnNnMQw6DGexWdoP5vg",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": [
                 {
                    "count": 5,
                    "name": "Machine Learning"
                 },
                 {
                    "count": 5,
                    "name": "Recommender Systems"
                 }
              ]
           }
        }
     },
     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "3uJQdZhVR5CDPTGELx9nMA",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": [
                 {
                    "count": 12,
                    "name": "Hadoop Internals"
                 },
                 {
                    "count": 25,
                    "name": "Basic of Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Advanced Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Introduction to Apache Avro"
                 }
              ]
           }
        }
     }

With the default approach, we are losing the grouping as well as some
content, as only distinct counts are being considered. I can understand why
this would happen, as the underlying JSON mapping scheme for both cases
would be the same and would lead to ambiguity (other cases would require
the current default):

{
"course": {
"properties": {
"person": {
"properties": {
"coursename": {
"properties": {
"count": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"name": {
"type": "string"
}
}
}
}
}
}

Can we do something in terms of the convention in specifying the query
itself, to group things better? Or am I missing a convention that already
exists?

Regards.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Venkateshprasanna) #4

Looking at the test cases, this scenario must work if I change the river
query from:

"sql": "select name as [person.name], coursename as [person.coursename.name],
timesoffered as [person.coursename.count] from courseofferings"

to:

"sql": "select name as [person.name], coursename as "person.coursename[namehttp://person.coursename.name/]",
timesoffered as "person.coursename[count]" from courseofferings"

Of course, it works like a charm. Also putting this in here as a line of
documentation for this feature.

On Tuesday, October 15, 2013 5:38:00 AM UTC+5:30, Venkateshprasanna wrote:

This looks great Jörg, will try this right away! Thanks.

On Monday, October 14, 2013 3:54:40 AM UTC+5:30, Jörg Prante wrote:

I have released JDBC river version 2.2.2 where inner object array
creation is implemented, as described in the request.

https://github.com/jprante/elasticsearch-river-jdbc

To control the creation, special SQL column names with square bracket
notation are introduced. They define the name of an object in an array
list. If this object has already a value, a next array element is created.

Example:
https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/test/java/org/xbib/elasticsearch/river/jdbc/support/ValueListenerTests.java#L193

Jörg

On Thu, Oct 10, 2013 at 7:42 AM, Venkateshprasanna hmvpr...@gmail.comwrote:

I have been exploring and using Elasticsearch for a couple of weeks now,
and coming from a background of exploring, using and tweaking Lucene, Nutch
and Solr for the last 7 years, I am mightily impressed with its simplicity
and elegance.

Of particular interest in our Elasticsearch usage so far has been the
JDBC river plugin, which is again, a great elegant extension. Thanks to
Jorg Prante and others who have contributed there.

We came across a situation that might be very common among those who are
trying to move content from their traditional RDBMS on to elasticsearch, by
first denormalizing at the DB through queries and passing it on through the
river as Structured Objects. Introduction to what is possible was very
clear from the tutorials:

https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects
2.
http://elasticsearch-users.115913.n3.nabble.com/Ann-JDBC-River-Plugin-for-ElasticSearch-td4019418.html

But the way things are grouped by default in the Structured Objects
might not help completely in all denormalization scenarios. Let me
highlight with an example:

This is the data I have:

Id Name Coursename TimesOffered
1 Andrew Ng Machine Learning 5
1 Andrew Ng Recommender Systems 5
2 Doug Cutting Hadoop Internals 12
2 Doug Cutting Basic of Lucene 25
2 Doug Cutting Advanced Lucene 5
2 Doug Cutting Introduction to Apache Avro 5

So, this data would go into the Structured Object through the river set
up:

{
"type": "jdbc",
"jdbc": {
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url": "jdbc:sqlserver://ServerName:1433;databaseName=DatabaseName",
"user": "UserName",
"password": "password",
"sql": "select name as [person.name], coursename as [
person.coursename.name], timesoffered as [person.coursename.count] from
courseofferings"
},
"index": {
"index": "people",
"type": "course",
}
}

The way the data gets stored would be like this:

     {
        "_index": "people",
        "_type": "course",
        "_id": "1",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": {
                 "count": 5,
                 "name": [
                    "Machine Learning",
                    "Recommender Systems"
                 ]
              }
           }
        }
     },
     {
        "_index": "people",
        "_type": "course",
        "_id": "2",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": {
                 "count": [
                    12,
                    25,
                    5
                 ],
                 "name": [
                    "Hadoop Internals",
                    "Basic of Lucene",
                    "Advanced Lucene",
                    "Introduction to Apache Avro"
                 ]
              }
           }
        }
     }

If you look at this carefully, what we wanted to do was to group on the
person (which is perfectly fine) and then group the course offered and the
number of times it was offered. That is, for *coursename *to have a
repeating inner structure of a combination of *name *and count, not
name and count separately as arrays. More like what can be seen below:

     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "gglVnNnMQw6DGexWdoP5vg",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": [
                 {
                    "count": 5,
                    "name": "Machine Learning"
                 },
                 {
                    "count": 5,
                    "name": "Recommender Systems"
                 }
              ]
           }
        }
     },
     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "3uJQdZhVR5CDPTGELx9nMA",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": [
                 {
                    "count": 12,
                    "name": "Hadoop Internals"
                 },
                 {
                    "count": 25,
                    "name": "Basic of Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Advanced Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Introduction to Apache Avro"
                 }
              ]
           }
        }
     }

With the default approach, we are losing the grouping as well as some
content, as only distinct counts are being considered. I can understand why
this would happen, as the underlying JSON mapping scheme for both cases
would be the same and would lead to ambiguity (other cases would require
the current default):

{
"course": {
"properties": {
"person": {
"properties": {
"coursename": {
"properties": {
"count": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"name": {
"type": "string"
}
}
}
}
}
}

Can we do something in terms of the convention in specifying the query
itself, to group things better? Or am I missing a convention that already
exists?

Regards.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(George Ogata) #5

Jörg,

Thanks a ton - this is going to be extremely helpful for us.

I was wondering: should this allow nested documents with both nested and
non-nested array fields? For example, if I have this table:

mysql> select * from things;
+-----+---+---+
| _id | a | b |
+-----+---+---+
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 1 |
| 1 | 2 | 2 |
+-----+---+---+
4 rows in set (0.01 sec)

And this river:

$ curl -s localhost:9200/_river/things_things/_meta | jq .
{
"_source": {
"index": {
"type": "things",
"index": "things"
},
"jdbc": {
"autocommit": true,
"strategy": "oneshot",
"sql": "SELECT _id, a AS a[], b AS nested[b] FROM things ORDER BY
_id",
"password": null,
"user": "root",
"url": "jdbc:mysql://localhost:3306/haws",
"driver": "com.mysql.jdbc.Driver"
},
"type": "jdbc"
},
"exists": true,
"_version": 1,
"_id": "_meta",
"_type": "things_things",
"_index": "_river"
}

I end up with:

$ curl -s localhost:9200/things/things/1 | jq .
{
"_source": {
"a": [
"1",
"2"
],
"nested": [
{
"b": 1
},
{
"b": 2
},
{
"b": 1
},
{
"b": 2
}
]
},
"exists": true,
"_version": 1,
"_id": "1",
"_type": "things",
"_index": "things"
}

I'd like a way to get:

{
"_source": {
"a": [
"1",
"2"
],
"nested": [
{
"b": [1, 2]
},
]
},
"exists": true,
"_version": 1,
"_id": "1",
"_type": "things",
"_index": "things"
}

On Sun, Oct 13, 2013 at 6:24 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

I have released JDBC river version 2.2.2 where inner object array creation
is implemented, as described in the request.

https://github.com/jprante/elasticsearch-river-jdbc

To control the creation, special SQL column names with square bracket
notation are introduced. They define the name of an object in an array
list. If this object has already a value, a next array element is created.

Example:
https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/test/java/org/xbib/elasticsearch/river/jdbc/support/ValueListenerTests.java#L193

Jörg

On Thu, Oct 10, 2013 at 7:42 AM, Venkateshprasanna hmvprasanna@gmail.comwrote:

I have been exploring and using Elasticsearch for a couple of weeks now,
and coming from a background of exploring, using and tweaking Lucene, Nutch
and Solr for the last 7 years, I am mightily impressed with its simplicity
and elegance.

Of particular interest in our Elasticsearch usage so far has been the
JDBC river plugin, which is again, a great elegant extension. Thanks to
Jorg Prante and others who have contributed there.

We came across a situation that might be very common among those who are
trying to move content from their traditional RDBMS on to elasticsearch, by
first denormalizing at the DB through queries and passing it on through the
river as Structured Objects. Introduction to what is possible was very
clear from the tutorials:

https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects
2.
http://elasticsearch-users.115913.n3.nabble.com/Ann-JDBC-River-Plugin-for-ElasticSearch-td4019418.html

But the way things are grouped by default in the Structured Objects might
not help completely in all denormalization scenarios. Let me highlight with
an example:

This is the data I have:

Id Name Coursename TimesOffered
1 Andrew Ng Machine Learning 5
1 Andrew Ng Recommender Systems 5
2 Doug Cutting Hadoop Internals 12
2 Doug Cutting Basic of Lucene 25
2 Doug Cutting Advanced Lucene 5
2 Doug Cutting Introduction to Apache Avro 5

So, this data would go into the Structured Object through the river set
up:

{
"type": "jdbc",
"jdbc": {
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url": "jdbc:sqlserver://ServerName:1433;databaseName=DatabaseName",
"user": "UserName",
"password": "password",
"sql": "select name as [person.name], coursename as [
person.coursename.name], timesoffered as [person.coursename.count] from
courseofferings"
},
"index": {
"index": "people",
"type": "course",
}
}

The way the data gets stored would be like this:

     {
        "_index": "people",
        "_type": "course",
        "_id": "1",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": {
                 "count": 5,
                 "name": [
                    "Machine Learning",
                    "Recommender Systems"
                 ]
              }
           }
        }
     },
     {
        "_index": "people",
        "_type": "course",
        "_id": "2",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": {
                 "count": [
                    12,
                    25,
                    5
                 ],
                 "name": [
                    "Hadoop Internals",
                    "Basic of Lucene",
                    "Advanced Lucene",
                    "Introduction to Apache Avro"
                 ]
              }
           }
        }
     }

If you look at this carefully, what we wanted to do was to group on the
person (which is perfectly fine) and then group the course offered and the
number of times it was offered. That is, for *coursename *to have a
repeating inner structure of a combination of *name *and count, not
name and count separately as arrays. More like what can be seen below:

     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "gglVnNnMQw6DGexWdoP5vg",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Andrew Ng",
              "coursename": [
                 {
                    "count": 5,
                    "name": "Machine Learning"
                 },
                 {
                    "count": 5,
                    "name": "Recommender Systems"
                 }
              ]
           }
        }
     },
     {
        "_index": "people",
        "_type": "course_ideal",
        "_id": "3uJQdZhVR5CDPTGELx9nMA",
        "_score": 1,
        "_source": {
           "person": {
              "name": "Doug Cutting",
              "coursename": [
                 {
                    "count": 12,
                    "name": "Hadoop Internals"
                 },
                 {
                    "count": 25,
                    "name": "Basic of Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Advanced Lucene"
                 },
                 {
                    "count": 5,
                    "name": "Introduction to Apache Avro"
                 }
              ]
           }
        }
     }

With the default approach, we are losing the grouping as well as some
content, as only distinct counts are being considered. I can understand why
this would happen, as the underlying JSON mapping scheme for both cases
would be the same and would lead to ambiguity (other cases would require
the current default):

{
"course": {
"properties": {
"person": {
"properties": {
"coursename": {
"properties": {
"count": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"name": {
"type": "string"
}
}
}
}
}
}

Can we do something in terms of the convention in specifying the query
itself, to group things better? Or am I missing a convention that already
exists?

Regards.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(system) #6