Elasticsearch-jdbc plugin fetching only part of data when nesting documents

askawinska · May 13, 2015, 7:58am

Hi everyone,

I have successfully employed elasticsearch-jdbc plugin to pump my data (in river mode) from a Postgres db. Everything ran smoothly until I tried to map my data from joined tables into nested documents - I found the river to be fetching only a part of the embedded objects.

The SQL query, run independently from my postgres console, yields 15 rows. In ES index there seem to be only 6 of them merged into nested documents. (even though ES console claims to have imported 15 rows at a time)

Have I messed up my sql query or is it some bug in elasticsearch-jdbc merging the documents?

Cheers,

Anna

river definition:

curl -XPUT 'localhost:9200/_river/nexit_river/_meta' --data @my_river.json
my_river.json:

{
"type": "jdbc",
"jdbc": [

{
"url": "jdbc:postgresql://localhost/hello",
"user": "hi",
"sql": "SELECT
v1.id as _id,
c.id || 'c' as "tag[id_type]"
FROM exit e
INNER JOIN
venue v1 ON e.id = v1.exit_id
INNER JOIN category_venue cv ON v1.id = cv.venue_id
INNER JOIN category c on c.id = cv.category_id
UNION SELECT
v2.id as _id, a.id || 'a' as "tag[id_type]"
FROM exit e
INNER JOIN venue v2 ON e.id = v2.exit_id
INNER JOIN amenity a ON v2.id = a.venue_id",
"password": "yo",
"treat_binary_as_string": true,
"index": "hello",
"type": "tags",
"fetchsize" : 100000
}
],
"schedule": "0 0-59 0-23 ? * *"
}
SQL query results:

_id; tag[id_type]

1;"2c"
1;"4c"
1;"22a"
1;"1a"
2;"3c"
3;"3c"
1;"2a"
2;"1c"
3;"1c"
4;"4c"
2;"2c"
2;"4c"
3;"4c"
1;"3c"
1;"1c"

elasticsearch query result:

curl 'localhost:9200/hello/tags/_search?pretty'
{

   "took" : 21,
   "timed_out" : false,
   "_shards" : {
     "total" : 5,
     "successful" : 5,
     "failed" : 0
   },
   "hits" : {
     "total" : 4,
     "max_score" : 1.0,
     "hits" : [ {
       "_index" : "hello",
       "_type" : "tags",
       "_id" : "4",
       "_score" : 1.0,
       "_source":{"tag":[{"id_type":"4c"}]}
     }, {
       "_index" : "hello",
       "_type" : "tags",
       "_id" : "1",
       "_score" : 1.0,
       "_source":{"tag":[{"id_type":"3c"},{"id_type":"1c"}]}
     }, {
       "_index" : "hello",
       "_type" : "tags",
       "_id" : "2",
       "_score" : 1.0,
       "_source":{"tag":[{"id_type":"2c"},{"id_type":"4c"}]}
     }, {
       "_index" : "hello",
       "_type" : "tags",
       "_id" : "3",
       "_score" : 1.0,
       "_source":{"tag":[{"id_type":"4c"}]}
     } ]
   }
 }

Elasticsearch console:

[2015-05-13 09:25:00,089][INFO ][river.jdbc.RiverMetrics ] pipeline org.xbib.elasticsearch.plugin.jdbc.RiverPipeline@7ddc13ca complete: river jdbc/hello_river metrics: 15 rows, 0.2862853578324369 mean, (0.0 0.0 0.0), ingest metrics: elapsed 0 seconds, 846.0 bytes bytes, 52.0 bytes avg, 0.009 MB/s
[2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][0] updating refresh_interval from [1s] to [-1]
[2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][1] updating refresh_interval from [1s] to [-1]
[2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][2] updating refresh_interval from [1s] to [-1]
[2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][3] updating refresh_interval from [1s] to [-1]
[2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][4] updating refresh_interval from [1s] to [-1]
[2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][0] updating refresh_interval from [-1] to [1s]
[2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][1] updating refresh_interval from [-1] to [1s]
[2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][2] updating refresh_interval from [-1] to [1s]
[2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][3] updating refresh_interval from [-1] to [1s]
[2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][4] updating refresh_interval from [-1] to [1s]
[2015-05-13 09:26:00,089][INFO ][river.jdbc.RiverMetrics ] pipeline org.xbib.elasticsearch.plugin.jdbc.RiverPipeline@a23e789 complete: river jdbc/hello_river metrics: 30 rows, 0.2669148855001983 mean, (0.0 0.0 0.0), ingest metrics: elapsed 0 seconds, 846.0 bytes bytes, 27.0 bytes avg, 0.009 MB/s

jprante · May 13, 2015, 9:08am

In the rows from the SQL result set, the ids are not monotonic.

For example, the doc id 1 appears three times, at the beginning (with "2c","4c","22a","1a"), in the middle (with "2a"), and at the end (with "3c", "2c").

The order of rows (the doc ids) is important for correct JSON doc construction. The indexing process is not smart and can not reorder the rows for you. For this, you must use SQL ORDER BY clause.

askawinska · May 13, 2015, 11:02am

Many thanks! Worked like a charm.
But then, I assume, updating the nested documents in case the join results change, won't work that easily either?

For documentational purposes, what has helped was:

select * from (
SELECT
      v1.id as _id,
      c.id || 'c' as "tag[id_type]"
      FROM exit e
      INNER JOIN
      venue v1 ON e.id = v1.exit_id
      INNER JOIN category_venue cv ON v1.id = cv.venue_id
      INNER JOIN category c on c.id = cv.category_id
      UNION SELECT
      v2.id as _id, a.id || 'a' as "tag[id_type]"
      FROM exit e
      INNER JOIN venue v2 ON e.id = v2.exit_id
      INNER JOIN amenity a ON v2.id = a.venue_id
      ) result
      order by result._id

jprante · May 13, 2015, 12:07pm

Each doc id is indexed as a new document. Updating a doc by id is not implemented.

Topic		Replies	Views
[Ann] JDBC River Plugin for ElasticSearch Elasticsearch	20	2694	July 6, 2017
Elasticsearch river for postgres Elasticsearch	10	600	July 6, 2017
[ANN] JDBC river 1.0.0.1 released Elasticsearch	6	383	July 6, 2017
ES jdbc river problem Elasticsearch	1	351	July 6, 2017
[ANN] JDBC plugin 1.1.0.2 Elasticsearch	1	296	July 6, 2017

Elasticsearch-jdbc plugin fetching only part of data when nesting documents

Related topics