Elasticsearch-jdbc plugin fetching only part of data when nesting documents


(Anna Skawińska) #1

Hi everyone,

I have successfully employed elasticsearch-jdbc plugin to pump my data (in river mode) from a Postgres db. Everything ran smoothly until I tried to map my data from joined tables into nested documents - I found the river to be fetching only a part of the embedded objects.

The SQL query, run independently from my postgres console, yields 15 rows. In ES index there seem to be only 6 of them merged into nested documents. (even though ES console claims to have imported 15 rows at a time)

Have I messed up my sql query or is it some bug in elasticsearch-jdbc merging the documents?

Cheers,

Anna

  • river definition:

    curl -XPUT 'localhost:9200/_river/nexit_river/_meta' --data @my_river.json

  • my_river.json:

    {
    "type": "jdbc",
    "jdbc": [

    {
    "url": "jdbc:postgresql://localhost/hello",
    "user": "hi",
    "sql": "SELECT
    v1.id as _id,
    c.id || 'c' as "tag[id_type]"
    FROM exit e
    INNER JOIN
    venue v1 ON e.id = v1.exit_id
    INNER JOIN category_venue cv ON v1.id = cv.venue_id
    INNER JOIN category c on c.id = cv.category_id
    UNION SELECT
    v2.id as _id, a.id || 'a' as "tag[id_type]"
    FROM exit e
    INNER JOIN venue v2 ON e.id = v2.exit_id
    INNER JOIN amenity a ON v2.id = a.venue_id",
    "password": "yo",
    "treat_binary_as_string": true,
    "index": "hello",
    "type": "tags",
    "fetchsize" : 100000
    }
    ],
    "schedule": "0 0-59 0-23 ? * *"
    }

  • SQL query results:

_id; tag[id_type]

1;"2c"
1;"4c"
1;"22a"
1;"1a"
2;"3c"
3;"3c"
1;"2a"
2;"1c"
3;"1c"
4;"4c"
2;"2c"
2;"4c"
3;"4c"
1;"3c"
1;"1c"
  • elasticsearch query result:

    curl 'localhost:9200/hello/tags/_search?pretty'
    {

       "took" : 21,
       "timed_out" : false,
       "_shards" : {
         "total" : 5,
         "successful" : 5,
         "failed" : 0
       },
       "hits" : {
         "total" : 4,
         "max_score" : 1.0,
         "hits" : [ {
           "_index" : "hello",
           "_type" : "tags",
           "_id" : "4",
           "_score" : 1.0,
           "_source":{"tag":[{"id_type":"4c"}]}
         }, {
           "_index" : "hello",
           "_type" : "tags",
           "_id" : "1",
           "_score" : 1.0,
           "_source":{"tag":[{"id_type":"3c"},{"id_type":"1c"}]}
         }, {
           "_index" : "hello",
           "_type" : "tags",
           "_id" : "2",
           "_score" : 1.0,
           "_source":{"tag":[{"id_type":"2c"},{"id_type":"4c"}]}
         }, {
           "_index" : "hello",
           "_type" : "tags",
           "_id" : "3",
           "_score" : 1.0,
           "_source":{"tag":[{"id_type":"4c"}]}
         } ]
       }
     }
    
  • Elasticsearch console:

    [2015-05-13 09:25:00,089][INFO ][river.jdbc.RiverMetrics ] pipeline org.xbib.elasticsearch.plugin.jdbc.RiverPipeline@7ddc13ca complete: river jdbc/hello_river metrics: 15 rows, 0.2862853578324369 mean, (0.0 0.0 0.0), ingest metrics: elapsed 0 seconds, 846.0 bytes bytes, 52.0 bytes avg, 0.009 MB/s
    [2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][0] updating refresh_interval from [1s] to [-1]
    [2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][1] updating refresh_interval from [1s] to [-1]
    [2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][2] updating refresh_interval from [1s] to [-1]
    [2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][3] updating refresh_interval from [1s] to [-1]
    [2015-05-13 09:26:00,002][INFO ][index.shard ] [Kraken] [hello][4] updating refresh_interval from [1s] to [-1]
    [2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][0] updating refresh_interval from [-1] to [1s]
    [2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][1] updating refresh_interval from [-1] to [1s]
    [2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][2] updating refresh_interval from [-1] to [1s]
    [2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][3] updating refresh_interval from [-1] to [1s]
    [2015-05-13 09:26:00,048][INFO ][index.shard ] [Kraken] [hello][4] updating refresh_interval from [-1] to [1s]
    [2015-05-13 09:26:00,089][INFO ][river.jdbc.RiverMetrics ] pipeline org.xbib.elasticsearch.plugin.jdbc.RiverPipeline@a23e789 complete: river jdbc/hello_river metrics: 30 rows, 0.2669148855001983 mean, (0.0 0.0 0.0), ingest metrics: elapsed 0 seconds, 846.0 bytes bytes, 27.0 bytes avg, 0.009 MB/s


(Jörg Prante) #2

In the rows from the SQL result set, the ids are not monotonic.

For example, the doc id 1 appears three times, at the beginning (with "2c","4c","22a","1a"), in the middle (with "2a"), and at the end (with "3c", "2c").

The order of rows (the doc ids) is important for correct JSON doc construction. The indexing process is not smart and can not reorder the rows for you. For this, you must use SQL ORDER BY clause.


(Anna Skawińska) #3

Many thanks! Worked like a charm.
But then, I assume, updating the nested documents in case the join results change, won't work that easily either?

For documentational purposes, what has helped was:

select * from (
SELECT
      v1.id as _id,
      c.id || 'c' as "tag[id_type]"
      FROM exit e
      INNER JOIN
      venue v1 ON e.id = v1.exit_id
      INNER JOIN category_venue cv ON v1.id = cv.venue_id
      INNER JOIN category c on c.id = cv.category_id
      UNION SELECT
      v2.id as _id, a.id || 'a' as "tag[id_type]"
      FROM exit e
      INNER JOIN venue v2 ON e.id = v2.exit_id
      INNER JOIN amenity a ON v2.id = a.venue_id
      ) result
      order by result._id

(Jörg Prante) #4

Each doc id is indexed as a new document. Updating a doc by id is not implemented.


(system) #5