JDBC river generates duplicates


(benkunz) #1

Hi,
I am running elasticsearch 0.20.3 and jdbc river 2.0.2.

My goal is to be able to update my index on a regular basis AND also
manually when I need to. But is it doable if I set up a schedule for the
update ?

For the regular basis update I've done this :

curl -XPUT "localhost:9200/_river/article/_meta" -d "{"""type""" :
"""jdbc""","""jdbc""" : {"""strategy""" : """simple""","""driver""" :
"""com.mysql.jdbc.Driver""","""url""" :
"""jdbc:mysql://mysqlserver:3306/dvpt""","""user""" :
"""root""","""password""" : """dvpt""","""sql""" : """select * from
Article""","""poll""" : """10s"""},"""index""" : {"""index""" :
"""jdbc""","""type""" : """article"""}}"

Now if I go to http://localhost:9200/jdbc/article/_search?pretty I get this

{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 840,
"max_score" : 1.0,
"hits" : [ {
"_index" : "jdbc",
"_type" : "article",
"_id" : "yLFEpD_uQLOqTuz1aetkWg",
"_score" : 1.0, "_source" : {"IDarticle":1,"designation":"La marque commercialise","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "e7EBYW_GQAy--9sqvY8-CA",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "QZ5m96PSR72DJYbQusc7XQ",
"_score" : 1.0, "_source" : {"IDarticle":3,"designation":"la production de collage","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "u6-9DBV7SOmuI0vIh1r8ZQ",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "MNiy_xoiTDubO8_95cs7Xg",
"_score" : 1.0, "_source" : {"IDarticle":7,"designation":"excel training","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "T-YdEBS9QhixgMP9tY2t7A",
"_score" : 1.0, "_source" : {"IDarticle":8,"designation":"added at 9:37","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "FBMwENQbTiWBTIJCuqLHeg",
"_score" : 1.0, "_source" : {"IDarticle":1,"designation":"La marque commercialise","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "ZIr-bfOqRQWDGpMyamXuQw",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "iZYJBXmnRsqITzBQU8wUIw",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "TcC5Va-5RzKIkNhQcxB0xw",
"_score" : 1.0, "_source" : {"IDarticle":6,"designation":"freshly added","IDuserSSO":1}
} ]
}
}

As you can see, I have duplicates in there. IDarticle=2 is there 4 times, IDarticle=1 is there twice... In my table Article, I of course dont have duplicates.

Could you also describe the difference between the strategy simple and
table ? When should I user which one and why ?

Thanks for you help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

In your SQL statement, use select IDarticle as _id, ... from ...

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 31 janv. 2013 à 11:39, benkunz benoit.kunz@me.com a écrit :

Hi,
I am running elasticsearch 0.20.3 and jdbc river 2.0.2.

My goal is to be able to update my index on a regular basis AND also manually when I need to. But is it doable if I set up a schedule for the update ?

For the regular basis update I've done this :

curl -XPUT "localhost:9200/_river/article/_meta" -d "{"""type""" : """jdbc""","""jdbc""" : {"""strategy""" : """simple""","""driver""" : """com.mysql.jdbc.Driver""","""url""" : """jdbc:mysql://mysqlserver:3306/dvpt""","""user""" : """root""","""password""" : """dvpt""","""sql""" : """select * from Article""","""poll""" : """10s"""},"""index""" : {"""index""" : """jdbc""","""type""" : """article"""}}"

Now if I go to http://localhost:9200/jdbc/article/_search?pretty I get this

{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 840,
"max_score" : 1.0,
"hits" : [ {
"_index" : "jdbc",
"_type" : "article",
"_id" : "yLFEpD_uQLOqTuz1aetkWg",
"_score" : 1.0, "_source" : {"IDarticle":1,"designation":"La marque commercialise","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "e7EBYW_GQAy--9sqvY8-CA",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "QZ5m96PSR72DJYbQusc7XQ",
"_score" : 1.0, "_source" : {"IDarticle":3,"designation":"la production de collage","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "u6-9DBV7SOmuI0vIh1r8ZQ",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "MNiy_xoiTDubO8_95cs7Xg",
"_score" : 1.0, "_source" : {"IDarticle":7,"designation":"excel training","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "T-YdEBS9QhixgMP9tY2t7A",
"_score" : 1.0, "_source" : {"IDarticle":8,"designation":"added at 9:37","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "FBMwENQbTiWBTIJCuqLHeg",
"_score" : 1.0, "_source" : {"IDarticle":1,"designation":"La marque commercialise","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "ZIr-bfOqRQWDGpMyamXuQw",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "iZYJBXmnRsqITzBQU8wUIw",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "TcC5Va-5RzKIkNhQcxB0xw",
"_score" : 1.0, "_source" : {"IDarticle":6,"designation":"freshly added","IDuserSSO":1}
} ]
}
}

As you can see, I have duplicates in there. IDarticle=2 is there 4 times, IDarticle=1 is there twice... In my table Article, I of course dont have duplicates.

Could you also describe the difference between the strategy simple and table ? When should I user which one and why ?

Thanks for you help.

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(benkunz) #3

great ! that did the trick. too bad I did not read it the doc. But is it
even mentioned ?

David, do you know if I can run manual updates even with strategy = simple ?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #4

Jörg has done a real nice work on documentation:

See: https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns
About strategies: https://github.com/jprante/elasticsearch-river-jdbc/wiki/Strategies

That said, I think that you can run a manual update using another jdbc river with the same index/type target but with a oneshot strategy.

HTH

Le 31 janv. 2013 à 13:13, benkunz benoit.kunz@me.com a écrit :

great ! that did the trick. too bad I did not read it the doc. But is it even mentioned ?

David, do you know if I can run manual updates even with strategy = simple ?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #5

Thanks David, for the helpful comments!

Manual triggering the river will be possible soon, now it's an
undocumented feature, it's a HTTP POST command.

Best regards,

Jörg

Am 31.01.13 14:07, schrieb David Pilato:

Jörg has done a real nice work on documentation:

See:
https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns
About strategies:
https://github.com/jprante/elasticsearch-river-jdbc/wiki/Strategies

That said, I think that you can run a manual update using another jdbc
river with the same index/type target but with a oneshot strategy.

HTH

Le 31 janv. 2013 à 13:13, benkunz <benoit.kunz@me.com
mailto:benoit.kunz@me.com> a écrit :

great ! that did the trick. too bad I did not read it the doc. But is
it even mentioned ?

David, do you know if I can run manual updates even with strategy =
simple ?

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearch+unsubscribe@googlegroups.com
mailto:elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(benkunz) #6

Would you share with me the undocumented HTTP POST command ?
Benoit

Le 31 janv. 2013 à 15:30, Jörg Prante joergprante@gmail.com a écrit :

Thanks David, for the helpful comments!

Manual triggering the river will be possible soon, now it's an undocumented feature, it's a HTTP POST command.

Best regards,

Jörg

Am 31.01.13 14:07, schrieb David Pilato:

Jörg has done a real nice work on documentation:

See: https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns
About strategies: https://github.com/jprante/elasticsearch-river-jdbc/wiki/Strategies

That said, I think that you can run a manual update using another jdbc river with the same index/type target but with a oneshot strategy.

HTH

Le 31 janv. 2013 à 13:13, benkunz <benoit.kunz@me.com mailto:benoit.kunz@me.com> a écrit :

great ! that did the trick. too bad I did not read it the doc. But is it even mentioned ?

David, do you know if I can run manual updates even with strategy = simple ?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(brunopaiucao) #7

Hi benkunz

I´ve seen the same symptom. How do you work around this problem?

Thanks
Bruno Paiuca

On Thursday, January 31, 2013 8:39:02 AM UTC-2, benkunz wrote:

Hi,
I am running elasticsearch 0.20.3 and jdbc river 2.0.2.

My goal is to be able to update my index on a regular basis AND also
manually when I need to. But is it doable if I set up a schedule for the
update ?

For the regular basis update I've done this :

curl -XPUT "localhost:9200/_river/article/_meta" -d "{"""type""" :
"""jdbc""","""jdbc""" : {"""strategy""" : """simple""","""driver""" :
"""com.mysql.jdbc.Driver""","""url""" :
"""jdbc:mysql://mysqlserver:3306/dvpt""","""user""" :
"""root""","""password""" : """dvpt""","""sql""" : """select * from
Article""","""poll""" : """10s"""},"""index""" : {"""index""" :
"""jdbc""","""type""" : """article"""}}"

Now if I go to http://localhost:9200/jdbc/article/_search?pretty I get
this

{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 840,
"max_score" : 1.0,
"hits" : [ {
"_index" : "jdbc",
"_type" : "article",
"_id" : "yLFEpD_uQLOqTuz1aetkWg",
"_score" : 1.0, "_source" : {"IDarticle":1,"designation":"La marque commercialise","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "e7EBYW_GQAy--9sqvY8-CA",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "QZ5m96PSR72DJYbQusc7XQ",
"_score" : 1.0, "_source" : {"IDarticle":3,"designation":"la production de collage","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "u6-9DBV7SOmuI0vIh1r8ZQ",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "MNiy_xoiTDubO8_95cs7Xg",
"_score" : 1.0, "_source" : {"IDarticle":7,"designation":"excel training","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "T-YdEBS9QhixgMP9tY2t7A",
"_score" : 1.0, "_source" : {"IDarticle":8,"designation":"added at 9:37","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "FBMwENQbTiWBTIJCuqLHeg",
"_score" : 1.0, "_source" : {"IDarticle":1,"designation":"La marque commercialise","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "ZIr-bfOqRQWDGpMyamXuQw",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "iZYJBXmnRsqITzBQU8wUIw",
"_score" : 1.0, "_source" : {"IDarticle":2,"designation":"Ce composant issu de","IDuserSSO":1}
}, {
"_index" : "jdbc",
"_type" : "article",
"_id" : "TcC5Va-5RzKIkNhQcxB0xw",
"_score" : 1.0, "_source" : {"IDarticle":6,"designation":"freshly added","IDuserSSO":1}
} ]
}
}

As you can see, I have duplicates in there. IDarticle=2 is there 4 times, IDarticle=1 is there twice... In my table Article, I of course dont have duplicates.

Could you also describe the difference between the strategy simple and
table ? When should I user which one and why ?

Thanks for you help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #8

The JDBC river is far from perfect. For example, your suggestion is
fine, triggering a full refresh each half an hour, and get the current
generation of data into the index. The "simple" stragegy should be the
one to choose, but there are deficiences in the existing code, so I am
planning better strategies. My ideas so far:

  • there should be a method to create "generations" in ES where one
    generation of data is originating from a river poll at a certain point
    of time. I plan to achieve this by either a time stamp based poll where
    the document ES timestamp determines the generation membership, or
    index-wise with a generation counter/timestamp for a new index name
    (like logstash time-based indexing). The generation ID can be used to
    clean up old generations, and auto-switchover by index alias.

  • I plan to replace the interval waiting between runs with a better task
    scheduler (maybe a ScheduledExecutorService). In fact river polls should
    work like cron jobs configured by a crontab-like parameter. Each river
    poll should have to acquire a channel to run, which may be locked by
    currently active polls.

Also, I'm convinced these ideas could also be interesting for other
rivers because they are not specific to JDBC.

You see, such a plain task of moving data regurlarly to ES from an
external data source can easily grow and grow and I'm sure there is a
lot more to achieve :slight_smile:

Note, the "table strategy" is for the use case where a DBA wants to have
full control about the data being transferred in a DB tablem without
messing on the ES side with rivers and the like, but it is not
thoroughly tested.

If you can contribute other suggestions for a reliable JDBC river, just
let me know.

Jörg

Am 10.04.13 20:49, schrieb brunopaiucao@gmail.com:

Hi folks

I´ve seen the same symptom as benkunz. I´ve been try with Simple
Strategy because poll feature to update my index on a regular basis. I
tried use On-shot strategie but the index don´t update regular basis.
My RDBM table doesn´t refresh every time, but i want refresh the index
every half hour. I don´t know if i am right way, but i see something
informations about table strategy, but i can´t do it work.

What do you guess about a way that i due choose?
Is there any form to Simple strategy doesn´t duplicate data each poll
phase?
Do I need implement table strategy?
What is the HTTP Post command to refresh manually?

Jörg, congrat for the fine work.

Thanks
Bruno Paiuca

On Thursday, January 31, 2013 12:30:01 PM UTC-2, Jörg Prante wrote:

Thanks David, for the helpful comments!

Manual triggering the river will be possible soon, now it's an
undocumented feature, it's a HTTP POST command.

Best regards,

Jörg

Am 31.01.13 14:07, schrieb David Pilato:
> Jörg has done a real nice work on documentation:
>
> See:
>
https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns
<https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns>

> About strategies:
>
https://github.com/jprante/elasticsearch-river-jdbc/wiki/Strategies <https://github.com/jprante/elasticsearch-river-jdbc/wiki/Strategies>

>
> That said, I think that you can run a manual update using
another jdbc
> river with the same index/type target but with a oneshot strategy.
>
> HTH
>
>
> Le 31 janv. 2013 à 13:13, benkunz <benoi...@me.com
> <mailto:benoi...@me.com>> a écrit :
>
>> great ! that did the trick. too bad I did not read it the doc.
But is
>> it even mentioned ?
>>
>> David, do you know if I can run manual updates even with
strategy =
>> simple ?
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it,
>> send an email to elasticsearc...@googlegroups.com
>> <mailto:elasticsearch+unsubscribe@googlegroups.com>.
>> For more options, visit
https://groups.google.com/groups/opt_out
<https://groups.google.com/groups/opt_out>.
>>
>>
>
> --
> You received this message because you are subscribed to the Google
> Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from
it, send
> an email to elasticsearc...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out
<https://groups.google.com/groups/opt_out>.
>
>

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(appa) #9

Hey Jörg,

May be we can have JDBC integration or some kind of triggering using mysql log listener.

Thanks
--Appa


(system) #10