River-wikipedia does not index all pages

johtani · September 13, 2013, 5:26am

Hi all,

I use river-wikipedia to index the Japanese Wikipedia XML dump.
With the parameter "bulk_size" : 10000, the number of indexed documents is 1540000,
but the XML contains 1546721 pages.

I have a question after reading the WikipediaRiver.java source code.

Probably the reason is that the PageCallback.processBulkIfNeeded() method indexes documents only
when the number of buffered documents reaches bulkSize.

I suppose the WikipediaRiver.close() method should index the remaining documents,
but that method is only used when the river settings are deleted.
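
To illustrate the suspected behaviour, here is a minimal self-contained sketch. It is not the actual WikipediaRiver code: the class BufferedIndexerSketch and all of its methods are hypothetical, and flush() only counts documents instead of sending a bulk request. It assumes a batch is sent once the buffer holds bulkSize documents, which is consistent with the 1540000 figure above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the river's page buffer; not the real WikipediaRiver code. */
class BufferedIndexerSketch {
    private final int bulkSize;
    private final List<String> buffer = new ArrayList<>();
    private long indexed = 0;

    BufferedIndexerSketch(int bulkSize) {
        this.bulkSize = bulkSize;
    }

    /** Buffers one page and sends the batch only once it holds bulkSize documents. */
    void add(String page) {
        buffer.add(page);
        if (buffer.size() >= bulkSize) {
            flush();
        }
    }

    /** Sends whatever is buffered, even a partial batch. */
    void flush() {
        indexed += buffer.size(); // a real river would execute a bulk request here
        buffer.clear();
    }

    long indexed() {
        return indexed;
    }

    /** Feeds `pages` documents through the buffer and optionally flushes the tail. */
    static long run(long pages, int bulkSize, boolean flushAtEnd) {
        BufferedIndexerSketch indexer = new BufferedIndexerSketch(bulkSize);
        for (long i = 0; i < pages; i++) {
            indexer.add("page-" + i);
        }
        if (flushAtEnd) {
            indexer.flush(); // the step that seems to be missing
        }
        return indexer.indexed();
    }

    public static void main(String[] args) {
        System.out.println(run(1546721, 10000, false)); // prints 1540000
        System.out.println(run(1546721, 10000, true));  // prints 1546721
    }
}
```

Without the final flush, the last 6721 pages stay in the buffer, which is exactly the difference between the two numbers I see.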

What do you think about it?

Jun Ohtani
twitter : http://twitter.com/johtani

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · September 13, 2013, 6:05am

You're probably right. I fixed something like this in other rivers.
Could you open an issue in the wikipedia river project?

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

johtani · September 13, 2013, 6:25am

Hi David,

Thanks for the reply.

I opened issue #12 in the GitHub project.

I have another idea.
Currently, a bulk is processed only when the bulkSize check passes.
I suggest that PageCallback.processBulkIfNeeded also process the bulk after an interval specified by a parameter, like Solr's auto-commit.
But I have no idea how to implement this at the moment.

I will try to think about it a little as well.
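
Just to make the idea concrete, a very rough sketch of a size-or-interval flush could look like the following. Everything here is an assumption for illustration: the class IntervalFlushSketch, its methods, and the use of a ScheduledExecutorService are not taken from the river code, and flush() only counts documents instead of sending a bulk request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical buffer that flushes on size OR on a timer; names are illustrative only. */
class IntervalFlushSketch implements AutoCloseable {
    private final int bulkSize;
    private final List<String> buffer = new ArrayList<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private long flushed = 0;

    IntervalFlushSketch(int bulkSize, long flushIntervalMillis) {
        this.bulkSize = bulkSize;
        // Periodic flush, so a partial batch never waits much longer than one interval.
        timer.scheduleAtFixedRate(this::flush, flushIntervalMillis, flushIntervalMillis,
                TimeUnit.MILLISECONDS);
    }

    /** Buffers a document; a full buffer is flushed immediately. */
    synchronized void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= bulkSize) {
            flush();
        }
    }

    /** Called by add() and by the timer thread, hence synchronized. */
    synchronized void flush() {
        flushed += buffer.size(); // a real river would send a bulk request here
        buffer.clear();
    }

    synchronized long flushed() {
        return flushed;
    }

    /** Stops the timer and flushes whatever is still buffered. */
    @Override
    public void close() {
        timer.shutdownNow();
        flush();
    }

    /** Adds fewer documents than bulkSize and waits for the timer to pick them up. */
    static long flushedByTimer(int docs, int bulkSize, long intervalMillis) {
        IntervalFlushSketch sketch = new IntervalFlushSketch(bulkSize, intervalMillis);
        try {
            for (int i = 0; i < docs; i++) {
                sketch.add("doc-" + i);
            }
            long deadline = System.currentTimeMillis() + 20 * intervalMillis;
            while (sketch.flushed() < docs && System.currentTimeMillis() < deadline) {
                Thread.sleep(intervalMillis);
            }
            return sketch.flushed();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return sketch.flushed();
        } finally {
            sketch.timer.shutdownNow(); // stop the timer without the final flush
        }
    }

    public static void main(String[] args) {
        System.out.println(flushedByTimer(7, 10000, 50));
    }
}
```

The timer covers a slow trickle of documents, and close() covers the tail at the end of the dump, so neither case depends on the buffer ever becoming full.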

Jun Ohtani
blog : http://blog.johtani.info

dadoonet · September 13, 2013, 6:38am

We now have a nice BulkProcessor class which handles that properly.

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

johtani · September 13, 2013, 7:56am

Nice!

Thanks, David.
I will read the BulkProcessor class closely.

--

Jun Ohtani
blog : http://blog.johtani.info

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.